Open ssean819 opened 3 years ago
Hey @ssean819,
you are absolutely right. Thank you for spotting this deprecated functionality.
I will replace the Keras multi_gpu_model with the TensorFlow MirroredStrategy and release it in the next update once it's tested & ready.
Cheers, Dominik
Related Commits: 1eb0a95d345a15f409e5ea764709893deb6a627c, a36716c8cc287b6e387101fbe7aed7e08c831216, f70d2b5c8368a0f52181495cea100243ea6a1cf2
You can now use MirroredStrategy in MIScnn if you run something like this:
# Multi GPU utilization
nn = Neural_Network(preprocessor=pp, multi_gpu=True)
nn.train(self.sample_list2D, epochs=3)
Hi, thank you very much for updating the multi-GPU function. But when I try to install miscnn 1.1.0, it seems some files are missing. The problem is below:
Collecting miscnn
Using cached miscnn-1.1.0.tar.gz (55 kB)
ERROR: Command errored out with exit status 1:
command: 'C:\Users\sean\anaconda3\envs\py3.8\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\sean\\AppData\\Local\\Temp\\pip-install-5fjrmc1o\\miscnn\\setup.py'"'"'; __file__='"'"'C:\\Users\\sean\\AppData\\Local\\Temp\\pip-install-5fjrmc1o\\miscnn\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\sean\AppData\Local\Temp\pip-pip-egg-info-37htbhms'
cwd: C:\Users\sean\AppData\Local\Temp\pip-install-5fjrmc1o\miscnn\
Complete output (5 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\sean\AppData\Local\Temp\pip-install-5fjrmc1o\miscnn\setup.py", line 5, in <module>
with open("docs/README.PyPI.md", "r") as fh:
FileNotFoundError: [Errno 2] No such file or directory: 'docs/README.PyPI.md'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
I think the problem is that the .tar.gz is not converted to a .whl.
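Until the wheel is available on PyPI again, one way to avoid the failing sdist build is to ask pip for binary distributions only; this errors out cleanly if no wheel exists instead of attempting the broken source build (a workaround sketch, not an official fix):

```shell
# Refuse source builds; install only if a prebuilt wheel is available.
pip install --only-binary :all: miscnn==1.1.0
```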
Hi @muellerdo
I think some users may run into an NCCL problem when using multi-GPU.
The error info is like below:
error: No OpKernel was registered to support Op 'NcclAllReduce'
This is because tf.distribute.MirroredStrategy() uses NCCL by default. We can change it to tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()), which solves the missing-NCCL problem.
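The suggested workaround can be sketched as a minimal standalone example. Note that MIScnn applies the strategy internally, so the toy model and compile step below are illustrative placeholders, not MIScnn code:

```python
import tensorflow as tf

# MirroredStrategy defaults to NCCL for cross-device reductions; on
# systems without NCCL (e.g. Windows), swap in HierarchicalCopyAllReduce.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
)

# Model creation and compilation must happen inside the strategy scope
# so that variables are mirrored across the selected devices.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
```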
Hey @ssean819,
But when I try to install miscnn1.1.0. It seems missing some files.
You are right. The wheel was missing on PyPI for some reason :O I uploaded it again and it should work now.
I think maybe someone would have NCCL problem when using multi-GPU. error info is like below
Thanks for the feedback! Will be changed to HierarchicalCopyAllReduce in the next version.
Cheers, Dominik
Related Commits: 68eb07dd80fd5bb2f98dc8a2d07134dbe8dc3be6
Hi @muellerdo
Now when I test with multi-GPU, this problem occurs:
F .\tensorflow/core/kernels/conv_2d_gpu.h:1021] Non-OK-status: GpuLaunchKernel( SwapDimension1And2InTensor3UsingTiles<T, kNumThreads, kTileSize, kTileSize, conjugate>, total_tiles_count, kNumThreads, 0, d.stream(), input, input_dims, output) status: Internal: invalid configuration argument
It seems to be a TensorFlow problem, but I am not sure. Do you know how to fix this? I am trying to find a solution.
Hi @ssean819,
you are correct. This is a Tensorflow issue. Sadly I'm unfamiliar with this error.
Nevertheless, these two issues suggest that it could have something to do with:
I tried to reproduce the error using odd batch numbers with 3 GPUs (batch size 10), but it works fine for me on the latest stable TensorFlow Docker image with 3x NVIDIA TITAN RTX. Are you working on a Windows system?
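For anyone debugging similar multi-GPU shape errors, checking how the global batch splits across replicas is a cheap first step; this is only a hedged sketch of that check, since the exact cause in this thread was never confirmed:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
replicas = strategy.num_replicas_in_sync  # e.g. 3 with 3 GPUs, 1 on CPU-only

# With MirroredStrategy the global batch is divided across replicas,
# so an evenly divisible global batch size avoids uneven final shards.
global_batch_size = 12
per_replica = global_batch_size // replicas
assert global_batch_size % replicas == 0, "batch does not split evenly"
```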
Cheers, Dominik
When I tried the MIScnn sample example (LCTSC) with the multi-GPU option enabled (Neural_Network(multi_gpu=True)), I got the following message right before Epoch 1 and the kernel restarted. After that, it cannot run anymore. There is no modification in the sample code except the multi-GPU option. Is there any solution for using multi-GPU in MIScnn? I am using A100 GPUs with the latest versions of MIScnn, CUDA, and cuDNN. Thank you!!
Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new tf.data.Options() object, then setting options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA before applying the options object to the dataset via dataset.with_options(options).
I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 1999955000 Hz
Epoch 1/100 INFO:tensorflow:batch_all_reduce: 82 all-reduces with algorithm = hierarchical_copy, num_packs = 1 INFO:tensorflow:batch_all_reduce: 82 all-reduces with algorithm = hierarchical_copy, num_packs = 1
Kernel Restarting - The kernel for LCTSC.ipynb appears to have died. It will restart automatically.
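The workaround the warning describes can be sketched on a toy dataset as follows. MIScnn builds its own data pipeline internally, so applying this to the sample notebook would require patching its generator; the dataset below is a stand-in:

```python
import tensorflow as tf

# Switch auto-sharding from the default FILE-based policy to DATA, as
# the warning suggests, before distributing the dataset.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = (
    tf.data.experimental.AutoShardPolicy.DATA
)

dataset = tf.data.Dataset.range(8).batch(4)
dataset = dataset.with_options(options)
```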
@tslee69, it seems like TensorFlow has introduced some more issues with its multi-GPU support for Keras since version 2.4.0 :/
Check out this:
Hi, I want to try running multi-GPU, but when I set the GPU number to more than 1, it outputs:
Warning: THIS FUNCTION IS DEPRECATED. It will be removed after 2020-04-01. Instructions for updating: Use tf.distribute.MirroredStrategy instead.
And training stops in epoch 1.
It seems we need to use MirroredStrategy for multi-GPU now. https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy
Will the next version be updated with MirroredStrategy? Meanwhile, I am trying to find a way to modify the code to use MirroredStrategy instead.
Best regards.
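For reference, the migration the deprecation warning asks for looks roughly like this; `build_model` is a hypothetical placeholder for whatever constructs the network in MIScnn:

```python
import tensorflow as tf

def build_model():
    # Hypothetical stand-in for the actual network construction.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(4, input_shape=(2,)),
        tf.keras.layers.Dense(1),
    ])

# Old (deprecated): model = tf.keras.utils.multi_gpu_model(model, gpus=2)
# New: build and compile inside a MirroredStrategy scope instead, so
# variables are replicated and gradients are reduced across GPUs.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = build_model()
    model.compile(optimizer="adam", loss="mse")
```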