frankkramer-lab / MIScnn

A framework for Medical Image Segmentation with Convolutional Neural Networks and Deep Learning
GNU General Public License v3.0

How to run with multi-GPU? #44

Open ssean819 opened 3 years ago

ssean819 commented 3 years ago

Hi, I want to try running with multiple GPUs. But when I set the GPU number higher than 1, it outputs:

Warning: THIS FUNCTION IS DEPRECATED. It will be removed after 2020-04-01. Instructions for updating: Use tf.distribute.MirroredStrategy instead.

And training stops in epoch 1.

It seems we need to use MirroredStrategy for multi-GPU now: https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy

Will the next version be updated with MirroredStrategy? In the meantime, I am looking for a way to modify the code to use MirroredStrategy instead.
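For reference, a minimal MirroredStrategy setup looks like this. This is a sketch of the plain TensorFlow API, independent of MIScnn; it falls back to a single CPU replica when no GPU is visible:

```python
import tensorflow as tf

# Replicate the model across all visible GPUs (falls back to CPU if none)
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Model creation and compilation must happen inside the strategy scope
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="adam", loss="mse")
```

Calls to `model.fit` outside the scope then automatically distribute each batch across the replicas.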

Best regards.

muellerdo commented 3 years ago

Hey @ssean819,

you are absolutely right. Thank you for spotting this deprecated functionality.

I will replace the Keras multi_gpu model with the TensorFlow MirroredStrategy and release it in the next update once it's tested & ready.

Cheers, Dominik

Tasks

Related Commits: 1eb0a95d345a15f409e5ea764709893deb6a627c, a36716c8cc287b6e387101fbe7aed7e08c831216, f70d2b5c8368a0f52181495cea100243ea6a1cf2

Notes

You can now use MirroredStrategy in MIScnn by running something like this:

# Multi-GPU utilization (pp is a configured Preprocessor, sample_list the training samples)
nn = Neural_Network(preprocessor=pp, multi_gpu=True)
nn.train(sample_list, epochs=3)
ssean819 commented 3 years ago

Hi, thank you a lot for updating the multi-GPU function. But when I try to install miscnn 1.1.0, it seems some files are missing. The problem is below.

Collecting miscnn
  Using cached miscnn-1.1.0.tar.gz (55 kB)
    ERROR: Command errored out with exit status 1:
     command: 'C:\Users\sean\anaconda3\envs\py3.8\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\sean\\AppData\\Local\\Temp\\pip-install-5fjrmc1o\\miscnn\\setup.py'"'"'; __file__='"'"'C:\\Users\\sean\\AppData\\Local\\Temp\\pip-install-5fjrmc1o\\miscnn\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\sean\AppData\Local\Temp\pip-pip-egg-info-37htbhms'
         cwd: C:\Users\sean\AppData\Local\Temp\pip-install-5fjrmc1o\miscnn\
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\sean\AppData\Local\Temp\pip-install-5fjrmc1o\miscnn\setup.py", line 5, in <module>
        with open("docs/README.PyPI.md", "r") as fh:
    FileNotFoundError: [Errno 2] No such file or directory: 'docs/README.PyPI.md'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

I think the problem is that pip falls back to the .tar.gz source distribution instead of using a wheel (.whl).
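As a workaround until a wheel is available, pip can be told to refuse source distributions. This is standard pip behavior (the `--only-binary` flag), not anything MIScnn-specific:

```shell
# Fail instead of building from the .tar.gz if no wheel exists on PyPI
pip install --only-binary :all: miscnn
```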

ssean819 commented 3 years ago

Hi @muellerdo, I think some users may run into an NCCL problem when using multi-GPU. The error info looks like this: No OpKernel was registered to support Op 'NcclAllReduce'

This is because tf.distribute.MirroredStrategy() uses NCCL by default. We can change it to tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()), which solves the missing-NCCL problem.
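The suggested change, sketched against the TensorFlow API (constructing the strategy this way also works on a machine without NCCL or GPUs, so it is safe to try):

```python
import tensorflow as tf

# Use hierarchical copy instead of NCCL for cross-device gradient reduction
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
```

HierarchicalCopyAllReduce copies tensors through the host or across device links rather than calling into the NCCL library, which is why it avoids the 'NcclAllReduce' kernel error.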

muellerdo commented 3 years ago

Hey @ssean819,

But when I try to install miscnn1.1.0. It seems missing some files.

You are right. The wheel was missing on PyPI for some reason :O I uploaded it again and it should work now.

I think maybe someone would have NCCL problem when using multi-GPU. error info is like below

Thanks for the feedback! Will be changed to HierarchicalCopyAllReduce in the next version.

Cheers, Dominik

Tasks

Related Commits: 68eb07dd80fd5bb2f98dc8a2d07134dbe8dc3be6

ssean819 commented 3 years ago

Hi @muellerdo

Now when I test with multi-GPU, this problem occurs:

F .\tensorflow/core/kernels/conv_2d_gpu.h:1021] Non-OK-status: GpuLaunchKernel( SwapDimension1And2InTensor3UsingTiles<T, kNumThreads, kTileSize, kTileSize, conjugate>, total_tiles_count, kNumThreads, 0, d.stream(), input, input_dims, output) status: Internal: invalid configuration argument

It seems to be a TensorFlow problem, but I am not sure. Do you know how to fix this? I am trying to find a solution.

muellerdo commented 3 years ago

Hi @ssean819,

you are correct. This is a TensorFlow issue. Sadly, I'm unfamiliar with this error.

Nevertheless, these two issues suggest that it could have something to do with:

I tried to reproduce the error using odd batch numbers on 3 GPUs (batch size 10), but it works fine for me on the latest stable TensorFlow Docker image with 3x NVIDIA TITAN RTX. Are you working on a Windows system?
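For context on why odd batch numbers are a suspect: MirroredStrategy splits the global batch across replicas, so a global batch of 10 cannot be divided evenly over 3 GPUs. A quick illustrative sketch of such a split (the exact splitting behavior depends on the TensorFlow version, so treat this as an illustration, not the actual TensorFlow code path):

```python
def split_batch(global_batch, num_replicas):
    """Distribute a global batch as evenly as possible across replicas."""
    base, rem = divmod(global_batch, num_replicas)
    return [base + (1 if i < rem else 0) for i in range(num_replicas)]

print(split_batch(10, 3))  # -> [4, 3, 3]
```

Uneven per-replica batches like this have been a known trigger for GPU kernel launch errors in some TensorFlow versions.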

Cheers, Dominik

tslee69 commented 3 years ago

When I tried the MIScnn sample example (LCTSC) with the multi-GPU option on (Neural_Network(multi_gpu=True)), I got the following message right before Epoch 1 and the kernel restarted. After that, it cannot be run anymore. There is no modification in the sample code except the multi-GPU option. Is there any solution for using multi-GPU in MIScnn? I am using A100 GPUs with the latest versions of MIScnn, CUDA, and cuDNN. Thank you!!

Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new tf.data.Options() object then setting options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA before applying the options object to the dataset via dataset.with_options(options).
I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 1999955000 Hz
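The auto-sharding warning above can be addressed exactly as it suggests, via the public tf.data API (a sketch, not MIScnn-specific code; `Dataset.range(10)` is just a placeholder dataset):

```python
import tensorflow as tf

# Shard the distributed dataset by data elements instead of by input files
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = \
    tf.data.experimental.AutoShardPolicy.DATA
dataset = tf.data.Dataset.range(10).with_options(options)
```

Note this only silences the sharding warning; it is not necessarily the cause of the kernel crash.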

Epoch 1/100
INFO:tensorflow:batch_all_reduce: 82 all-reduces with algorithm = hierarchical_copy, num_packs = 1
INFO:tensorflow:batch_all_reduce: 82 all-reduces with algorithm = hierarchical_copy, num_packs = 1

Kernel Restarting - The kernel for LCTSC.ipynb appears to have died. It will restart automatically.

muellerdo commented 3 years ago

@tslee69, it seems like TensorFlow has added some more issues to its multi-GPU support for Keras since version 2.4.0 :/

Check out this: