check-face / checkface

Putting a face to a hash
https://checkface.facemorph.me

NCCL library unavailable on Windows #33

Open cdilga opened 4 years ago

cdilga commented 4 years ago

Training StyleGAN on multiple GPUs requires NCCL, which is not available on Windows. The code uses a custom scheme to reduce the gradients across devices and then update each device, and this scheme does not resemble the higher-level APIs that TensorFlow exposes.

This causes an error like:

tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by node TrainD/SumAcrossGPUs/NcclAllReduce (defined at D:\data\oliver-train-checkface\fflowhq\00005-sgan-flower-1gpu\src\dnnlib\tflib\optimizer.py:135) with these attrs: [reduction="sum", shared_name="c124", T=DT_FLOAT, num_devices=2]

No drop-in replacement has been found: the API of generic TensorFlow operations such as HierarchicalAllReduce, used by Keras (see https://github.com/tensorflow/tensorflow/issues/21470), is not compatible with the nccl_ops.py interface: https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/python/ops/nccl_ops.py

Perhaps more surprisingly, other ops, such as those in collective_ops.py (https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/python/ops/collective_ops.py), do not provide drop-in replacements either. These ops serve quite different use cases, as their respective tests make clear: https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/python/ops/nccl_ops_test.py https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/python/ops/collective_ops_test.py

The line that needs to be updated or removed appears to be: https://github.com/check-face/checkface/blob/a88dab03b5803c8c020279bb1d5ab556fc1c3665/src/server/dnnlib/tflib/optimizer.py#L135 This is the point at which the gradients from all devices are summed before each device is updated. Higher-level APIs like HierarchicalAllReduce would handle this entire process, including updating each device, but they are not well suited to this use case.
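For reference, the semantics that line relies on, a sum all-reduce, are simple even though no Windows-friendly op implements them: every device contributes its gradients and every device receives the elementwise sum. A minimal sketch in plain Python (illustrative only; the function name is hypothetical and this is not the repository's code):

```python
def all_reduce_sum(device_tensors):
    """Sketch of NcclAllReduce's sum semantics.

    `device_tensors` is a list with one flat gradient list per device.
    Returns a list of the same shape: every device gets the elementwise
    sum across all devices.
    """
    summed = [sum(vals) for vals in zip(*device_tensors)]
    # NCCL writes the reduced tensor back to every participating device.
    return [list(summed) for _ in device_tensors]

# Two devices, each holding gradients for the same two variables.
grads = [[0.5, -1.0], [1.5, 3.0]]
print(all_reduce_sum(grads))  # [[2.0, 2.0], [2.0, 2.0]]
```

A plausible (untested) CPU-side fallback for Windows would be to compute this sum on a single device, e.g. with tf.add_n under a /cpu:0 device scope, and copy the result back to each GPU, trading NCCL's peer-to-peer bandwidth for compatibility.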

@olivercoad