NVlabs / stylegan3

Official PyTorch implementation of StyleGAN3

Enabling Usage of multiple GPUs if you encounter stalling at "training __ kimgs ..." #218

Closed KryptixOne closed 1 year ago

KryptixOne commented 1 year ago

When attempting to run this implementation of StyleGAN3 on multiple GPUs, you may find training stalls. This happens because the process on GPU 0 holds a lock on bias_act.pyd and the other StyleGAN3 custom CUDA kernel plugins, blocking the remaining GPU processes from loading them.

To fix this and enable training, create duplicate, standalone copies of these plugins.

I did this in the following way for 2 GPUs:

In custom_ops.py, starting after line 138:


            try:
                # Normal path: build/load the cached plugin and import it.
                torch.utils.cpp_extension.load(name=module_name, build_directory=cached_build_dir,
                    verbose=verbose_build, sources=cached_sources, **build_kwargs)
                module = importlib.import_module(module_name)
            except Exception:
                # The primary plugin is held by another process (usually GPU 0),
                # so fall back to the duplicated "_2" copy created below.
                print('forcing secondary ' + module_name)
                torch.utils.cpp_extension.load(name=module_name + '_2', build_directory=cached_build_dir,
                    verbose=verbose_build, sources=cached_sources, **build_kwargs)
                module = importlib.import_module(module_name + '_2')

Now navigate to the following directory:

C:\Users\"YourUserName"\AppData\Local\torch_extensions\torch_extensions\Cache\py38_cu116\

In that directory you will find the three plugin folders that were created during your initial attempt. Create copies of these folders in the same directory, appending "_2" to each folder name.

Now, when you run the model with multiple GPUs, any process whose plugin load fails will fall back to the secondary plugin you just created and succeed.

Feel free to automate this process if you are running with more than 2 GPUs, but this workaround should work as-is. Note that the initial setup may take about 1-5 minutes longer than before.
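For anyone who wants to script the copying step, here is a rough sketch. The cache path is an assumption: point CACHE_DIR at your own torch_extensions cache, and extend the suffix scheme if you add further fallbacks in custom_ops.py.

    # Duplicate every cached plugin folder with a "_2" suffix so the fallback
    # path in custom_ops.py has its own copy to load.
    # CACHE_DIR is an assumed path -- adjust it to your own setup.
    import shutil
    from pathlib import Path

    CACHE_DIR = Path.home() / 'AppData' / 'Local' / 'torch_extensions' / 'torch_extensions' / 'Cache' / 'py38_cu116'

    for plugin_dir in CACHE_DIR.iterdir():
        if plugin_dir.is_dir() and not plugin_dir.name.endswith('_2'):
            target = plugin_dir.with_name(plugin_dir.name + '_2')
            if not target.exists():
                shutil.copytree(plugin_dir, target)
                print(f'copied {plugin_dir.name} -> {target.name}')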

nuclearsugar commented 1 year ago

Would you be able to share your edited version of custom_ops.py? I'm having some trouble getting it to work and think I'm missing something.

KryptixOne commented 1 year ago

Hi @nuclearsugar,

Sure, the version I used is here.

I should mention that if you stop the script while the C++ extensions are being imported, they can become corrupted. At that point you will need to clear your torch_extensions folder, build the extensions again with GPU=1, and then, once they are built, follow the above instructions and run again with GPU=2.
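If you do need to clear the cache after an interrupted import, something along these lines should do it (same assumed cache path as in the earlier sketch; it simply deletes the build folders so they are recompiled on the next run):

    # Remove the cached extension builds so PyTorch recompiles them from scratch.
    # CACHE_DIR is the same assumed path as above -- adjust to your setup.
    import shutil
    from pathlib import Path

    CACHE_DIR = Path.home() / 'AppData' / 'Local' / 'torch_extensions' / 'torch_extensions' / 'Cache' / 'py38_cu116'
    shutil.rmtree(CACHE_DIR, ignore_errors=True)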

nuclearsugar commented 1 year ago

Thanks for sharing the edited version of custom_ops.py. That fixed it!