Closed · klnavaneet closed this issue 2 years ago
hi @klnavaneet, are you using mixed precision training (i.e., setting `use_fp16: true` in your config)?
Yes, I just tried it: `unsupervised_batch_size=32, classes_per_batch=70, supervised_imgs_per_cls=3, multicrop=6, unique_classes_per_rank=true` fits for me on a 16GB GPU with a ResNet-50, but it is important to set `use_fp16: true`.
Please try this and let me know if you still have issues.
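For reference, the settings above would look something like this in a YAML config (the key names come from this thread, but the exact file and nesting in the repo's configs are an assumption):

```yaml
# Hedged sketch; exact nesting in the repo's config files may differ.
use_fp16: true              # mixed precision; needed to fit on a 16GB GPU
unsupervised_batch_size: 32
classes_per_batch: 70
supervised_imgs_per_cls: 3
multicrop: 6
unique_classes_per_rank: true
```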
Hi, thanks for the quick reply! I do have `use_fp16: true` in my config, but I get an apex-related warning:
Warning: using Python fallback for SyncBatchNorm, possibly because apex was installed without --cuda_ext. The exception raised when attempting to import the cuda backend was: No module named 'syncbn'
Could this be the sole reason? I also see that a single training epoch with the above config takes over 4 hours on RTX 6000 GPUs.
I think this warning is saying that you installed apex without the cuda extension, but you need it to run the code, so I think this could be it? Maybe try uninstalling apex, and then re-installing it with the cuda extension, and see if you still get the issue?
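For reference, apex's README recommends installing from source with the CUDA extensions enabled; the `--cuda_ext` flag is what builds the compiled `syncbn` backend the warning mentions (the exact flags may change between apex versions, so check the README for your version):

```shell
# Reinstall apex with the compiled C++/CUDA extensions (per apex's README).
pip uninstall -y apex
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```

Note that the build must be done against the same CUDA toolkit version that your PyTorch install was built with, otherwise the extension will fail to compile or to import.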
I am facing compatibility issues between CUDA and apex and am currently unable to eliminate the warning. I will close the issue if I cannot solve it soon. Thanks for helping out!
oh I see, in the meantime i'll look into removing NVIDIA-apex from the code to use the newer version of PyTorch tools, maybe this could help you
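For what it's worth, the native PyTorch replacements for the two apex features involved here are `torch.cuda.amp` (for mixed precision) and `torch.nn.SyncBatchNorm` (for synchronized batch norm). A minimal sketch of what that substitution could look like, with a toy model and data that are purely illustrative:

```python
import torch
import torch.nn as nn

# Hedged sketch of replacing NVIDIA apex with native PyTorch tools:
#   apex.amp                    -> torch.cuda.amp.autocast + GradScaler
#   apex.parallel.SyncBatchNorm -> nn.SyncBatchNorm.convert_sync_batchnorm
# (the model, data, and hyperparameters below are made up for illustration)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
# In a multi-GPU DDP run you would also convert batch-norm layers, e.g.:
#   model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
# (SyncBatchNorm only runs on GPU inside an initialized process group)
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # no-op on CPU

x = torch.randn(4, 8, device=device)
y = torch.randint(0, 2, (4,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=use_cuda):  # fp16/bf16 forward on GPU
    loss = nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()  # scale loss to avoid fp16 gradient underflow
scaler.step(optimizer)
scaler.update()
```

This removes the apex dependency entirely, so the `syncbn` import warning (and the CUDA-extension build step) goes away.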
That would be great, thanks! I faced a number of dependency issues with apex, so this would be helpful.
I was able to resolve the apex warning, and the memory usage and training times are now similar to what was reported. Thanks for the help.
hi @klnavaneet! could you explain how you solved the apex warning? i've started using a new cluster and i'm having the same issues
@amandalucasp, I think I had a version mismatch between the cuda and apex installations and correcting that fixed the issue.
@klnavaneet yeah, from what I read online, it is the same issue I'm having. thanks for the response =)
Hi. I am trying to replicate the method on a system with 4 GPUs (24GB each). According to Table 7 of the paper, it is possible to run with the following config on 8 16GB GPUs:
unsupervised_batch_size=32, classes_per_batch=70, supervised_imgs_per_cls=3, multicrop=6, unique_classes_per_rank=true
I use the following configuration:
unsupervised_batch_size=32, classes_per_batch=35, supervised_imgs_per_cls=3, multicrop=6, unique_classes_per_rank=true
This setting occupies nearly 24GB on each of the 4 GPUs. Is there any reason why I see such high memory usage despite using a smaller configuration?