Closed · klnavaneet closed this issue 2 years ago
hi @klnavaneet, are you using mixed precision training (i.e., setting `use_fp16: true` in your config)?
Yes, I just tried it: `unsupervised_batch_size=32, classes_per_batch=70, supervised_imgs_per_cls=3, multicrop=6, unique_classes_per_rank=true` fits for me on a 16GB GPU with a ResNet-50, but it is important to set `use_fp16: true`.
Please try this and let me know if you still have issues.
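For reference, the settings above would look something like this in a YAML config (the key names come from this thread, but the exact file and nesting in the repo's configs are an assumption):

```yaml
# Hedged sketch; exact nesting in the repo's config files may differ.
use_fp16: true              # mixed precision; needed to fit on a 16GB GPU
unsupervised_batch_size: 32
classes_per_batch: 70
supervised_imgs_per_cls: 3
multicrop: 6
unique_classes_per_rank: true
```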
Hi, thanks for the quick reply! I do have `use_fp16: true` in my config, but I get an apex-related warning:
Warning: using Python fallback for SyncBatchNorm, possibly because apex was installed without --cuda_ext. The exception raised when attempting to import the cuda backend was: No module named 'syncbn'
Could this be the sole reason? I also see that a single training epoch with the above config takes over 4 hours on RTX 6000 GPUs.
I think this warning is saying that you installed apex without the cuda extension, but you need it to run the code, so I think this could be it? Maybe try uninstalling apex, and then re-installing it with the cuda extension, and see if you still get the issue?
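For reference, apex's README recommends installing from source with the CUDA extensions enabled; the `--cuda_ext` flag is what builds the compiled `syncbn` backend the warning mentions (the exact flags may change between apex versions, so check the README for your version):

```shell
# Reinstall apex with the compiled C++/CUDA extensions (per apex's README).
pip uninstall -y apex
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```

Note that the build must be done against the same CUDA toolkit version that your PyTorch install was built with, otherwise the extension will fail to compile or to import.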
I am facing compatibility issues between CUDA and apex and am currently unable to eliminate the warning. I will close the issue if I cannot solve it soon. Thanks for helping out!
oh I see, in the meantime i'll look into removing NVIDIA-apex from the code to use the newer version of PyTorch tools, maybe this could help you
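For what it's worth, the native PyTorch replacements for the two apex features involved here are `torch.cuda.amp` (for mixed precision) and `torch.nn.SyncBatchNorm` (for synchronized batch norm). A minimal sketch of what that substitution could look like, with a toy model and data that are purely illustrative:

```python
import torch
import torch.nn as nn

# Hedged sketch of replacing NVIDIA apex with native PyTorch tools:
#   apex.amp                    -> torch.cuda.amp.autocast + GradScaler
#   apex.parallel.SyncBatchNorm -> nn.SyncBatchNorm.convert_sync_batchnorm
# (the model, data, and hyperparameters below are made up for illustration)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
# In a multi-GPU DDP run you would also convert batch-norm layers, e.g.:
#   model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
# (SyncBatchNorm only runs on GPU inside an initialized process group)
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # no-op on CPU

x = torch.randn(4, 8, device=device)
y = torch.randint(0, 2, (4,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=use_cuda):  # fp16/bf16 forward on GPU
    loss = nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()  # scale loss to avoid fp16 gradient underflow
scaler.step(optimizer)
scaler.update()
```

This removes the apex dependency entirely, so the `syncbn` import warning (and the CUDA-extension build step) goes away.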
That would be great, thanks! I faced a number of dependency issues with apex, so this would be helpful.
I was able to resolve the apex warning, and the memory usage and training times are now similar to what was reported. Thanks for the help.
hi @klnavaneet! could you explain how you solved the apex warning? i've started using a new cluster and i'm having the same issues
@amandalucasp, I think I had a version mismatch between the cuda and apex installations and correcting that fixed the issue.
@klnavaneet yeah, from what I read online, it is the same issue I'm having. thanks for the response =)
Hi. I am trying to replicate the method on a system with 4 GPUs (24GB each). According to Table 7 of the paper, it is possible to run with the following config on 8 16GB GPUs:
unsupervised_batch_size=32, classes_per_batch=70, supervised_imgs_per_cls=3, multicrop=6, unique_classes_per_rank=true
I use the following configuration:
unsupervised_batch_size=32, classes_per_batch=35, supervised_imgs_per_cls=3, multicrop=6, unique_classes_per_rank=true
This setting occupies nearly 24GB on each of the 4 GPUs. Is there any reason why I see such high memory usage despite using a smaller configuration?