Hi @marcomameli1992
You can train on 1 GPU with the following command:
python -m torch.distributed.launch --nproc_per_node=1 main_swav.py
I recommend you adapt the learning rate to your actual batch size. Let me know if you need further guidance for any particular argument of the method.
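For reference, a minimal sketch of the usual linear scaling rule for adapting the learning rate to the batch size (the reference values below are illustrative assumptions, not necessarily the repo defaults):

# Hypothetical helper: rescale a learning rate linearly with the total batch size.
def scale_lr(reference_lr: float, reference_batch_size: int, actual_batch_size: int) -> float:
    return reference_lr * actual_batch_size / reference_batch_size

# Example: a recipe tuned with lr 4.8 at batch size 4096, scaled down to batch size 256.
print(scale_lr(4.8, 4096, 256))  # -> 0.3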
Hello, could you share your configuration for training on the COCO dataset? I trained on one GPU with batch size 64; the loss goes down to about 7.2 in the first 20 epochs, but then bounces back to 8.006.
Are you introducing the queue at epoch 20?
You can have a look at https://github.com/facebookresearch/swav#common-issues
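If the bounce coincides with the point where the queue kicks in, one way to check is to delay or disable the queue. Assuming the --queue_length and --epoch_queue_starts arguments exposed by main_swav.py (double-check they exist in your checkout), something like:
python -m torch.distributed.launch --nproc_per_node=1 main_swav.py --queue_length 0
or
python -m torch.distributed.launch --nproc_per_node=1 main_swav.py --epoch_queue_starts 60
would let you verify whether the queue is the cause.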
If we're just using one GPU, then do we need torch.distributed.launch and can we then avoid using apex?
Hi @tom-bu
Yes, even when training with only one GPU you need to launch the code with python -m torch.distributed.launch --nproc_per_node=1 main_swav.py
At the moment, the default is to use apex but you could easily remove any dependencies to apex. For example remove the following lines: https://github.com/facebookresearch/swav/blob/9a2dc8073884c11de691ffe734bd624a84ccd96d/main_swav.py#L22-L23 https://github.com/facebookresearch/swav/blob/9a2dc8073884c11de691ffe734bd624a84ccd96d/main_swav.py#L159-L163 https://github.com/facebookresearch/swav/blob/9a2dc8073884c11de691ffe734bd624a84ccd96d/main_swav.py#L177
replace https://github.com/facebookresearch/swav/blob/9a2dc8073884c11de691ffe734bd624a84ccd96d/main_swav.py#L348
with lr=optimizer.param_groups[0]["lr"],
Finally I would recommend using a lower learning rate without LARC (for example 0.03 for a total batch size of 256) and larger weight decay (1e-4).
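For reference, a minimal sketch of what the optimizer construction could look like once apex/LARC is removed, using the values suggested above (this is an illustration, not the exact repo code; the stand-in model is only there to keep the snippet self-contained):

import torch
import torchvision

# Stand-in network so the snippet runs on its own; in main_swav.py this would be
# the SwAV ResNet-50 built earlier in the script.
model = torchvision.models.resnet50()

# Plain SGD, no LARC wrapper: lr 0.03 for a total batch size of 256, weight decay 1e-4.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.03,            # scale this if your total batch size differs from 256
    momentum=0.9,
    weight_decay=1e-4,
)

# With LARC gone, the learning rate can be logged directly:
current_lr = optimizer.param_groups[0]["lr"]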
Hope that helps
Hi @mathildecaron31, on a similar "issue", I would like to know if you have any recommendations for speeding up training on one middle-end GPU (RTX 2070S). I think even the small-batch trainings were done with 4 (V100?) GPUs, and 102 hours on 4 V100s translates to roughly 34 days of training on one middle-end GPU, which is a lot for some configurations. I understood from the paper that SwAV is SOTA compared to SimCLR/MoCo even for small-batch training, but are there things you can recommend to speed up training on one GPU, even if it means reducing the final accuracy? Things like:
- using adam / radam / other optimizers? Do you recommend removing LARC/LARS for small batch training (when keeping apex)?
- using other hyperparameters? larger queue size? smaller epsilon?
- using smaller images for a higher batch size?
- using a smaller number of prototypes? a larger feature dimension?
The constraint would be: "How to reach the best accuracy with a ResNet-50 and one middle-end GPU in 3 days?", even if the best I can get is 50% top-1. I'm already trying things (I'm at ~30% top-1 in 15 epochs on one GPU), but you're surely more experienced with this model. In supervised learning I've managed to do things like that on my middle-end GPU.
Thanks if you have the time to answer!
Hi, have you tried removing LARS (e.g. using plain SGD)?
Hi @JrPeng, thanks for the reply. I've tried many hyperparameters, including removing LARS, and I didn't see a clear improvement from that (maybe I didn't do it well). I'm around ~50% top-1 after ~50 epochs now (single GPU), which is the target I was aiming for on my configuration. The main tips I could give to someone training on one GPU:
My goal is also "fast training": I never go past 50 epochs. Two more things I did that improved results when training fast: my training is usually stable when I don't freeze the prototypes (it seems freezing can still work in some configurations), and I slowly increase the queue size depending on how fast the algorithm is learning ("slowly improving" => "raise queue size"), starting right after 1000 iterations. I measure how fast the algorithm is learning by checking whether the closest prototype of one view of an image in the batch is also the closest prototype of another view of the same image; I do that over the entire batch and average. The loss isn't a relevant signal for this, even normalized by the number of prototypes, since epsilon changes the task.
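A minimal sketch of one way to compute such an agreement metric, under my reading of the description above (this is an illustration, not Whiax's actual code; the function name and tensor shapes are assumptions):

import torch

def prototype_agreement(emb_view1: torch.Tensor,
                        emb_view2: torch.Tensor,
                        prototypes: torch.Tensor) -> float:
    """Fraction of images whose two augmented views pick the same closest prototype.

    emb_view1, emb_view2: (batch, dim) L2-normalized embeddings of two crops of
    the same images; prototypes: (num_prototypes, dim) L2-normalized prototypes.
    """
    # Cosine similarity of each embedding against every prototype,
    # then the index of the most similar prototype per image.
    closest1 = (emb_view1 @ prototypes.t()).argmax(dim=1)
    closest2 = (emb_view2 @ prototypes.t()).argmax(dim=1)
    # Agreement = how often both views of an image agree on their closest prototype.
    return (closest1 == closest2).float().mean().item()

# Toy usage with random tensors; real embeddings would come from the SwAV forward pass.
z1 = torch.nn.functional.normalize(torch.randn(64, 128), dim=1)
z2 = torch.nn.functional.normalize(torch.randn(64, 128), dim=1)
protos = torch.nn.functional.normalize(torch.randn(3000, 128), dim=1)
print(prototype_agreement(z1, z2, protos))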
@Whiax, thank you for the reply. This is amazing.
@Whiax do you have a modified repo you could share?
No, I don't have a modified repo that I could quickly share, sorry @RylanSchaeffer. I modified the code a lot for my own application, and it would take some time to re-make a version that could be public.
Dear all, I would like to use the single GPU on my personal computer to test your code. Can you explain how to reproduce the training or test configuration?