Hi @marcomameli1992
You can train on 1 GPU with the following command:
python -m torch.distributed.launch --nproc_per_node=1 main_swav.py
I recommend you adapt the learning rate to your actual batch size. Let me know if you need further guidance for any particular argument of the method.
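For reference, a minimal sketch of the usual linear scaling rule for adapting the learning rate to the batch size (the reference values below are illustrative assumptions, not necessarily the repo defaults):

# Hypothetical helper: rescale a learning rate linearly with the total batch size.
def scale_lr(reference_lr: float, reference_batch_size: int, actual_batch_size: int) -> float:
    return reference_lr * actual_batch_size / reference_batch_size

# Example: a recipe tuned with lr 4.8 at batch size 4096, scaled down to batch size 256.
print(scale_lr(4.8, 4096, 256))  # -> 0.3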
Hello, could you share your configuration for training on the COCO dataset? I trained on one GPU with batch size 64; the loss goes down to about 7.2 in the first 20 epochs, but then bounces back to 8.006.
Are you introducing the queue at epoch 20?
You can have a look at https://github.com/facebookresearch/swav#common-issues
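If the bounce coincides with the point where the queue kicks in, one way to check is to delay or disable the queue. Assuming the --queue_length and --epoch_queue_starts arguments exposed by main_swav.py (double-check they exist in your checkout), something like:
python -m torch.distributed.launch --nproc_per_node=1 main_swav.py --queue_length 0
or
python -m torch.distributed.launch --nproc_per_node=1 main_swav.py --epoch_queue_starts 60
would let you verify whether the queue is the cause.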
If we're just using one GPU, then do we need torch.distributed.launch and can we then avoid using apex?
Hi @tom-bu
Yes, even when training with only one GPU you need to launch the code with python -m torch.distributed.launch --nproc_per_node=1 main_swav.py
At the moment, the default is to use apex but you could easily remove any dependencies to apex. For example remove the following lines: https://github.com/facebookresearch/swav/blob/9a2dc8073884c11de691ffe734bd624a84ccd96d/main_swav.py#L22-L23 https://github.com/facebookresearch/swav/blob/9a2dc8073884c11de691ffe734bd624a84ccd96d/main_swav.py#L159-L163 https://github.com/facebookresearch/swav/blob/9a2dc8073884c11de691ffe734bd624a84ccd96d/main_swav.py#L177
replace https://github.com/facebookresearch/swav/blob/9a2dc8073884c11de691ffe734bd624a84ccd96d/main_swav.py#L348
with lr=optimizer.param_groups[0]["lr"],
Finally I would recommend using a lower learning rate without LARC (for example 0.03 for a total batch size of 256) and larger weight decay (1e-4).
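For reference, a minimal sketch of what the optimizer construction could look like once apex/LARC is removed, using the values suggested above (this is an illustration, not the exact repo code; the stand-in model is only there to keep the snippet self-contained):

import torch
import torchvision

# Stand-in network so the snippet runs on its own; in main_swav.py this would be
# the SwAV ResNet-50 built earlier in the script.
model = torchvision.models.resnet50()

# Plain SGD, no LARC wrapper: lr 0.03 for a total batch size of 256, weight decay 1e-4.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.03,            # scale this if your total batch size differs from 256
    momentum=0.9,
    weight_decay=1e-4,
)

# With LARC gone, the learning rate can be logged directly:
current_lr = optimizer.param_groups[0]["lr"]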
Hope that helps
Hi @mathildecaron31, on a similar "issue", I would like to know if you have any recommendations for speeding up training on one middle-end GPU (RTX 2070S). I think even the small-batch trainings were done with 4 (V100?) GPUs, and 102 hours on 4 V100s translates to roughly 34 days of training on one middle-end GPU, which is a lot for some configurations. I understood from the paper that SwAV is SOTA compared to SimCLR/MoCo even for small-batch training, but are there things you can recommend to speed up training on one GPU, even if it means reducing the final accuracy? Things like:
- using adam / radam / other optimizers? Do you recommend removing LARC/LARS for small batch training (when keeping apex)?
- using other hyperparameters? larger queue size? smaller epsilon?
- using smaller images for a higher batch size?
- using a smaller number of prototypes? a larger feature dimension?
The constraint would be: "How to reach the best accuracy with a ResNet-50 and one middle-end GPU in 3 days?", even if the best I can get is 50% top-1. I'm already trying things (I'm at ~30% top-1 in 15 epochs on one GPU), but you're surely more experienced with this model. In supervised learning I've managed to do things like that on my middle-end GPU.
Thanks if you have the time to answer!
Hi, have you tried removing LARS (e.g. using plain SGD)?
Hi @JrPeng, thanks for the reply. I've tried many hyperparameters, including removing LARS, and I didn't see a clear improvement from that (maybe I didn't do it well). I'm around ~50% top-1 after ~50 epochs now (single GPU), which is the target I was aiming for on my configuration. The main tips I could give to someone training on one GPU:
My goal is also "fast training": I never go past 50 epochs. Two more things I did that improved results when training fast: my training is usually stable when I don't freeze the prototypes (it seems freezing can still work in some configurations), and I slowly increase the queue size depending on how fast the algorithm is learning ("slowly improving" => "raise queue size"), starting right after 1000 iterations. I measure how fast the algorithm is learning by checking whether the closest prototype of one view of an image in the batch is also the closest prototype of another view of the same image; I do that over the entire batch and average. The loss isn't a relevant signal for this, even normalized by the number of prototypes, since epsilon changes the task.
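A minimal sketch of one way to compute such an agreement metric, under my reading of the description above (this is an illustration, not Whiax's actual code; the function name and tensor shapes are assumptions):

import torch

def prototype_agreement(emb_view1: torch.Tensor,
                        emb_view2: torch.Tensor,
                        prototypes: torch.Tensor) -> float:
    """Fraction of images whose two augmented views pick the same closest prototype.

    emb_view1, emb_view2: (batch, dim) L2-normalized embeddings of two crops of
    the same images; prototypes: (num_prototypes, dim) L2-normalized prototypes.
    """
    # Cosine similarity of each embedding against every prototype,
    # then the index of the most similar prototype per image.
    closest1 = (emb_view1 @ prototypes.t()).argmax(dim=1)
    closest2 = (emb_view2 @ prototypes.t()).argmax(dim=1)
    # Agreement = how often both views of an image agree on their closest prototype.
    return (closest1 == closest2).float().mean().item()

# Toy usage with random tensors; real embeddings would come from the SwAV forward pass.
z1 = torch.nn.functional.normalize(torch.randn(64, 128), dim=1)
z2 = torch.nn.functional.normalize(torch.randn(64, 128), dim=1)
protos = torch.nn.functional.normalize(torch.randn(3000, 128), dim=1)
print(prototype_agreement(z1, z2, protos))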
@Whiax, thank you for the reply. This is amazing.
@Whiax do you have a modified repo you could share?
No, I don't have a modified repo that I could quickly share, sorry @RylanSchaeffer. I modified the code a lot for my own application, and it would take some time to re-make a version that could be public.
Dear all, I would like to use the single GPU on my personal computer to test your code. Can you explain how to reproduce the training or test configuration?