passion3394 closed this issue 4 years ago
This top-1 accuracy is the (K+1)-way (e.g., K=65536) classification accuracy of the dictionary look-up in contrastive learning. It is not the ImageNet classification accuracy. Please finish the entire pipeline and check what you get. (Btw, for 4-GPU training, we recommend --lr 0.015 --batch-size 128, though what you are doing should be ok.)
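For context, in the MoCo code this (K+1)-way problem is set up so that the positive key is always class 0, and the printed Acc@1/Acc@5 score that dictionary look-up. A minimal sketch of the idea (simplified from the repo's moco/builder.py; the random tensors here are placeholders, not real features):

import torch
import torch.nn.functional as F

N, C, K, T = 256, 128, 65536, 0.2  # batch, feature dim, queue size, temperature

q = F.normalize(torch.randn(N, C), dim=1)      # query features
k = F.normalize(torch.randn(N, C), dim=1)      # positive key features
queue = F.normalize(torch.randn(C, K), dim=0)  # dictionary of K negative keys

l_pos = torch.einsum('nc,nc->n', q, k).unsqueeze(-1)  # (N, 1) positive logits
l_neg = torch.einsum('nc,ck->nk', q, queue)           # (N, K) negative logits
logits = torch.cat([l_pos, l_neg], dim=1) / T         # (N, K+1)

labels = torch.zeros(N, dtype=torch.long)  # positive key is always class 0
loss = F.cross_entropy(logits, labels)
acc1 = (logits.argmax(dim=1) == labels).float().mean() * 100  # the logged Acc@1

With random features this accuracy sits near chance (~1/(K+1)); it rises as training progresses, so a high value here says nothing about ImageNet top-1.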
@KaimingHe thanks for your reply. I found that training is very slow on my server: about 2.5 hours per epoch, so it would take more than 20 days to train for 200 epochs. This situation is different from your setting.
Re speed: see #33 and #5:
I believe your issue is not particularly related to this repo. Please try running the official PyTorch ImageNet training code, on which this repo is based, to check the data loading speed of your environment.
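One way to check the loading speed in isolation is to iterate over a DataLoader without any model. A minimal sketch, assuming an ImageFolder-style ImageNet directory (the path and loader settings are placeholders):

import time
import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms

traindir = '/path/to/imagenet/train'  # placeholder: your ImageNet train folder

dataset = datasets.ImageFolder(
    traindir,
    transforms.Compose([transforms.RandomResizedCrop(224), transforms.ToTensor()]))
loader = torch.utils.data.DataLoader(
    dataset, batch_size=256, shuffle=True, num_workers=8, pin_memory=True)

start = time.time()
for i, (images, _) in enumerate(loader):  # no model: measures IO + decoding only
    if i == 100:
        break
print('%.3f s per batch, data only' % ((time.time() - start) / 100))

If this loop alone is slow, the bottleneck is the storage or decoding pipeline, not the training code.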
Yes, the problem is the IO of my server, not this repo. I used several methods to track down the cause. Thanks for your tips.
@passion3394 Thank you for your question. What method did you use to solve the IO problem?
Epoch: [34][3590/4999] Time 0.426 ( 1.635) Data 0.000 ( 0.227) Loss 6.8926e+00 (6.9147e+00) Acc@1 73.44 ( 76.76) Acc@5 87.50 ( 87.55)
Epoch: [34][3600/4999] Time 0.437 ( 1.638) Data 0.000 ( 0.227) Loss 7.0694e+00 (6.9147e+00) Acc@1 59.38 ( 76.76) Acc@5 76.56 ( 87.55)
Epoch: [34][3610/4999] Time 0.432 ( 1.638) Data 0.000 ( 0.226) Loss 6.9074e+00 (6.9146e+00) Acc@1 78.12 ( 76.76) Acc@5 90.62 ( 87.55)
Epoch: [34][3620/4999] Time 0.423 ( 1.639) Data 0.000 ( 0.225) Loss 6.9464e+00 (6.9146e+00) Acc@1 71.88 ( 76.76) Acc@5 85.94 ( 87.55)
Epoch: [34][3630/4999] Time 0.436 ( 1.644) Data 0.000 ( 0.225) Loss 6.8364e+00 (6.9145e+00) Acc@1 81.25 ( 76.77) Acc@5 89.06 ( 87.56)
Epoch: [34][3640/4999] Time 0.425 ( 1.646) Data 0.000 ( 0.224) Loss 6.9520e+00 (6.9145e+00) Acc@1 71.88 ( 76.76) Acc@5 85.94 ( 87.56)
Epoch: [34][3650/4999] Time 0.426 ( 1.646) Data 0.000 ( 0.224) Loss 6.8319e+00 (6.9145e+00) Acc@1 84.38 ( 76.77) Acc@5 87.50 ( 87.56)
Epoch: [34][3660/4999] Time 0.428 ( 1.646) Data 0.000 ( 0.223) Loss 6.8066e+00 (6.9144e+00) Acc@1 75.00 ( 76.78) Acc@5 90.62 ( 87.57)
Epoch: [34][3670/4999] Time 0.471 ( 1.651) Data 0.000 ( 0.222) Loss 6.9694e+00 (6.9144e+00) Acc@1 78.12 ( 76.77) Acc@5 89.06 ( 87.57)
Epoch: [34][3680/4999] Time 0.431 ( 1.650) Data 0.000 ( 0.222) Loss 6.8628e+00 (6.9144e+00) Acc@1 81.25 ( 76.77) Acc@5 87.50 ( 87.57)
Epoch: [34][3690/4999] Time 0.428 ( 1.650) Data 0.000 ( 0.221) Loss 6.8666e+00 (6.9145e+00) Acc@1 81.25 ( 76.77) Acc@5 92.19 ( 87.56)
Epoch: [34][3700/4999] Time 0.434 ( 1.650) Data 0.000 ( 0.221) Loss 6.9402e+00 (6.9144e+00) Acc@1 71.88 ( 76.78) Acc@5 87.50 ( 87.57)
Epoch: [34][3710/4999] Time 0.434 ( 1.654) Data 0.000 ( 0.220) Loss 6.8522e+00 (6.9144e+00) Acc@1 81.25 ( 76.78) Acc@5 92.19 ( 87.57)
Epoch: [34][3720/4999] Time 0.421 ( 1.655) Data 0.000 ( 0.219) Loss 6.8393e+00 (6.9145e+00) Acc@1 79.69 ( 76.78) Acc@5 90.62 ( 87.57)
Epoch: [34][3730/4999] Time 0.426 ( 1.658) Data 0.000 ( 0.219) Loss 6.9804e+00 (6.9145e+00) Acc@1 68.75 ( 76.78) Acc@5 81.25 ( 87.57)
Epoch: [34][3740/4999] Time 0.424 ( 1.658) Data 0.000 ( 0.218) Loss 7.0028e+00 (6.9144e+00) Acc@1 75.00 ( 76.78) Acc@5 82.81 ( 87.57)
Epoch: [34][3750/4999] Time 0.438 ( 1.662) Data 0.000 ( 0.218) Loss 6.9528e+00 (6.9144e+00) Acc@1 75.00 ( 76.78) Acc@5 82.81 ( 87.57)
Epoch: [34][3760/4999] Time 0.423 ( 1.664) Data 0.000 ( 0.217) Loss 6.8455e+00 (6.9143e+00) Acc@1 76.56 ( 76.79) Acc@5 93.75 ( 87.57)
Epoch: [34][3770/4999] Time 0.430 ( 1.666) Data 0.000 ( 0.217) Loss 6.9374e+00 (6.9143e+00) Acc@1 81.25 ( 76.79) Acc@5 90.62 ( 87.57)
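A quick read of these numbers: the running-average Time is about 1.65 s per iteration over 4999 iterations per epoch, i.e. roughly 1.65 × 4999 ≈ 8250 s ≈ 2.3 hours per epoch, in line with the ~2.5 hours reported above. The instantaneous Time (~0.43 s) is far below the average, which typically points to intermittent data-loading stalls rather than uniformly slow compute.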
I used the following command to train on ImageNet with four 2080 Ti GPUs:
python main_moco.py -a resnet50 --mlp --moco-t 0.2 --aug-plus --cos --lr 0.015 --batch-size 256 --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 /job/large_dataset/open_datasets/ImageNet/
I suspect it is training in a supervised manner. Is there anything wrong with my experiment?
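As the first reply notes, these Acc@1/Acc@5 values come from the contrastive dictionary look-up, and the ImageNet accuracy is only obtained by finishing the pipeline, i.e. training a linear classifier on the frozen features with the repo's main_lincls.py. Roughly along the lines of the README (the checkpoint and dataset paths below are placeholders):

python main_lincls.py -a resnet50 --lr 30.0 --batch-size 256 --pretrained checkpoint_0199.pth.tar --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 /path/to/imagenet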