Hi, we didn't try training across different machines, but it should work. Suppose you have two machines, each with two 2080 Ti GPUs. Run this command on machine one:
python main.py --dist-url 'tcp://192.168.100.1:10107' --multiprocessing-distributed --world-size 2 --rank 0 \
-a resnet50 \
--lr 0.03 --batch-size 128 --epoch 200 \
--save-dir outputs/jigclu_pretrain/ \
--resume outputs/jigclu_pretrain/model_best.pth.tar \
--loss-t 0.3 \
--cross-ratio 0.3 \
datasets/ImageNet/
and then run the following on machine two:
python main.py --dist-url 'tcp://192.168.100.1:10107' --multiprocessing-distributed --world-size 2 --rank 1 \
-a resnet50 \
--lr 0.03 --batch-size 128 --epoch 200 \
--save-dir outputs/jigclu_pretrain/ \
--resume outputs/jigclu_pretrain/model_best.pth.tar \
--loss-t 0.3 \
--cross-ratio 0.3 \
datasets/ImageNet/
Replace "192.168.100.1" with the IP of machine one. We didn't check the above command, you might need to revise it to running normally.
Thanks!
Thank you very much for your help and your patience! I will try the commands you provided.
Best wishes!
Does world_size have the same meaning as nnodes, and does rank have the same meaning as node_rank?
Hi, I recently read your paper and find it very interesting. I still have some confusion about the experiments.
The experiments require four 2080 Ti GPUs for training. Does that mean all four 2080 Ti GPUs must be on a single machine? What if my four 2080 Ti GPUs are spread across different machines? Do you have any suggestions for this situation? Also, how long does training on ImageNet-1k take?
Many thanks in advance for your reply.
Best wishes!