ggxxer opened 7 months ago
main_moco.py uses PyTorch multi-GPU distributed training. Please verify that your CUDA environment is set up correctly for distributed training.
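A quick way to sanity-check that setup before launching (this uses only the standard PyTorch API and is guarded so it also reports a missing install; it only verifies availability, not that `mp.spawn` will succeed with your launch arguments):

```python
# Minimal sanity check before launching multi-GPU distributed training.
try:
    import torch
    import torch.distributed as dist

    print("CUDA available:", torch.cuda.is_available())
    print("GPU count:", torch.cuda.device_count())
    print("NCCL backend available:", dist.is_nccl_available())
except ImportError:
    print("PyTorch is not installed")
```

For multi-GPU training you would want `CUDA available: True`, a GPU count matching your machine, and the NCCL backend available (NCCL is the usual backend for multi-GPU training on Linux).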
Thank you for your reply. After setting up my CUDA environment successfully for distributed training, I ran main_moco.py for nearly 12 hours. But the model seems to fail to load the TPS model: line 342, `print(f"TPS layer (freezed): {name}\n")`, never prints anything. I have finished all the steps according to your README.txt; can you give me any advice to solve this error?
Line 342 means that we load off-the-shelf pretrained TPS weights and freeze the TPS module during pre-training.
You should download the TPS model weights (TRBA-Baseline-synth.pth) from baiduyun (password: px16) and place them at pretrain/TPS_model/TRBA-Baseline-synth.pth.
I've already finished this step according to README.txt, but the problem persists.
Maybe you can check the key names in the TPS weights. The checkpoint is actually a dictionary, so you can inspect whether the key names match what the loading code expects.
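A minimal sketch of that check. The key names below are hypothetical stand-ins for a TRBA-style checkpoint; with the real file you would load the dictionary via `torch.load("pretrain/TPS_model/TRBA-Baseline-synth.pth")` (possibly unwrapping a `"state_dict"` entry, depending on how it was saved) and print its keys:

```python
# Fake checkpoint: checkpoints saved under nn.DataParallel carry a
# "module." prefix on every key, which is a common cause of silent
# load failures when the loading code matches on bare layer names.
checkpoint = {
    "module.Transformation.LocalizationNetwork.conv.0.weight": [0.0],
    "module.FeatureExtraction.ConvNet.0.weight": [0.0],
}

# If the loading loop selects TPS layers by a name prefix such as
# "Transformation", the leftover "module." prefix makes every match
# fail and the "TPS layer (freezed)" print never fires.
stripped = {k.replace("module.", "", 1): v for k, v in checkpoint.items()}
tps_keys = [k for k in stripped if k.startswith("Transformation")]
print(tps_keys)
```

If printing the real checkpoint's keys shows a prefix mismatch like this, stripping (or adding) the prefix before loading should make the TPS layers match.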
When main_moco.py runs to line 262, `mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))`, it reports an error: `raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))` — `RuntimeError: No rendezvous handler for ://`. Can you give me any advice to solve this error?
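For reference, this error comes from torch.distributed's rendezvous step: the init-method URL (the script's `args.dist_url`) is parsed with `urlparse` and dispatched on its scheme, so an empty or malformed dist-url yields the empty-scheme message `No rendezvous handler for ://`. A pure-standard-library sketch of that parsing (the `tcp://` address is just an example value):

```python
from urllib.parse import urlparse

# torch.distributed picks a rendezvous handler by the URL scheme
# (tcp, env, file). An empty URL parses to an empty scheme, which
# produces "No rendezvous handler for ://".
for url in ["", "tcp://127.0.0.1:10001", "env://"]:
    print(repr(url), "-> scheme:", repr(urlparse(url).scheme))
```

So the likely fix is to pass an explicit init method when launching, e.g. `--dist-url 'tcp://127.0.0.1:10001'` (the port is arbitrary, as long as it is free), or `--dist-url 'env://'` with `MASTER_ADDR` and `MASTER_PORT` set in the environment.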