wlfxy closed this issue 10 months ago
The image was not uploaded correctly, so I cannot see the parameters. Why not run the scripts directly, following the instructions?
I only have one GPU, and running sh run_CVUSA.sh directly seems to require multiple GPUs. When I run sh run_cvusa.sh directly, I get an error. The error is: work = _default_pg.barrier() RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370172916/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
The README says: You may need to specify the GPUs for training in "train.py". Remove the second line if you want to train the simple stage-1 model. Change the "--dataset" to train on other datasets. The code follows the multiprocessing distributed training style from PyTorch and Moco, but it only uses one GPU by default for training. This paragraph seems to mean that a single GPU is used, but the command line itself appears to launch distributed training across multiple GPUs.
It does not require multiple GPUs. In train.py, we use os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID" and os.environ["CUDA_VISIBLE_DEVICES"] = "0" to assign one GPU.
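For reference, here is a minimal sketch of that single-GPU assignment. It only shows the two environment variables mentioned above and is not copied from train.py; the helper visible_gpu_ids is a hypothetical name added for illustration. These variables generally need to be set before CUDA is initialized, so the safest place is before importing torch.

```python
import os

# Order devices by PCI bus ID so index 0 matches what nvidia-smi reports.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# Expose only physical GPU 0 to this process; change the index to pick another GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"


def visible_gpu_ids():
    """Hypothetical helper: return the GPU indices listed in CUDA_VISIBLE_DEVICES."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [part.strip() for part in raw.split(",") if part.strip()]


print(visible_gpu_ids())
```

With only one index listed, the process sees a single device, which it addresses as device 0 regardless of the physical index.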
Hello? I want to ask some questions!
Hello, I would like to ask about training on CVUSA. I used one GPU with the following settings: gpu set to 1, lr set to 0.0001, batch-size set to 32, dist-url set to 'tcp://localhost:10001', world-size set to 1, rank set to 0, epochs set to 100, op set to sam, wd set to 0.03, dataset set to cvusa, cos set to True, dim set to 1000, asam set to True, rho set to 2.5. But the result of the first stage is very bad. Did I make a mistake somewhere? I took a screenshot of the specific parameter settings. Thank you. ![Uploading 屏幕截图 2023-11-23 231824.png…]()