Hi, it is possible to train on a single GPU: just set --nproc_per_node=1.
Thanks for your help! But I just want to execute the script so that I can debug it line by line. Even with distributed set to 'False', the call train_stats = train(model, train_loader, optimizer, tokenizer, epoch, warmup_steps, device, lr_scheduler, config) raises an error on the last line of this function:
@torch.no_grad()
def concat_all_gather(tensor):
    """
    Performs all_gather operation on the provided tensors.
    Warning: torch.distributed.all_gather has no gradient.
    """
    tensors_gather = [torch.ones_like(tensor) for _ in range(torch.distributed.get_world_size())]
The details were: Default process group has not been initialized, please make sure to call init_process_group. So I guess I need to change something to skip the default process group initialization.
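The error above arises because torch.distributed.get_world_size() needs a default process group. One workaround sometimes used for debugging is to create a one-process group by hand before the model is built. The sketch below is only an illustration under stated assumptions: init_single_process_group and its port argument are hypothetical names that do not come from the RaSa or MARS code, and other distributed-only parts of the training script may still need attention.

```python
import os

import torch
import torch.distributed as dist


def init_single_process_group(port: str = "29500") -> None:
    """Hypothetical helper: create a one-process default group for debugging."""
    if dist.is_initialized():
        return  # a launcher such as torchrun already set things up
    # The default env:// rendezvous reads these two variables.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", port)
    # NCCL handles collectives on CUDA tensors; gloo is the CPU fallback.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, rank=0, world_size=1)
```

Calling init_single_process_group() once near the start of the entry point, before the first forward pass, should make get_world_size() return 1 inside the debugger.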
Could you please provide more info about your environment? I've just tried the code and I do not get any error. My launch script is:
source activate pytorch-GAN
python -m --nproc_per_node=1 --rdzv_endpoint= \ \
--config configs/PS_cuhk_pedes.yaml \
--output_dir output/cuhk-pedes/\
--eval_mAP \
--checkpoint /home/user/projects/MARS/checkpoint/ALBEF.pth
My conda env has the following packages (not all of them are mandatory; it is just a test env full of packages). Since I installed a newer version of the transformers package in this env, I had to replace each tokenizer_class with processor_class in one file.
Thanks for your kind help~ I'm sure that starting training with torch.distributed works, whether on a single GPU or several. But this standard way runs the program in the background, so we can't debug it line by line. Since I am not familiar with your code, I would prefer to run the training script in your project directly in VS Code, so that I can observe the details (just by starting the debugger and setting breakpoints). I set the args in launch.json as below:
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "launch T2I",
            "type": "debugpy",
            "request": "launch",
            "program": "${workspaceFolder}/RaSa/",
            "args": [
                "--config", "${workspaceFolder}/RaSa/configs/PS_cuhk_pedes.yaml",
                "--output_dir", "${workspaceFolder}/RaSa/output/cuhk-pedes/train",
                "--checkpoint", "${workspaceFolder}/models/ALBEF/ALBEF.pth",
                "--distributed", "false"
            ],
            // "DISPLAY": "localhost:10.0"
            "justMyCode": true
        }
    ]
}
I thought this would be enough to run, but I hit the error below. It seems I can only start training through torch.distributed, which really confuses me. The detailed error is:
Exception has occurred: RuntimeError
Default process group has not been initialized, please make sure to call init_process_group.
File "/media/data1/yanghao/RaSa/models/", line 278, in concat_all_gather
for _ in range(torch.distributed.get_world_size())]
File "/media/data1/yanghao/RaSa/models/", line 218, in _dequeue_and_enqueue
image_feats = concat_all_gather(image_feat)
File "/media/data1/yanghao/RaSa/models/", line 103, in forward
self._dequeue_and_enqueue(image_feat_m, text_feat_m, idx)
File "/media/data1/yanghao/RaSa/", line 49, in train
loss_cl, loss_pitm, loss_mlm, loss_prd, loss_mrtd = model(image1, image2, text_input1, text_input2,
File "/media/data1/yanghao/RaSa/", line 264, in main
train_stats = train(model, train_loader, optimizer, tokenizer, epoch, warmup_steps, device, lr_scheduler,
File "/media/data1/yanghao/RaSa/", line 331, in
Now it is clear: if you want to debug the code line by line, you have to modify it. The concat_all_gather method only works when the script is launched with torchrun, so that the default process group gets initialized. To debug it, you have to comment out every line of code connected to PyTorch DistributedDataParallel, which could be a lot of work!
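As a rough illustration of the kind of modification involved, the gather helper could be guarded so that it falls back to the local tensor when no process group exists. This is a minimal sketch, not the maintainers' code: concat_all_gather_safe and is_dist_ready are hypothetical names, and the body after the guard assumes the original helper finishes with an all_gather followed by a concatenation.

```python
import torch
import torch.distributed as dist


def is_dist_ready() -> bool:
    """True only when a default process group exists (e.g. under torchrun)."""
    return dist.is_available() and dist.is_initialized()


@torch.no_grad()
def concat_all_gather_safe(tensor: torch.Tensor) -> torch.Tensor:
    """Hypothetical guarded variant of concat_all_gather.

    Without a process group it returns the local tensor unchanged, which is
    what a gather across a single process would produce anyway.
    """
    if not is_dist_ready():
        return tensor
    tensors_gather = [torch.ones_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(tensors_gather, tensor, async_op=False)
    return torch.cat(tensors_gather, dim=0)
```

The same check would have to be applied wherever else the training script touches torch.distributed, which is why the amount of work is hard to estimate in advance.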
Thanks! Even though your code is based on RaSa, it is still beautifully written. I imagine that in the early stage of writing the code, torch.distributed was probably not involved, to make debugging and inspection easier. If so, may I ask whether you have any plans to make that early version of the code public? Thanks~
I am sorry, but I currently do not have a debug-ready version of the code, and we do not plan to publish one soon, as we are working on other projects.
Hi, thanks for your awesome work! I was experimenting with your code and found that there is no single-GPU training shell script available. So I executed the script directly to debug it, and found that the process group needs to be initialized. Does that mean single-GPU training is not supported by your original code? Is there a quick way to start training on one GPU?
Sincerely looking forward to your reply~