dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Apache License 2.0

Program hangs when performing distributed training #92

Open mumu029 opened 10 months ago

mumu029 commented 10 months ago

I want to train the model on two servers, each with one GPU. But after I set up the configuration and ran it, the program got stuck in one place and stopped responding. I'm sure the program works when I train on a single server.

```
export MASTER_ADDR=192.168.1.12
export MASTER_PORT=17788
export NODE_RANK=0
```
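(Only node 0 is shown above. Presumably the second server exports the same rendezvous address and port but `NODE_RANK=1` before launching the identical command; this counterpart is an assumption, not part of the original report:)

```
# assumed setup on the second server (not shown above)
export MASTER_ADDR=192.168.1.12
export MASTER_PORT=17788
export NODE_RANK=1
```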

```
(py36tr108cu117) (base) cx@v100:~/ViLT-master$ python run.py with data_root=../../data/TrinityMultimodalTrojAI-main/data/clean/ num_gpus=1 num_nodes=2 task_finetune_vqa_randaug per_gpu_batchsize=64 load_path=../../data/model_weight/vilt_200k_mlm_itm.ckpt
WARNING - root - Changed type of config entry "max_steps" from int to NoneType
WARNING - ViLT - No observers have been added to this run
INFO - ViLT - Running command 'main'
INFO - ViLT - Started
Global seed set to 0
INFO - lightning - Global seed set to 0
GPU available: True, used: True
INFO - lightning - GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO - lightning - TPU available: None, using: 0 TPU cores
Using environment variable NODE_RANK for node rank (0).
INFO - lightning - Using environment variable NODE_RANK for node rank (0).
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO - lightning - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using native 16bit precision.
INFO - lightning - Using native 16bit precision.
Missing logger folder: result/finetune_vqa_randaug_seed0_from_vilt_200k_mlm_itm
WARNING - lightning - Missing logger folder: result/finetune_vqa_randaug_seed0_from_vilt_200k_mlm_itm
Global seed set to 0
INFO - lightning - Global seed set to 0
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
INFO - lightning - initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
INFO - root - Added key: store_based_barrier_key:1 to store for rank: 0
```

The program hangs at this point and never makes further progress.
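The log stops right after rank 0 registers with the store (`MEMBER: 1/2`), which suggests it is waiting for the second node to join the rendezvous. A minimal way to isolate this outside of ViLT and Lightning is the sketch below; it is an assumed debugging step, not part of the original report, and uses only `torch.distributed`. If it also hangs at the barrier, traffic between the two machines on that port is likely blocked:

```python
# check_rendezvous.py -- hypothetical helper, a minimal sketch.
# Run on both machines with the same MASTER_ADDR/MASTER_PORT,
# with RANK=0 on the first server and RANK=1 on the second.
import os

import torch.distributed as dist

dist.init_process_group(
    backend="gloo",        # gloo tests plain TCP reachability without involving NCCL
    init_method="env://",  # reads MASTER_ADDR and MASTER_PORT from the environment
    rank=int(os.environ["RANK"]),
    world_size=2,
)
print(f"rank {dist.get_rank()} joined the process group")
dist.barrier()             # blocks here until both ranks have arrived
print("rendezvous succeeded")
dist.destroy_process_group()
```

For example: `MASTER_ADDR=192.168.1.12 MASTER_PORT=17788 RANK=0 python check_rendezvous.py` on the first server, and the same command with `RANK=1` on the second.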