I want to train the model on two servers with one GPU each. After I set up the configuration and launched the run, the program hung at one point and stopped responding. I'm sure the program works when I train on a single server.
(py36tr108cu117) (base) cx@v100:~/ViLT-master$ python run.py with data_root=../../data/TrinityMultimodalTrojAI-main/data/clean/ num_gpus=1 num_nodes=2 task_finetune_vqa_randaug per_gpu_batchsize=64 load_path=../../data/model_weight/vilt_200k_mlm_itm.ckpt
WARNING - root - Changed type of config entry "max_steps" from int to NoneType
WARNING - ViLT - No observers have been added to this run
INFO - ViLT - Running command 'main'
INFO - ViLT - Started
Global seed set to 0
INFO - lightning - Global seed set to 0
GPU available: True, used: True
INFO - lightning - GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO - lightning - TPU available: None, using: 0 TPU cores
Using environment variable NODE_RANK for node rank (0).
INFO - lightning - Using environment variable NODE_RANK for node rank (0).
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO - lightning - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using native 16bit precision.
INFO - lightning - Using native 16bit precision.
Missing logger folder: result/finetune_vqa_randaug_seed0_from_vilt_200k_mlm_itm
WARNING - lightning - Missing logger folder: result/finetune_vqa_randaug_seed0_from_vilt_200k_mlm_itm
Global seed set to 0
INFO - lightning - Global seed set to 0
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
INFO - lightning - initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
INFO - root - Added key: store_based_barrier_key:1 to store for rank: 0
Environment variables set on this server (node rank 0) before launching:
export MASTER_ADDR=192.168.1.12
export MASTER_PORT=17788
export NODE_RANK=0
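Both servers read these variables when the run starts, so it can help to print them on each machine and compare. The following is a minimal sketch, not part of ViLT or Lightning: the helper names `rendezvous_settings` and `describe` are made up for illustration, and it only checks the three variables shown in this post.

```python
import os

# Variable names taken from the post; PyTorch's env-based rendezvous reads them.
REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "NODE_RANK")


def rendezvous_settings(env=None):
    """Return (settings, missing) for the rendezvous variables found in `env`."""
    env = os.environ if env is None else env
    settings = {name: env[name] for name in REQUIRED if name in env}
    missing = [name for name in REQUIRED if name not in env]
    return settings, missing


def describe(env=None):
    """Format the settings as one line that is easy to compare across servers."""
    settings, missing = rendezvous_settings(env)
    parts = [f"{name}={settings[name]}" for name in REQUIRED if name in settings]
    if missing:
        parts.append("missing: " + ", ".join(missing))
    return " ".join(parts)
```

Running `print(describe())` on each server should show the same `MASTER_ADDR` and `MASTER_PORT` on both. The usual convention is for the second server to use `NODE_RANK=1`, though its settings are not shown in this post.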
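The program hangs after the last log line shown above (`Added key: store_based_barrier_key:1 to store for rank: 0`) and produces no further output.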
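The `MEMBER: 1/2` and `store_based_barrier_key` lines suggest that rank 0 has registered itself and is likely waiting for the second process to join. One thing worth ruling out is whether the second server can open a TCP connection to `MASTER_ADDR:MASTER_PORT` at all. Below is a minimal standard-library sketch for that; `can_reach` is a hypothetical helper, not a PyTorch or Lightning function, and a successful connection only shows the port is reachable, not that the rendezvous itself is healthy.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a plain TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

While rank 0 is waiting, calling `can_reach("192.168.1.12", 17788)` on the second server gives a quick signal. If it returns `False`, a firewall rule or a wrong address or port is a plausible cause. If it returns `True`, the next thing to check is whether the second server was launched with the same command and its own `NODE_RANK`.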