dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Apache License 2.0

python run.py with data_root=content/datasets num_gpus=2 num_nodes=1 task_mlm_itm whole_word_masking=True step100k per_gpu_batchsize=64 #44

Open F-Yuan303 opened 2 years ago

F-Yuan303 commented 2 years ago

I encounter this when I pre-train with COCO:

WARNING - ViLT - No observers have been added to this run
INFO - ViLT - Running command 'main'
INFO - ViLT - Started
Global seed set to 0
INFO - lightning - Global seed set to 0
INFO - timm.models.helpers - Loading pretrained weights from url (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_p32_384-830016f5.pth)
GPU available: True, used: True
INFO - lightning - GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO - lightning - TPU available: None, using: 0 TPU cores
Using environment variable NODE_RANK for node rank ().
INFO - lightning - Using environment variable NODE_RANK for node rank ().
ERROR - ViLT - Failed after 0:00:06!
Traceback (most recent calls WITHOUT Sacred internals):
  File "run.py", line 67, in main
    val_check_interval=_config["val_check_interval"],
  File "/data/fyuan/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 41, in overwrite_by_env_vars
    return fn(self, **kwargs)
  File "/data/fyuan/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 359, in __init__
    deterministic,
  File "/data/fyuan/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 127, in on_trainer_init
    self.trainer.node_rank = self.determine_ddp_node_rank()
  File "/data/fyuan/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 415, in determine_ddp_node_rank
    return int(rank)
ValueError: invalid literal for int() with base 10: ''

csclimber commented 2 years ago

Have you solved it? I have the same bug.

F-Yuan303 commented 2 years ago

> Have you solved it? I have the same bug.

Not yet bro.

csclimber commented 2 years ago

I solved it bro! Don't forget to set these variables!

export MASTER_ADDR=$DIST_0_IP
export MASTER_PORT=$DIST_0_PORT
export NODE_RANK=$DIST_RANK
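The traceback fails at `int(rank)` because `NODE_RANK` is present but empty, so a pre-flight check before calling `run.py` can surface the problem with a clearer message. The following is only a sketch: `missing_dist_vars` and `check_dist_env` are hypothetical helper names, not part of ViLT or PyTorch Lightning, and the list of required variables is taken from the three exports in this comment.

```python
import os

# The three variables named in this thread.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "NODE_RANK")


def missing_dist_vars(env=None):
    """Return the required variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not str(env.get(name, "")).strip()]


def check_dist_env(env=None):
    """Raise a readable error instead of the bare int('') ValueError."""
    env = os.environ if env is None else env
    missing = missing_dist_vars(env)
    if missing:
        raise RuntimeError(
            "Unset or empty distributed variables: " + ", ".join(missing)
        )
    # NODE_RANK must parse as an integer, which is what the failing line expects.
    return int(env["NODE_RANK"])
```

Calling `check_dist_env()` at the top of a launch script would stop early if any of the three variables is missing or empty.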

F-Yuan303 commented 2 years ago

> I solved it bro! Don't forget to set these variables!
> export MASTER_ADDR=$DIST_0_IP
> export MASTER_PORT=$DIST_0_PORT
> export NODE_RANK=$DIST_RANK

it works, thanks!

dandelin commented 2 years ago

Nice job! :)

haoshuai714 commented 2 years ago

> I solved it bro! Don't forget to set these variables!
> export MASTER_ADDR=$DIST_0_IP
> export MASTER_PORT=$DIST_0_PORT
> export NODE_RANK=$DIST_RANK

If I use one machine with 8 GPUs, how should I set these variables?
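Nobody in this thread confirms values for the single-machine case, so treat the following as an assumption rather than the maintainers' answer. With one node, the usual convention for distributed PyTorch jobs is a loopback master address, any free port, and node rank 0. The sketch fills in those values only when a variable is unset or empty. `apply_single_node_defaults` is a hypothetical helper, and the port `29500` is an arbitrary choice.

```python
import os

# Assumed single-node values; none of these are confirmed in this thread.
SINGLE_NODE_DEFAULTS = {
    "MASTER_ADDR": "127.0.0.1",  # loopback: every process runs on this machine
    "MASTER_PORT": "29500",      # arbitrary; any free TCP port should do
    "NODE_RANK": "0",            # the only node is node 0
}


def apply_single_node_defaults(env=None):
    """Fill unset or empty variables with single-node defaults, in place."""
    env = os.environ if env is None else env
    for name, value in SINGLE_NODE_DEFAULTS.items():
        # An empty string counts as missing, since that is what broke int(rank).
        if not str(env.get(name, "")).strip():
            env[name] = value
    return env
```

Under these assumptions the launch command would keep `num_nodes=1` and use `num_gpus=8`, with the helper called before the trainer is constructed. Values that are already set are left untouched.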