caikit / caikit-nlp


Multi-gpu prompt tuning hanging when running in kube cluster #271

Open gkumbhat opened 9 months ago

gkumbhat commented 9 months ago

Description

We are using the torch distributed elastic launch method to kick off training in a multi-GPU, single-node environment. It works fine when running locally, i.e., on a machine that has multiple GPUs available, and it also works fine on a single GPU, but in the kube cluster it hangs when we provide the WORLD_SIZE, MASTER_ADDR, and MASTER_PORT parameters. There seems to be some issue with the master address/port configuration: the process tries to connect to the GPU but keeps waiting.

Run command:

ALLOW_DOWNLOADS=true  WORLD_SIZE=2 RANK=0 MASTER_ADDR=localhost MASTER_PORT=25590  python3 run_peft_tuning.py PROMPT_TUNING --dataset "glue/rte"  --model_name google/flan-t5-xl --num_epochs 1 --verbose --prompt_tuning_init TEXT  --output_dir prompt_prefixes/flan_t5_xl_1_epoch_rte_16_batch_1_acc_hf_trainer --learning_rate 0.3 --batch_size=16 --accumulate_steps 1 --max_target_length 512 --max_source_length 2048 --torch_dtype bfloat16

Relevant code to launch the training:
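(The actual snippet is not reproduced here. As a rough sketch only, torch's elastic launch API is typically driven along these lines; the entry point and arguments below are placeholders, not caikit-nlp's real code.)

# Sketch only: placeholder entry point, not caikit-nlp's actual launch code
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def train_entrypoint(train_args):
    # per-rank training work would go here
    ...

config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=2,                 # one worker process per GPU
    rdzv_backend="c10d",
    rdzv_endpoint="localhost:25590",  # must be reachable by every rank
)
elastic_launch(config, train_entrypoint)({"model_name": "google/flan-t5-xl"})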

MEllis-github commented 9 months ago

@gkumbhat Are you running a second instance of that command anywhere else, or what is the rationale for setting WORLD_SIZE to 2? From a cursory glance, the first process could be waiting for a second one that was never started. Also, from the links provided, it appears caikit is using torch multiprocessing... Are WORLD_SIZE, RANK, MASTER_ADDR, or MASTER_PORT set in the environment prior to running this command, and if so, what are their values?
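(For reference, a minimal sketch of why a lone rank would hang, assuming the training path eventually calls init_process_group with the env:// method implied by those variables; this is illustrative, not caikit code.)

import os
import torch.distributed as dist

# Show the variables exactly as this process sees them
for var in ("WORLD_SIZE", "RANK", "MASTER_ADDR", "MASTER_PORT"):
    print(f"{var}={os.environ.get(var, '<unset>')}")

# With WORLD_SIZE=2, this call blocks until a second process (RANK=1)
# joins the same MASTER_ADDR:MASTER_PORT rendezvous
dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} is up")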

gkumbhat commented 9 months ago

@MEllis-github I was setting WORLD_SIZE to 2 to allow two processes to use the 2 GPUs, one each; that should still work, right? 🤔

I am setting these as environment variables. I tried a couple of values for MASTER_ADDR:

  1. localhost - since everything is running within the same pod; this worked on local machines with 2 GPUs
  2. <hostname for the pod> - thinking that it might be connecting to a lower-level CUDA process, which might recognize the pod by its name

For MASTER_PORT, I tried several different values, in case there was a port conflict.
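(One way to rule out a port conflict from inside the pod, as a standalone diagnostic sketch rather than anything in caikit-nlp, is to check whether something is already listening on the chosen MASTER_PORT:)

import socket

addr, port = "localhost", 25590  # values from the run command above

# connect_ex returns 0 if something is already listening on the port,
# otherwise an errno (e.g. ECONNREFUSED means the port is free to bind)
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(5)
    result = s.connect_ex((addr, port))
print("already in use" if result == 0 else f"free (connect_ex={result})")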