Closed StrangeTcy closed 6 months ago
Ok, I was using CUDA_VISIBLE_DEVICES=3,4,5,6
and num_processes=8
at the same time, which was stupid.
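For anyone hitting the same thing: a quick sanity-check sketch of the mismatch (assuming, as accelerate does by default, one process per visible GPU — the helper name is made up for illustration):

```python
def check_launch(visible_devices: str, num_processes: int) -> bool:
    """Return True if num_processes fits in CUDA_VISIBLE_DEVICES.

    `invalid device ordinal` is what you get when a process tries to
    use a GPU index beyond the visible list.
    """
    n_visible = len([d for d in visible_devices.split(",") if d.strip()])
    return num_processes <= n_visible

# The failing combination from this issue: 4 visible GPUs, 8 processes.
print(check_launch("3,4,5,6", 8))  # False -> invalid device ordinal
# A consistent combination:
print(check_launch("3,4,5,6", 4))  # True
```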
The question about zero configs still stands, though
I would recommend using ZeRO-2 if you have A100-80G GPUs, since it's much faster than ZeRO-3. But if you don't have 80G GPUs, say only 40G ones: although I didn't manage to get the model trained on a 40G GPU myself, I would recommend trying ZeRO Stage-3 to see whether multiple 40G GPUs can launch the model.
Here's my log screenshot (even though this ZeRO-3 run is without CPU offload, you can see it's still much slower):
The detailed difference is here: https://huggingface.co/docs/accelerate/usage_guides/deepspeed
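For reference, a minimal sketch of what an accelerate DeepSpeed config like `accelerate_config_zero2.yaml` might contain (field names follow accelerate's DeepSpeed plugin, but the exact values here are assumptions — verify against your accelerate version; switch `zero_stage: 2` to `zero_stage: 3` for the ZeRO-3 variant):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2
  offload_optimizer_device: none
  offload_param_device: none
  gradient_accumulation_steps: 2
mixed_precision: bf16
num_machines: 1
num_processes: 4
```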
```shell
CUDA_VISIBLE_DEVICES=1,2,3,4 accelerate launch \
  --config_file=pipeline/accelerate_configs/accelerate_config_zero2.yaml \
  --num_processes=8 \
  --main_process_port=25000 \
  pipeline/train/instruction_following.py \
  --pretrained_model_name_or_path=adept/fuyu-8b \
  --training_data_yaml=./Demo_Data.yaml \
  --model_name=fuyu \
  --instruction_format=fuyu \
  --batch_size=8 \
  --gradient_accumulation_steps=2 \
  --num_epochs=3 \
  --external_save_dir=./checkpoints \
  --save_hf_model \
  --run_name=OtterHD_Tester \
  --wandb_project=Fuyu \
  --report_to_wandb \
  --workers=1 \
  --lr_scheduler=linear \
  --learning_rate=1e-5 \
  --warmup_steps_ratio=0.01 \
  --dynamic_resolution \
  --weight_decay 0.1
```
```
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```