Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (an open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning abilities.
https://otter-ntu.github.io/
MIT License

GPUs not being recognised properly #319

Closed: StrangeTcy closed this issue 6 months ago

StrangeTcy commented 7 months ago
  1. We clone your repo
  2. We create a conda env based on the environment file provided
  3. We modify the finetuning script a bit so that it uses the free GPUs on our 8-GPU machine:

    #!/usr/bin/bash

    CUDA_VISIBLE_DEVICES=1,2,3,4 accelerate launch \
      --config_file=pipeline/accelerate_configs/accelerate_config_zero2.yaml \
      --num_processes=8 \
      --main_process_port=25000 \
      pipeline/train/instruction_following.py \
      --pretrained_model_name_or_path=adept/fuyu-8b \
      --training_data_yaml=./Demo_Data.yaml \
      --model_name=fuyu \
      --instruction_format=fuyu \
      --batch_size=8 \
      --gradient_accumulation_steps=2 \
      --num_epochs=3 \
      --external_save_dir=./checkpoints \
      --save_hf_model \
      --run_name=OtterHD_Tester \
      --wandb_project=Fuyu \
      --report_to_wandb \
      --workers=1 \
      --lr_scheduler=linear \
      --learning_rate=1e-5 \
      --warmup_steps_ratio=0.01 \
      --dynamic_resolution \
      --weight_decay 0.1

  4. We get the following error:

    RuntimeError: CUDA error: invalid device ordinal
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
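From what we can tell, "invalid device ordinal" means a process asked for a GPU index outside the set exposed by CUDA_VISIBLE_DEVICES. A quick check of how many devices the environment actually exposes (a minimal sketch, same env assumed):

    # With CUDA_VISIBLE_DEVICES=1,2,3,4, torch sees exactly 4 devices,
    # re-indexed as 0-3; asking for ordinal 4+ raises "invalid device ordinal".
    CUDA_VISIBLE_DEVICES=1,2,3,4 python -c "import torch; print(torch.cuda.device_count())"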



Is there anything wrong with PyTorch, or possibly with CUDA itself, or with accelerate?

Another question is why do you recommend using `accelerate_config_zero2` when `zero3` and `zero3_offload` are available as well?
StrangeTcy commented 7 months ago

OK, I was using CUDA_VISIBLE_DEVICES=3,4,5,6 and num_processes=8 at the same time, which was stupid: only 4 GPUs are visible, but accelerate tried to spawn 8 processes, one per device ordinal 0-7.
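For anyone else hitting this: accelerate spawns one process per local rank, so num_processes has to match the number of visible devices. A corrected invocation would look something like this (sketch, remaining flags as in the original script):

    # 4 visible GPUs -> 4 processes; local ranks 0-3 map onto the
    # devices listed in CUDA_VISIBLE_DEVICES.
    CUDA_VISIBLE_DEVICES=3,4,5,6 accelerate launch \
      --config_file=pipeline/accelerate_configs/accelerate_config_zero2.yaml \
      --num_processes=4 \
      --main_process_port=25000 \
      pipeline/train/instruction_following.py \
      ...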

The question about zero configs still stands, though

Luodian commented 7 months ago

I would recommend using zero2 if you have A100-80G GPUs, since it's much faster than zero3. If you only have 40G GPUs: I haven't trained the model on 40G GPUs myself, but I would recommend trying ZeRO Stage-3 to see whether multiple 40G GPUs can launch the model.
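Switching stages should just be a matter of pointing accelerate at the other config in the repo; assuming the sibling file is named accelerate_config_zero3.yaml, something like:

    # Same launch as before, but with the ZeRO Stage-3 config.
    # Stage-3 additionally shards the model parameters across GPUs,
    # trading throughput for a much smaller per-GPU memory footprint.
    accelerate launch \
      --config_file=pipeline/accelerate_configs/accelerate_config_zero3.yaml \
      --num_processes=4 \
      pipeline/train/instruction_following.py \
      ...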

Here's my log screenshot (even though this zero3 run is without CPU offload, you can see it's still much slower):

[screenshot: training logs comparing zero2 and zero3 speed]

The detailed differences are documented here: https://huggingface.co/docs/accelerate/usage_guides/deepspeed
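The relevant knob in those accelerate configs is the deepspeed_config block; an illustrative fragment (hypothetical values, not copied from the repo's actual files) for Stage-3 with CPU offload:

    # Illustrative accelerate config fragment, not the repo's actual file.
    distributed_type: DEEPSPEED
    deepspeed_config:
      zero_stage: 3                  # 2 = shard optimizer+grads, 3 = also shard params
      offload_optimizer_device: cpu  # set to "none" for plain zero3
      offload_param_device: cpu      # offloading saves GPU memory but costs speed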