Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Finetuning MM results in `RuntimeError: CUDA error: invalid device ordinal` #71

Closed. lukszam closed this issue 11 months ago.

lukszam commented 11 months ago

Thanks for the great repo! Is it possible to finetune MM on instruct-llava from this checkpoint: https://huggingface.co/Alpha-VLLM/LLaMA2-Accessory/tree/main/finetune/mm/caption_llamaQformerv2_13b/ using this script: https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/accessory/exps/finetune/mm/alpacaLlava_llamaQformerv2Peft_QF_13B.sh with a single A10G GPU (24GB VRAM)?

It seems like the script by default tries to distribute the load over 8 workers, and I'm getting: `RuntimeError: CUDA error: invalid device ordinal`.

Which settings/pre-trained model should I use?

kriskrisliu commented 11 months ago

Hi friend, you could try `--nproc_per_node=1` instead of `8`. `--nproc_per_node=8` means that we load 8 copies of the model onto 8 cards respectively. In general, `CUDA error: invalid device ordinal` means that the system could not find the other 7 cards, since you only have 1.
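For reference, a minimal sketch of the single-GPU launch, assuming the script follows the repo's usual `torchrun` pattern (the entry point `main_finetune.py` and the port value below are assumptions; check the exact launch line inside `alpacaLlava_llamaQformerv2Peft_QF_13B.sh`):

```bash
# Confirm how many GPUs PyTorch can actually see (should print 1 on this machine):
python -c "import torch; print(torch.cuda.device_count())"

# Restrict the job to the first (and only) GPU, then spawn a single worker.
# "main_finetune.py" is a placeholder for the script's real entry point.
export CUDA_VISIBLE_DEVICES=0
torchrun --nproc_per_node=1 --master_port=29500 main_finetune.py
```

Note that `--nproc_per_node` only controls how many worker processes `torchrun` spawns; lowering it avoids the ordinal error but does not reduce per-GPU memory usage.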

lukszam commented 11 months ago

Thanks @kriskrisliu, after applying your suggestion the process starts without errors.

funkyyyyyy commented 11 months ago

thanks @kriskrisliu