fe1ixxu / ALMA

State-of-the-art LLM-based translation models.
MIT License
435 stars 35 forks

OOM issue, GPU is A100 40G #42

Open gongye19 opened 6 months ago

gongye19 commented 6 months ago

With llama factory I can SFT llama3-8B using deepspeed zero2, but with this framework it reports OOM even with the batch size set to 1 under deepspeed zero2. Training with zero3 becomes very slow and produces this message:

2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
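For reference, the flush that the warning suggests can be scheduled explicitly. A minimal sketch, assuming a hand-written training loop around a DeepSpeed engine (the `model_engine` / `train_dataloader` names are placeholders, not ALMA's code):

```python
# Minimal sketch of what the DeepSpeed warning suggests: flushing the
# allocator cache on every rank at the same point in the loop.
# Assumes a custom loop around a deepspeed.initialize() engine; with the
# HF Trainer you would hook this into a TrainerCallback instead.
from deepspeed.accelerator import get_accelerator

for step, batch in enumerate(train_dataloader):
    loss = model_engine(**batch).loss
    model_engine.backward(loss)
    model_engine.step()

    # Flush cached blocks periodically so all ranks release memory together,
    # trading some speed for headroom under high memory pressure.
    if step % 50 == 0:
        get_accelerator().empty_cache()
```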

fe1ixxu commented 6 months ago

The OOM issue could be because llama3 has a 128K vocab size, while llama2's is only 32K.
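To put the vocab-size difference in numbers, here is a back-of-envelope estimate (my own arithmetic, not from the repo) of the extra embedding and optimizer memory, assuming bf16 weights plus unsharded Adam states:

```python
# Rough estimate of the extra memory a 128K vocabulary costs versus 32K.
# Hidden size and vocab sizes are the public llama configs; bytes per
# parameter assume bf16 weight + fp32 master copy + Adam m and v states,
# before any ZeRO partitioning.
hidden = 4096

def embed_params(vocab: int, tied: bool = False) -> int:
    # input embedding table + (untied) lm_head projection
    return vocab * hidden * (1 if tied else 2)

llama2 = embed_params(32_000)      # ~0.26B params
llama3 = embed_params(128_256)     # ~1.05B params
bytes_per_param = 2 + 4 + 4 + 4    # bf16 weight + fp32 master + Adam m, v

extra_gb = (llama3 - llama2) * bytes_per_param / 1e9
print(f"extra embedding/optimizer memory: ~{extra_gb:.1f} GB per replica")
# The logits tensor (batch * seq_len * vocab) also grows 4x at the same time.
```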

gongye19 commented 6 months ago

> The OOM issue could be because llama3 has a 128K vocab size, while llama2's is only 32K.

I tried deepseek-7B and ran into the same problem.

fe1ixxu commented 6 months ago

The deepseek vocab size is also large -- about 100K. The memory I used for training llama-2 was 64GB, with 8/16 GPUs. You may want to try FSDP.
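For reference, a minimal sketch of the FSDP suggestion via the Hugging Face Trainer (not ALMA's actual launch config; the key names accepted in `fsdp_config` vary slightly across transformers versions):

```python
# Sketch of switching from DeepSpeed to PyTorch FSDP with the HF Trainer.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                    # hypothetical output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    bf16=True,
    gradient_checkpointing=True,         # reduces activation memory further
    fsdp="full_shard auto_wrap",         # shard params, grads, optimizer states
    fsdp_config={
        "transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"],
    },
)
```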

gongye19 commented 6 months ago

> The deepseek vocab size is also large -- about 100K. The memory I used for training llama-2 was 64GB, with 8/16 GPUs. You may want to try FSDP.

Thanks. Right now I do SFT with llama factory first and then run CPO in your framework. Is that workflow OK?

moore3930 commented 1 week ago

Technically, a single 80G GPU should be enough to fine-tune LLaMA 7B with LoRA, but it always seems to OOM under your codebase unless we use 8 GPUs. I am just wondering why it consumes so much memory.
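For anyone hitting the same wall, a hedged sketch (not the repo's code) of the usual knobs that make single-GPU 7B LoRA fine-tuning fit in 80G: LoRA adapters, bf16 weights, and gradient checkpointing. The base model name and LoRA hyperparameters below are illustrative assumptions.

```python
# Memory-saving setup for single-GPU LoRA fine-tuning of a 7B model.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",           # assumed base model
    torch_dtype=torch.bfloat16,
)
model.gradient_checkpointing_enable()      # trade compute for activation memory

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # typical LLaMA attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()         # only the adapter weights are trainable
```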