DachengLi1 / LongChat

Official repository for LongChat and LongEval
Apache License 2.0

How to use 3090 to train 16k model? #4

Open aresa7796 opened 1 year ago

aresa7796 commented 1 year ago

I have 80k supervised examples, but only a 3090 graphics card. How can I use a 3090 to train a 16k model?

musabgultekin commented 1 year ago

While it can technically work, it's probably going to take too much VRAM and will be horribly slow. Check out: https://huggingface.co/docs/transformers/perf_train_gpu_one
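For a single 24 GB card, the main lever in that guide is DeepSpeed ZeRO-3 with CPU offloading. A minimal sketch of such a config, written as a Python dict that can be passed to Hugging Face `TrainingArguments(deepspeed=...)` — the specific stage and batch-size values here are assumptions to tune, not tested settings:

```python
# Hypothetical ZeRO-3 CPU-offload config for a memory-constrained GPU.
# Offloading optimizer state and parameters to CPU trades speed for VRAM.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "fp16": {"enabled": True},          # 3090 has no efficient bf16 path in this setup
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": "auto",
}
```

Pass this dict via `TrainingArguments(deepspeed=ds_config, ...)` or save it as a JSON file and point `--deepspeed` at it.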

DachengLi1 commented 1 year ago

@aresa7796 The current code assumes 8xA100 40GB. I think a 3090 should be able to run it after applying some systems techniques. If we can support training on 3090 (or other non-A100) GPUs, that would be really amazing. We just haven't gotten to it yet; can you try it and share some of your feedback? Here are the steps I think should work:

(1) Use DeepSpeed ZeRO offloading, as shared by @musabgultekin; (2) Change the monkey patch from flash attention to xFormers by calling this function. xFormers provides memory-efficient attention that supports non-A100 GPUs, and I already have the monkey patch implemented. :P (3) Change bf16 to fp16 in the training command (and delete the tf32 argument as well).

Let me know if this works for you!
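For anyone unfamiliar with how the patch in step (2) works, the mechanism is just reassigning the attention class's `forward` method before the model is instantiated. A toy sketch of the pattern with dummy names (in the real patch, the replacement forward would call `xformers.ops.memory_efficient_attention` instead of returning a string):

```python
# Toy illustration of the monkey-patch mechanism using stand-in classes.
class LlamaAttention:
    def forward(self, x):
        return "flash-attn path"

def xformers_forward(self, x):
    # In the real patch this would run xFormers memory-efficient attention.
    return "xformers path"

# Reassign the method on the class before any model object is created;
# every instance built afterwards picks up the new forward.
LlamaAttention.forward = xformers_forward

print(LlamaAttention().forward(None))  # -> xformers path
```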

lucasjinreal commented 1 year ago

I am also wondering about this. For instance, a V100 might not be able to fit 2048 at all; if I use 1024 and apply condensed rotary embeddings at a ratio of 16, will that work? How well?

DachengLi1 commented 1 year ago

@lucasjinreal Condensing the rotary embeddings does not reduce memory; it only lets the model keep good quality at 16K.
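To make the "quality at 16K, not memory" point concrete: condensing only rescales the rotary angle as a function of position, so per-token compute and activation memory are unchanged. A minimal sketch of the angle computation, where the function name and `ratio` argument are illustrative rather than the repo's actual API:

```python
def rope_angle(pos, i, dim, base=10000.0, ratio=1.0):
    # Condensed RoPE: divide the position index by `ratio`, so position
    # ratio * p maps to the same angle the model saw at position p during
    # pretraining. Memory cost is identical to standard RoPE (ratio=1).
    return (pos / ratio) / base ** (2 * i / dim)

# With ratio 16, position 16384 lands on the angle of pretraining position 1024:
assert rope_angle(16384, 0, 128, ratio=16.0) == rope_angle(1024, 0, 128)
```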

lucasjinreal commented 1 year ago

@DachengLi1 What I mean is, a V100 cannot even fit a modest length like 2048 in most cases.

DachengLi1 commented 1 year ago

@lucasjinreal I see, thanks! Condensing will be great; I believe it should work from 1024 to 8192, say. But you will still need to fine-tune a bit at the longer length after condensing. Perhaps you can resort to an A100 for that adaptation part?

lucasjinreal commented 1 year ago

@DachengLi1 Hi, I'd like to discuss a bit more: have you tried comparing your method with ALiBi on extrapolation ability?