OpenAccess-AI-Collective / axolotl

Go ahead and axolotl questions
https://openaccess-ai-collective.github.io/axolotl/
Apache License 2.0

Adopt qlora-pipe approaches #1679

Open kallewoof opened 1 month ago

kallewoof commented 1 month ago

⚠️ Please check that this feature request hasn't been suggested before.

🔖 Feature description

Recently I have been spending a lot of time playing with https://github.com/tdrussell/qlora-pipe and it has some features that are downright amazing.

  1. The merge-lora script does not load the model into memory, period. It just iterates through each of the bin or safetensors shards and applies the lora to each module as needed. It's extremely efficient compared to the standard approach (see the sketch after this list).
  2. The deepspeed integration is really good. I have never gotten deepspeed to work doing qlora training (in fact, most web searches claim it isn't even possible), but with qlora-pipe it just works out of the box, no hassles. In fact, I have never even gotten deepspeed to work with axolotl, so there's something to be learned there either way.
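For reference, here is a minimal sketch of what a shard-by-shard merge can look like. This is illustrative only: `merge_shard`, the argument names, and the key-matching scheme are my own assumptions, not qlora-pipe's actual code.

```python
# Hypothetical sketch of a shard-by-shard LoRA merge: only one safetensors shard
# is held in memory at a time, matching LoRA deltas are applied, and the shard is
# written back out. Names here are illustrative, not qlora-pipe's implementation.
import torch
from safetensors.torch import load_file, save_file

def merge_shard(shard_path, out_path, lora_A, lora_B, scale):
    tensors = load_file(shard_path)          # load just this shard
    for name, weight in tensors.items():
        key = name.removesuffix(".weight")
        if key in lora_A:                    # this module has a LoRA adapter
            # W' = W + scale * (B @ A), computed in fp32 then cast back
            delta = scale * (lora_B[key].float() @ lora_A[key].float())
            tensors[name] = (weight.float() + delta).to(weight.dtype)
    save_file(tensors, out_path)

# lora_A / lora_B map module names to adapter matrices; scale = lora_alpha / r.
# Repeating this over every shard merges the full model without ever loading it whole.
```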

✔️ Solution

Learn from and adopt qlora-pipe's code design decisions in areas where this makes sense.

❓ Alternatives

No response

📝 Additional Context

No response


winglian commented 4 weeks ago

@kallewoof thanks for surfacing this. There's definitely a lot going on here, and it might be useful to figure out which are the important takeaways and which can be integrated in a sustainable manner.

I think we can integrate the merge_lora changes behind a new flag that uses that methodology instead of the standard method. I think we would want to keep the original merging script for adapters that aren't purely lora.

kallewoof commented 4 weeks ago

@kallewoof thanks for surfacing this. There's definitely a lot going on here, and it might be useful to figure out which are the important takeaways and which can be integrated in a sustainable manner.

I think we can integrate the merge_lora changes behind a new flag that uses that methodology instead of the standard method. I think we would want to keep the original merging script for adapters that aren't purely lora.

Good point on keeping both. It would be great if it defaulted to the qlora-pipe method when it's a pure lora, but just having the option is a step in the right direction.

As I mentioned above: a deepspeed config example for dual GPUs, where the model is split across both devices, would be great. Despite using axolotl for a while now, I still haven't gotten this to work. With qlora-pipe I literally only had to set pipeline_stages = 2 in the config file and it worked out of the box.
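For comparison, here is a rough sketch of what DeepSpeed pipeline parallelism across two GPUs looks like at the API level. This is the generic `deepspeed.pipe.PipelineModule` pattern, not qlora-pipe's exact wiring; `build_layer_sequence` and `ds_config.json` are placeholders.

```python
# Rough sketch of DeepSpeed pipeline parallelism across two GPUs: the model is
# expressed as a flat sequence of layers and DeepSpeed partitions it into
# num_stages, one stage per GPU. build_layer_sequence() is a placeholder for
# turning the transformer (embedding, blocks, head) into a list of nn.Modules.
import deepspeed
from deepspeed.pipe import PipelineModule

layers = build_layer_sequence()                          # placeholder
model = PipelineModule(layers=layers, num_stages=2)      # split across 2 GPUs
engine, _, _, _ = deepspeed.initialize(model=model, config="ds_config.json")
```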

Also: offload_mlp_to_cpu gives me a minor performance hit but uses a lot less VRAM. I was literally unable to do a 70B qlora fine tune in axolotl without running out of VRAM. In qlora-pipe (with unsloth activation checkpointing), I could do this with the exact same params (rank, context size, etc.) and it would use 16 + 15 GB of VRAM (i.e. 8 and 9 GB left unused). Admittedly it used a bit more during initial startup, but still.

Using the HQQ quantization option with nbits=4 and gate/up/down at nbits=2, I can do an 8192-context, rank-32 lora fine tune in qlora-pipe.
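For context, here is a hedged sketch of what that kind of mixed-bit HQQ setup looks like with the hqq library. The per-module mapping is illustrative; qlora-pipe's actual config keys and layer names differ.

```python
# Sketch of mixed-bit HQQ quantization: 4-bit for attention projections, 2-bit
# for the MLP gate/up/down projections. BaseQuantizeConfig is the hqq library's
# config object; the module names below are illustrative assumptions.
from hqq.core.quantize import BaseQuantizeConfig

attn_cfg = BaseQuantizeConfig(nbits=4, group_size=64)
mlp_cfg = BaseQuantizeConfig(nbits=2, group_size=16)

quant_config = {
    "self_attn.q_proj": attn_cfg, "self_attn.k_proj": attn_cfg,
    "self_attn.v_proj": attn_cfg, "self_attn.o_proj": attn_cfg,
    "mlp.gate_proj": mlp_cfg, "mlp.up_proj": mlp_cfg, "mlp.down_proj": mlp_cfg,
}
```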

chiragjn commented 3 weeks ago

This is quite interesting - pipeline parallelism would be nice to have for qlora on large models that cannot fit on a single GPU (although we have fsdp qlora now)

In fact, I have never even gotten deepspeed to work with axolotl

Just wondering what the problem is here - I have been training qlora using axolotl + deepspeed zero 2 for months now without major issues. It is known that deepspeed zero 3 does not work with qlora out of the box
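For anyone hitting the same wall, here is a minimal ZeRO-2 config sketch along the lines of what axolotl ships in its deepspeed examples. The field values are illustrative, not the exact shipped file.

```python
# Minimal DeepSpeed ZeRO stage 2 config: optimizer states and gradients are
# partitioned across GPUs, but the parameters stay fully replicated on each GPU.
# Values are illustrative; axolotl ships ready-made JSON configs like this.
import json

zero2_config = {
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": "auto"},
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("zero2.json", "w") as f:
    json.dump(zero2_config, f, indent=2)
```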

kallewoof commented 3 weeks ago

This is quite interesting - pipeline parallelism would be nice to have for qlora on large models that cannot fit on a single GPU (although we have fsdp qlora now)

fsdp qlora seems to come with significant overhead. At least in my case, I can't use my 2x24 GB VRAM cards to do L3 70B training, whereas I am able to with deepspeed.

Just wondering what the problem is here - I have been training qlora using axolotl + deepspeed zero 2 for months now without major issues. It is known that deepspeed zero 3 does not work with qlora out of the box

Correct me if I'm wrong, but deepspeed zero 2 does not split the model across the two devices, does it? Unfortunately I need to do that to fit 70B models.

chiragjn commented 3 weeks ago

Correct me if I'm wrong, but deepspeed zero 2 does not split the model across the two devices, does it? Unfortunately I need to do that to fit 70B models.

Correct, Zero 2 only splits optimizer states and gradients, not parameters. Zero 3 splits parameters but cannot deal with quantized weights.

In those cases pipeline parallelism via Deepspeed PP would make sense

kallewoof commented 3 weeks ago

Correct, Zero 2 only splits optimizer states and gradients, not parameters. Zero 3 splits parameters but cannot deal with quantized weights.

In those cases pipeline parallelism via Deepspeed PP would make sense

Thanks! Yeah I believe that's what I use in qlora-pipe.

kallewoof commented 3 weeks ago

Another example: I experimented with a command-r 35b fine tune. I stopped trying in axolotl around 4096 context when it OOM'd. In qlora-pipe, I am able to do 8192 context at rank 32, with over 8 GB of VRAM to spare.

I do suspect this is mostly due to me being unable to get deepspeed + qlora + unsloth working in axolotl, which I again stress is something we should have a clear and easy example for somewhere. Maybe just "use deepspeed" is enough and I'm simply having really bad luck finding the right docs, I don't know.

chiragjn commented 3 weeks ago

It is expected that qlora-pipe would be better at this. Deepspeed Zero 2 is just efficient DDP, which means that the base model and lora adapter weights have to fit entirely on each GPU. The only benefit of multiple GPUs is a larger global batch size - at least with lora, where the number of trainable params is so small that the effect of splitting optimizer and gradient states is negligible.

From this thread I don't know what GPUs you are using, but for command-r 35b qlora at 4 bit, rank 32, all linear layers, I would expect ~30-35 GB for 4096 tokens at bs=1 with gradient checkpointing.
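A quick back-of-the-envelope for where that estimate comes from (rough assumptions: 35B parameters, 4-bit base weights, activations dominating the remainder):

```python
# Rough sanity check on the ~30-35 GB figure above (assumptions, not measurements):
# 35B parameters at 4 bits is ~17.5 GB of base weights per GPU under DDP/ZeRO-2.
# LoRA rank-32 adapters plus their Adam states add only a few GB at most; the rest
# is activations (which grow with sequence length even with gradient checkpointing)
# plus temporary buffers and fragmentation.
params = 35e9
base_weights_gb = params * 0.5 / 1e9   # 4 bits = 0.5 bytes per parameter
print(f"base weights alone: ~{base_weights_gb:.1f} GB")
```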

Compare that to PP, where the model is sharded into groups of layers across multiple GPUs - so the more GPUs you have, the less memory is required per GPU for the base model, leaving more for longer sequences.

Deepspeed PP would be a great addition - I am certainly going to try it out for my own curiosity.

kallewoof commented 3 weeks ago

From this thread I don't know what GPUs you are using, but for command-r 35b qlora at 4 bit, rank 32, all linear layers, I would expect ~30-35 GB for 4096 tokens at bs=1 with gradient checkpointing.

I have 2x24 GB VRAM.