Open kallewoof opened 1 month ago
@kallewoof thanks for surfacing this. There's definitely a lot going on here, and it might be useful to figure out which of these are the important takeaways and how they can be integrated in a sustainable manner.
I think we can integrate merge_lora behind a new flag that uses that methodology instead of the standard method. I think we would want to keep the original merging script for adapters that aren't purely lora.
> @kallewoof thanks for surfacing this. There's definitely a lot going on here, and it might be useful to figure out which of these are the important takeaways and how they can be integrated in a sustainable manner.
> I think we can integrate merge_lora behind a new flag that uses that methodology instead of the standard method. I think we would want to keep the original merging script for adapters that aren't purely lora.
Good point on keeping both. It would be great if it defaulted to the qlora-pipe method when the adapter is a pure lora, but just having the option is a step in the right direction.
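For context, the pure-lora merge path the flag would gate is small enough to sketch. This is a generic illustration, assuming standard PEFT-style A/B matrices and alpha/r scaling; the function name and tensor layout are hypothetical, not axolotl's or qlora-pipe's actual merge code:

```python
import torch

def merge_lora_pair(base_weight: torch.Tensor,
                    lora_A: torch.Tensor,   # shape (r, in_features)
                    lora_B: torch.Tensor,   # shape (out_features, r)
                    alpha: float,
                    r: int) -> torch.Tensor:
    """Fold one LoRA pair back into its base weight: W' = W + (alpha/r) * B @ A.

    The matmul is done in float32 so the low-rank update isn't truncated
    before casting back to the base dtype.
    """
    delta = (lora_B.float() @ lora_A.float()) * (alpha / r)
    return (base_weight.float() + delta).to(base_weight.dtype)
```

Since this only makes sense when every adapted module is a plain LoRA pair, keeping the original merging script around for anything else (as suggested above) seems right.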
As I mentioned above: a deepspeed config example for dual GPUs where the model is split across both devices would help. Despite using axolotl for a while now, I still haven't gotten this to work. With qlora-pipe I literally only had to set pipeline_stages = 2 in the config file and it worked out of the box.
Also: offload_mlp_to_cpu gives me a minor performance hit but uses a lot less VRAM. I was literally unable to do a 70B qlora fine tune in Axolotl without running out of VRAM. In qlora-pipe (with unsloth activation_checkpointing), I could do this with the exact same params (rank, context size, etc.) and it would use 16 + 15 GB VRAM (i.e. 8 and 9 GB left unused). Admittedly it used a bit more during initial startup, but still.
Using the HQQ quantization option with nbits=4 and gate/up/down at nbits=2, I can do an 8192-context, rank-32 lora fine tune in qlora-pipe.
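For anyone unfamiliar with mixed-bit HQQ, the idea of 4-bit attention projections with 2-bit gate/up/down MLP projections looks roughly like the sketch below, using the hqq library. The layer-name matching and group sizes are illustrative assumptions, not qlora-pipe's actual config handling, and the exact hqq API may differ between versions:

```python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# 4-bit for attention projections, 2-bit for the large MLP matrices.
attn_cfg = BaseQuantizeConfig(nbits=4, group_size=64)
mlp_cfg = BaseQuantizeConfig(nbits=2, group_size=16)

MLP_KEYS = ("gate_proj", "up_proj", "down_proj")  # assumed Llama-style module names

def quantize_linear(name: str, layer: torch.nn.Linear) -> HQQLinear:
    # Pick the 2-bit config for gate/up/down, 4-bit for everything else,
    # then replace the fp16/bf16 linear with its HQQ-quantized counterpart.
    cfg = mlp_cfg if any(k in name for k in MLP_KEYS) else attn_cfg
    return HQQLinear(layer, quant_config=cfg, compute_dtype=torch.bfloat16, device="cuda")
```

The gate/up/down matrices account for most of the parameters in Llama-style blocks, which is why dropping just those to 2-bit recovers most of the VRAM savings.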
This is quite interesting; pipeline parallelism would be nice to have for qlora on large models that cannot fit on a single gpu (although we have fsdp qlora now)
> In fact, I have never even gotten deepspeed to work with axolotl
Just wondering what the problem is here. I have been training qlora using axolotl + deepspeed zero 2 for months now without major issues. It is known that deepspeed zero 3 does not work with qlora out of the box.
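For concreteness, here is a minimal sketch of what a ZeRO-2 style DeepSpeed config contains, written as the Python dict you could hand to DeepSpeed; the values are illustrative, not a recommended axolotl config. The key point for this thread is that stage 2 partitions only optimizer states and gradients, so the full (quantized) base model still has to fit on every GPU:

```python
ds_zero2_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                   # shard optimizer states + gradients across GPUs
        "overlap_comm": True,
        "contiguous_gradients": True,
        # No parameter sharding at stage 2: each GPU keeps a full model copy,
        # which is why ZeRO-2 alone cannot fit a 70B base model on 2x24 GB cards.
    },
}
```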
> This is quite interesting; pipeline parallelism would be nice to have for qlora on large models that cannot fit on a single gpu (although we have fsdp qlora now)
fsdp qlora seems to come with significant overhead. At least in my case I can't use my 2x24 GB VRAM cards to do Llama 3 70B training, whereas I am able to with deepspeed.
> Just wondering what the problem is here. I have been training qlora using axolotl + deepspeed zero 2 for months now without major issues. It is known that deepspeed zero 3 does not work with qlora out of the box.
Correct me if I'm wrong, but deepspeed zero 2 does not split the model across the two devices, does it? Unfortunately I need to do that to fit 70B models.
> Correct me if I'm wrong, but deepspeed zero 2 does not split the model across the two devices, does it? Unfortunately I need to do that to fit 70B models.
Correct, Zero 2 only splits optimizer states and gradients, not parameters. Zero 3 splits parameters but cannot deal with quantized weights.
In those cases, pipeline parallelism via Deepspeed PP would make sense.
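To illustrate what that would look like, DeepSpeed's pipeline engine takes the model as a flat list of layers and cuts it into stages, one stage per group of GPUs. The sketch below uses a hypothetical decoder-style stack; it is not how axolotl or qlora-pipe actually builds the model, and it needs to run under the deepspeed launcher with two processes:

```python
import torch.nn as nn
from deepspeed.pipe import PipelineModule, LayerSpec

# Hypothetical decoder-only model expressed as a flat list of LayerSpecs so
# DeepSpeed can partition it into pipeline stages.
layers = (
    [LayerSpec(nn.Embedding, 32000, 4096)]
    + [LayerSpec(nn.TransformerEncoderLayer, 4096, 32) for _ in range(32)]
    + [LayerSpec(nn.Linear, 4096, 32000)]
)

# num_stages=2 splits the layer list across two GPUs - analogous to setting
# pipeline_stages = 2 in the qlora-pipe config mentioned earlier.
model = PipelineModule(layers=layers, num_stages=2, partition_method="parameters")
```

Each GPU then only holds roughly half of the base model's layers, which is where the per-GPU memory savings relative to ZeRO-2/DDP come from.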
> Correct, Zero 2 only splits optimizer states and gradients, not parameters. Zero 3 splits parameters but cannot deal with quantized weights.
> In those cases, pipeline parallelism via Deepspeed PP would make sense.
Thanks! Yeah I believe that's what I use in qlora-pipe.
Another example: I experimented with a command-r 35b fine tune. I stopped trying in axolotl around 4096 context when it OOM'd. I am able to do 8192 context with over 8 GB of VRAM to spare in qlora-pipe, at rank 32.
I do suspect this is mostly due to me being unable to get deepspeed + qlora + unsloth working in axolotl, which, I again stress, is something we should have a clear and easy example of available somewhere. Maybe just "use deepspeed" is enough and I'm simply having really bad luck finding the right docs; I don't know.
It is expected that qlora-pipe would be better at this. Deepspeed Zero 2 is just efficient DDP, which means that the base model and lora adapter weights have to fit entirely on each gpu. The only benefit of multiple gpus is a larger global batch size - at least with lora, where the number of trainable params is so small that the effect of splitting optimizer and gradient states is negligible.
From this thread I don't know what GPUs you are using, but for a command-r 35b qlora at 4-bit, rank 32 on all linear layers, I would expect ~30-35 GB for 4096 tokens at bs=1 with gradient checkpointing.
Compare that to PP, where the model is sharded into groups of layers across multiple gpus - so the more gpus, the less memory required per gpu for the base model, leaving more room for longer sequences.
Deepspeed PP would be a great addition - I am certainly going to try it out for my own curiosity.
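That 30-35 GB figure is easy to sanity-check with a back-of-envelope estimate. The numbers below are rough assumed values (quantization metadata, CUDA context, and framework overhead are ignored), not measurements:

```python
# Rough per-GPU memory estimate for a 35B-parameter qlora run in the DDP/ZeRO-2
# setting, where the whole quantized base model lives on every GPU.
params = 35e9
base_4bit_gb = params * 0.5 / 1e9   # 4-bit weights ~= 0.5 bytes/param -> ~17.5 GB
lora_and_optimizer_gb = 3           # rank-32 adapters on all linear layers + Adam states (rough)
activations_gb = 10                 # rough guess for 4096 tokens at bs=1 with gradient checkpointing

total_gb = base_4bit_gb + lora_and_optimizer_gb + activations_gb
print(f"~{total_gb:.0f} GB per GPU")  # lands around 30 GB, consistent with the estimate above

# With 2-stage pipeline parallelism the base-model term is roughly halved per GPU
# (~8.75 GB), which is where the headroom for longer sequences comes from.
```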
> From this thread I don't know what GPUs you are using, but for a command-r 35b qlora at 4-bit, rank 32 on all linear layers, I would expect ~30-35 GB for 4096 tokens at bs=1 with gradient checkpointing.
I have 2x24 GB VRAM.
🔖 Feature description
Recently I have been spending a lot of time playing with https://github.com/tdrussell/qlora-pipe and it has some features that are downright amazing.
✔️ Solution
Learn from and adopt his code design decisions in areas where this makes sense.