Axolotl
- Uses pytorch as its backbone
- Does not support CPU finetuning
- Basic example of finetuning:
accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml
taking ~6 hrs on an RTX A6000

Though it is possible to build the bitsandbytes package without GPU support, install the CPU version of pytorch, and disable the flash-attention library (which requires CUDA), the cascade of errors still goes on.
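A quick way to see where that cascade of errors starts is to probe the stack before launching anything. A minimal sketch, assuming a CPU-only torch wheel and whichever of bitsandbytes / flash-attn made it into the environment (the probe itself is illustrative, not part of axolotl):

# Sketch: probe the CPU-only stack before launching axolotl.
# Expectation on a CPU-only box: CUDA reported unavailable, and the
# CUDA-dependent packages (flash_attn in particular) fail to import.
import importlib
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

for pkg in ("bitsandbytes", "flash_attn"):
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: import OK")
    except Exception as exc:
        print(f"{pkg}: {type(exc).__name__}: {exc}")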
Steps to possibly run on CPU
Unfortunately, AMD processors are not supported at the moment, so I can't test the CPU implementation right now.
CUDA_VISIBLE_DEVICES="" llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
[INFO|trainer.py:698] 2024-11-18 13:45:32,215 >> Using cpu_amp half precision backend
[INFO|trainer.py:2313] 2024-11-18 13:45:32,647 >> ***** Running training *****
[INFO|trainer.py:2314] 2024-11-18 13:45:32,647 >> Num examples = 981
[INFO|trainer.py:2315] 2024-11-18 13:45:32,647 >> Num Epochs = 3
[INFO|trainer.py:2316] 2024-11-18 13:45:32,647 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2319] 2024-11-18 13:45:32,648 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:2320] 2024-11-18 13:45:32,648 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2321] 2024-11-18 13:45:32,648 >> Total optimization steps = 366
[INFO|trainer.py:2322] 2024-11-18 13:45:32,652 >> Number of trainable parameters = 20,971,520
0%| | 0/366 [00:00<?, ?it/s]/home/user/LLaMA-Factory/venv/lib/python3.10/site-packages/transformers/trainer.py:3536: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
ctx_manager = torch.cpu.amp.autocast(cache_enabled=cache_enabled, dtype=self.amp_dtype)
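The FutureWarning in the log is transformers still using the older torch.cpu.amp.autocast spelling; the device-agnostic replacement it points to works fine on CPU. A minimal sketch of the suggested form, assuming bfloat16 autocast (the tensors here are illustrative):

import torch

x = torch.randn(4, 4)
w = torch.randn(4, 4)

# Deprecated spelling used in transformers/trainer.py above:
#   with torch.cpu.amp.autocast(cache_enabled=True, dtype=torch.bfloat16): ...
# Replacement suggested by the warning:
with torch.amp.autocast("cpu", dtype=torch.bfloat16):
    y = x @ w
print(y.dtype)  # torch.bfloat16: the matmul ran under CPU autocast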
Need to experiment with these solutions and decide if LoRA can be implemented on CPU.
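As a sanity check that LoRA itself has nothing GPU-specific in it, here is a minimal sketch using peft + transformers on CPU. The model name and hyperparameters are illustrative assumptions for a smoke test, not anything from this issue:

# Minimal sketch: LoRA adapters attached to a tiny causal LM, forward/backward on CPU.
# Assumes peft and transformers are installed; "sshleifer/tiny-gpt2" is a tiny
# placeholder model chosen only so the smoke test runs quickly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "sshleifer/tiny-gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # stays on CPU by default

lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

batch = tokenizer("LoRA forward/backward smoke test on CPU", return_tensors="pt")
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()  # gradients flow through the LoRA adapters on CPU
print("loss:", out.loss.item())

If this runs, the remaining CPU blockers are in the surrounding tooling (bitsandbytes, flash-attention, launcher defaults) rather than in LoRA itself.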