Axolotl
- Uses pytorch as its backbone
- Does not support CPU finetuning
- Basic example of finetuning:
accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml
taking ~6 hrs on an RTX A6000

Though it is possible to build the bitsandbytes package without GPU support, install the CPU version of pytorch, and disable the flash-attention library (which requires CUDA), the cascade of errors still goes on.
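A quick way to see where that cascade of errors starts is to probe the stack before launching anything. A minimal sketch, assuming a CPU-only torch wheel and whichever of bitsandbytes / flash-attn made it into the environment (the probe itself is illustrative, not part of axolotl):

# Sketch: probe the CPU-only stack before launching axolotl.
# Expectation on a CPU-only box: CUDA reported unavailable, and the
# CUDA-dependent packages (flash_attn in particular) fail to import.
import importlib
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

for pkg in ("bitsandbytes", "flash_attn"):
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: import OK")
    except Exception as exc:
        print(f"{pkg}: {type(exc).__name__}: {exc}")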
Steps to possibly run on CPU
Unfortunately, AMD processors are not supported at the moment, so I can't test the CPU implementation right now.
CUDA_VISIBLE_DEVICES="" llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
[INFO|trainer.py:698] 2024-11-18 13:45:32,215 >> Using cpu_amp half precision backend
[INFO|trainer.py:2313] 2024-11-18 13:45:32,647 >> ***** Running training *****
[INFO|trainer.py:2314] 2024-11-18 13:45:32,647 >> Num examples = 981
[INFO|trainer.py:2315] 2024-11-18 13:45:32,647 >> Num Epochs = 3
[INFO|trainer.py:2316] 2024-11-18 13:45:32,647 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2319] 2024-11-18 13:45:32,648 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:2320] 2024-11-18 13:45:32,648 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2321] 2024-11-18 13:45:32,648 >> Total optimization steps = 366
[INFO|trainer.py:2322] 2024-11-18 13:45:32,652 >> Number of trainable parameters = 20,971,520
0%| | 0/366 [00:00<?, ?it/s]/home/user/LLaMA-Factory/venv/lib/python3.10/site-packages/transformers/trainer.py:3536: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
ctx_manager = torch.cpu.amp.autocast(cache_enabled=cache_enabled, dtype=self.amp_dtype)
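The FutureWarning in the log is transformers still using the older torch.cpu.amp.autocast spelling; the device-agnostic replacement it points to works fine on CPU. A minimal sketch of the suggested form, assuming bfloat16 autocast (the tensors here are illustrative):

import torch

x = torch.randn(4, 4)
w = torch.randn(4, 4)

# Deprecated spelling used in transformers/trainer.py above:
#   with torch.cpu.amp.autocast(cache_enabled=True, dtype=torch.bfloat16): ...
# Replacement suggested by the warning:
with torch.amp.autocast("cpu", dtype=torch.bfloat16):
    y = x @ w
print(y.dtype)  # torch.bfloat16: the matmul ran under CPU autocast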
Need to experiment with these solutions and decide if LoRA can be implemented on CPU.
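As a sanity check that LoRA itself has nothing GPU-specific in it, here is a minimal sketch using peft + transformers on CPU. The model name and hyperparameters are illustrative assumptions for a smoke test, not anything from this issue:

# Minimal sketch: LoRA adapters attached to a tiny causal LM, forward/backward on CPU.
# Assumes peft and transformers are installed; "sshleifer/tiny-gpt2" is a tiny
# placeholder model chosen only so the smoke test runs quickly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "sshleifer/tiny-gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # stays on CPU by default

lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

batch = tokenizer("LoRA forward/backward smoke test on CPU", return_tensors="pt")
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()  # gradients flow through the LoRA adapters on CPU
print("loss:", out.loss.item())

If this runs, the remaining CPU blockers are in the surrounding tooling (bitsandbytes, flash-attention, launcher defaults) rather than in LoRA itself.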