Examples don't run on Lambdalabs without various additional/manual steps

jphme commented 1 year ago

I used the setup script in the referenced issue (see here) to install the correct Python and PyTorch versions.

However, I got additional errors when running the examples, e.g.

accelerate launch scripts/finetune.py examples/openllama-3b/qlora.yml

ERROR:root:Exception raised attempting to load model, retrying with AutoModelForCausalLM
ERROR:root:/home/ubuntu/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cquantize_blockwise_fp16_nf4
Traceback (most recent call last):

or

accelerate launch scripts/finetune.py examples/openllama-3b/lora.yml

results in

ERROR:root:Exception raised attempting to load model, retrying with AutoModelForCausalLM
ERROR:root:/home/ubuntu/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats

I'm not very experienced: are these axolotl issues or issues with the downstream libraries?

Edit:

After force-reinstalling tensorflow, protobuf and wandb, the examples are running again:

python -m pip install --upgrade --force-reinstall tensorflow "protobuf<3.21" wandb

Originally posted by @jpdus in https://github.com/OpenAccess-AI-Collective/axolotl/issues/242#issuecomment-1620549837

NanoCode012 commented 1 year ago

Hello, may I ask which GPU you are using?

It would seem that bitsandbytes is causing issues. Have you tried installing it from source or downgrading by one version?

This also kind of underscores the need for versioning (#153).
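
Both tracebacks quoted earlier in this thread name `libbitsandbytes_cpu.so`, which suggests bitsandbytes loaded its CPU-only binary rather than a CUDA build; that binary would not be expected to export GPU kernels such as `cget_col_row_stats`. As a rough diagnostic sketch (the helper names `parse_undefined_symbol` and `is_cpu_only_bitsandbytes` are made up for illustration and are not part of bitsandbytes or axolotl), the error line can be parsed to confirm which library was loaded:

```python
import re

# Matches the loader error quoted in this thread:
# "<absolute path to .so>: undefined symbol: <name>"
_UNDEFINED_SYMBOL = re.compile(
    r"(?P<lib>/\S+?\.so[\w.]*): undefined symbol: (?P<symbol>\w+)"
)


def parse_undefined_symbol(log_line):
    """Return (library_path, symbol) from a loader error line, or None."""
    match = _UNDEFINED_SYMBOL.search(log_line)
    if match is None:
        return None
    return match.group("lib"), match.group("symbol")


def is_cpu_only_bitsandbytes(library_path):
    """True if the path names the CPU-only bitsandbytes binary."""
    return library_path.rsplit("/", 1)[-1] == "libbitsandbytes_cpu.so"


# Error line copied from the report above.
line = (
    "ERROR:root:/home/ubuntu/.local/lib/python3.9/site-packages/bitsandbytes/"
    "libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats"
)

parsed = parse_undefined_symbol(line)
if parsed is not None:
    lib, symbol = parsed
    print("library:", lib)
    print("missing symbol:", symbol)
    print("CPU-only build loaded:", is_cpu_only_bitsandbytes(lib))
```

If the CPU-only build is what got loaded, the likely next step is to check why the CUDA runtime was not detected, rather than to debug axolotl itself.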

jphme commented 1 year ago

I was using an A100 40GB (via lambdalabs) and installed bitsandbytes via the requirements.txt:

accelerate==0.20.3
axolotl==0.1
bitsandbytes==0.39.1
peft==0.4.0.dev0

python_version: 3.9.17
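
For anyone trying to reproduce this setup, a small check like the following could compare the active environment against the versions listed above. It is only a sketch: the `PINS` dictionary is copied from this comment, `find_mismatches` is a hypothetical helper, and the comparison is a plain string match rather than proper version-specifier handling.

```python
from importlib import metadata

# Versions as reported in the comment above.
PINS = {
    "accelerate": "0.20.3",
    "axolotl": "0.1",
    "bitsandbytes": "0.39.1",
    "peft": "0.4.0.dev0",
}


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map package -> (expected, found) for every pin that does not match."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:  # exact string comparison, by assumption
            mismatches[package] = (expected, found)
    return mismatches


for package, (expected, found) in find_mismatches(PINS).items():
    print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

The `lookup` parameter exists only so the comparison can be exercised without installing anything.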

After my experience I would also strongly support #153; I was about to give up...

NanoCode012 commented 1 year ago

@jpdus, may I ask if you have tried Docker? I usually have the best success with that.

I agree we need better versioning. It has been on the TODO list for a long time.

NanoCode012 commented 1 year ago

Hello, did you manage to get this all set up and running? If so, can we close this issue?

jphme commented 1 year ago

Yes, I managed to get everything up and running, but spent the better part of 2 days on it ;-).

Will close this, but would advise version pinning as proposed in #153 (or at least mentioning known "good" versions in the examples/tutorials; I created some requirements.txt lists and noted known "good" commits of the git installs for myself).
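
One possible way to record such a known "good" list from a working environment is sketched below. The `snapshot` helper is hypothetical, and it only captures released version numbers; the commit of a git install would still need to be noted separately.

```python
import platform
from importlib import metadata


def snapshot(packages, lookup=metadata.version):
    """Return requirements-style lines for the given installed packages."""
    # Record the interpreter version as a comment, since it mattered here.
    lines = [f"# python_version: {platform.python_version()}"]
    for name in sorted(packages):
        try:
            lines.append(f"{name}=={lookup(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name}: not installed")
    return lines


# Package names taken from this thread; adjust to the environment at hand.
print("\n".join(snapshot(["accelerate", "bitsandbytes", "peft"])))
```

Redirecting the output to a file gives a list that can be shared alongside an example config.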

Examples don't run on Lambdalabs without various additional/manual steps #260