NanoCode012 closed this issue 1 year ago.
This is due to the accelerate config being set to float16 or bfloat16. If you match the accelerate config's precision with the yaml, the error will be resolved.
@NanoCode012 I am facing the same issue when using the accelerate library. Can you provide more details on how you solved it? It would help greatly!
Sure @anshsarkar .
I was testing a config here https://github.com/OpenAccess-AI-Collective/axolotl/blob/2ba4ae8f461c0c491f9ca303c134f9ad6f725e8c/examples/openllama-3b/config.yml on a machine where the accelerate config precision was set to bf16 or fp16. This caused a mismatch. I simply changed the config to use None instead and it worked.
Vice versa, you can change the config.yml to match your accelerate config (recommended).
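For reference, here is a minimal sketch (not from the original config, just my suggestion) for checking what precision your environment is actually using. When the script is started with accelerate launch, this reports the mixed precision picked up from accelerate config, which is the value that has to agree with the fp16/bf16 setting in the yaml:

from accelerate import Accelerator

# Sketch only: when run via `accelerate launch`, this reflects the precision
# chosen in `accelerate config`.
accelerator = Accelerator()
print(accelerator.mixed_precision)  # "no", "fp16", or "bf16"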
Are you using axolotl or just accelerate in general?
@NanoCode012 I am using accelerate in general. Thanks for the input, I will try this and see.
You can check your code where you cast to a type and make sure it matches your accelerate config. @anshsarkar
import torch
import transformers
from accelerate import Accelerator
from transformers import LlamaForCausalLM, AutoTokenizer
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_int8_training,
    get_peft_model_state_dict,
)

# device_map_lm, tokenizer, train_data, val_data, LEARNING_RATE and OUTPUT_DIR
# are defined elsewhere in my script.
model_id = "ausboss/llama-30b-supercot"
accelerator = Accelerator(device_placement=False, mixed_precision="fp16", cpu=False)
model = LlamaForCausalLM.from_pretrained(model_id, device_map=device_map_lm, load_in_8bit=True, torch_dtype=torch.float16)
model = prepare_model_for_int8_training(model)
model = accelerator.prepare(model)
training_arguments = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=100,
    max_steps=20,
    learning_rate=LEARNING_RATE,
    fp16=True,
    logging_steps=1,
    optim="adamw_torch",
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=1,
    save_steps=1,
    output_dir=OUTPUT_DIR,
    save_total_limit=3,
    load_best_model_at_end=True,
    report_to="tensorboard",
    ddp_find_unused_parameters=False,
    # deepspeed=deepspeed_config
)
data_collator = transformers.DataCollatorForSeq2Seq(
    tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
)
trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=training_arguments,
    data_collator=data_collator,
)
model.config.use_cache = False
# Patch state_dict so that only the PEFT adapter weights are saved at checkpoint time.
old_state_dict = model.state_dict
model.state_dict = (
    lambda self, *_, **__: get_peft_model_state_dict(
        self, old_state_dict()
    )
).__get__(model, type(model))
model = torch.compile(model)
trainer.train()
model.save_pretrained(OUTPUT_DIR)
@NanoCode012 For the above I am getting the error you encountered and haven't been able to solve it yet. Your input would be really helpful. I am getting this error when I try to run trainer.train().
Also thanks in advance!
accelerator = Accelerator(device_placement=False, mixed_precision="fp16", cpu=False)
Change the mixed_precision in the above line to the precision you set when you ran accelerate config.
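A small sketch of that idea (my suggestion only, assuming the rest of your script stays the same): drop the hard-coded value so the Accelerator follows whatever precision accelerate launch passes in, and the two can never disagree:

from accelerate import Accelerator

# Sketch: no explicit mixed_precision here, so when started with `accelerate launch`
# the Accelerator uses the precision chosen in `accelerate config`.
accelerator = Accelerator(device_placement=False, cpu=False)
print(accelerator.mixed_precision)  # confirm it matches fp16=True in TrainingArguments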
Hi, I'm facing the same problem here; my code to create the Accelerator is the following:
accelerator = Accelerator()
My yaml look like this:
hyperparameters:
  dataloader_drop_last: True
  evaluation_strategy: "epoch"
  save_strategy: "epoch"
  logging_strategy: "epoch"
  num_train_epochs: 10
  auto_find_batch_size: True
  batch_size: 4
  max_steps: 1000
  eval_steps: 100
  save_steps: 1000
  logging_steps: 100
  per_device_train_batch_size: 8
  per_device_eval_batch_size: 8
  learning_rate: 1e-5
  lr_scheduler_type: "cosine"
  warmup_steps: 2000
  gradient_accumulation_steps: 1
  gradient_checkpointing: True
  sharded_ddp: False
  fsdp: False
  weight_decay: 0.0001
  run_name: "CodeT5-seq2seq-fine-tuned"
  ddp_find_unused_parameters: False
  fp16: True
  bf16: False
  auto_find_batch: True
  num_workers: 4
  max_prediction_length: 512
  beam_size: 5
  max_grad_norm: 5.0
  adam_epsilon: 1e-06
Even if I remove the fp16/bf16 settings or set them to False, the error still exists. What should I do to make it work?
@Luxios22, did you run accelerate config to make sure it matches with this yaml?
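If you're not sure what accelerate config actually saved, a small sketch like this prints the stored settings so you can compare them against the fp16/bf16 flags in your yaml (the path is accelerate's usual default location, so adjust it if yours differs):

from pathlib import Path

# Sketch: dump accelerate's saved default config for comparison with the training yaml.
cfg = Path.home() / ".cache/huggingface/accelerate/default_config.yaml"
if cfg.exists():
    print(cfg.read_text())
else:
    print("No saved accelerate config found; run `accelerate config` first.")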
@NanoCode012 I tried matching it with the yaml and setting it to None as well. Still getting the same error.
I'm not sure then, sorry. That was what fixed it for me. Make sure to try reinstalling the latest versions of the packages as well.
Hmmmm, sure. Thanks for the help!!
@anshsarkar @Luxios22 I managed to get my script working by downgrading to transformers==4.29.2; it seems like there were some changes from v4.30.0 onwards that introduced this issue. I opened an issue on the HF/transformers repo if you want to track it: https://github.com/huggingface/transformers/issues/24431
Edit: actually, that downgrade probably won't work for you if you're manually creating the Accelerator object... so it seems like it may be a problem with accelerate instead. Regardless, hopefully the HF folks will look into it.
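For anyone comparing notes, here's a quick sketch to print the versions in play, since the error seems to depend on the transformers/accelerate pairing:

import accelerate
import transformers

# Record both versions when reporting results; the downgrade above pins
# transformers, but the accelerate version matters too.
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)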
@NanoCode012 This issue is not resolved and should not be closed until, at the very least, an informative error message is given.
(And for whatever it's worth, I'm still struggling with this)
Hey @enn-nafnlaus , did you try downgrade following Steven's advice? That worked for me on newer machines. Could you list your steps to reproduce? axolotl or not etc?
Just got to that advice (was working my way down this thread) and it worked (might want to have that be the error message :) )... well, to the degree that it eliminated the 8th consecutive error in trying to get it to run. But now off to the 9th, which is three separate consecutive errors (that happen regardless of whether my yaml has load_in_4bit enabled):
ValueError: Autograd4bitQuantLinear() does not have a parameter or a buffer named weight.
ValueError: Autograd4bitQuantLinear() does not have a parameter or a buffer named weight.
TypeError: LlamaForCausalLM.__init__() got an unexpected keyword argument 'load_in_4bit'
Surely unrelated to the func error though.
Yes @enn-nafnlaus, this is the caveat of this method: 4bit wasn't implemented in this version. If you're not using it, you can simply comment out the lines; otherwise, I'm not sure how else to do it. It seems to be more of a general issue than axolotl's, as others above hit it despite not using axolotl.
The problem is, I don't even know "what I need" as far as configs go. Here are my goals:
So I'm trying to follow the example and the "guide" on the axolotl project page, but there's tons of parameters and config options, and everything seems to lead down a "that's broken, with an obscure error message" road. :(
So should I be commenting out some line of code? Which lines of code? There's a whole stacktrace:
Traceback (most recent call last):
  File "/home/username/axolotl/src/axolotl/utils/models.py", line 194, in load_model
    model, _ = load_llama_model_4bit_low_ram(
  File "/home/username/.local/lib/python3.10/site-packages/alpaca_lora_4bit/autograd_4bit.py", line 249, in load_llama_model_4bit_low_ram
    model = accelerate.load_checkpoint_and_dispatch(
  File "/home/username/.local/lib/python3.10/site-packages/accelerate/big_modeling.py", line 486, in load_checkpoint_and_dispatch
    load_checkpoint_in_model(
  File "/home/username/.local/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 1116, in load_checkpoint_in_model
    set_module_tensor_to_device(model, param_name, param_device, value=param, dtype=dtype)
  File "/home/username/.local/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 149, in set_module_tensor_to_device
    raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_name}.")
ValueError: Autograd4bitQuantLinear() does not have a parameter or a buffer named weight.
(the same traceback is printed a second time)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/username/axolotl/scripts/finetune.py", line 352, in
But this is getting off topic - I'll start a new thread :) The only thing that applies to this thread is that a more helpful error message is needed.
New thread. https://github.com/OpenAccess-AI-Collective/axolotl/issues/259
So I've been doing more testing, and regardless of peft / gptq install status, doing the transformers==4.29.2 downgrade always leads directly into:
TypeError: LlamaForCausalLM.__init__() got an unexpected keyword argument 'load_in_4bit'
Indeed, while I can train the example lora yml (not the example model yml) if I do the right combination of install steps (no deepspeed, no torch compiling, no low-bit floating point config, and the peft-pull rather than the gptq requirements install), once I do the downgrade, neither loras nor models can be trained - both hit the above error.
TypeError: LlamaForCausalLM.__init__() got an unexpected keyword argument 'load_in_4bit'
Hello @enn-nafnlaus, sorry for the late reply. I forgot to mention which lines. It's mainly this line here, since axolotl has been written for a newer transformers version, so this hack is unfortunately needed: https://github.com/OpenAccess-AI-Collective/axolotl/blob/b9b7d4ce9292739d7bd3b6113e54786f45db7462/src/axolotl/utils/models.py#L213
Do note: commenting this out would not allow QLoRA. Make sure that when you pip install, you use -e as stated in the docs.
Edit: feel free to discuss this in the other thread, which seems more appropriate than this one.
I came across this again when I needed to use a newer version of transformers. The thing that fixed it this time was using the latest accelerate:
transformers==4.31.0
accelerate==0.21.0
Edit: See here for one solution: https://github.com/OpenAccess-AI-Collective/axolotl/issues/195#issuecomment-1603189889
I'm noticing a crash in the latest git commit.
safe: 01248253a3e8aedba6d473469dc839cd368bfe3c
crash: f31a338cbbdcd76a5af35e400eb9e0e8cae36b72
Command:
accelerate launch scripts/finetune.py examples/openllama-3b/config.yml
(point to the proper config depending on the commit)