ContextualAI / gritlm

Generative Representational Instruction Tuning
https://arxiv.org/abs/2402.09906
MIT License

When training a unified model, TypeError: MistralForCausalLM.forward() got an unexpected keyword argument 'is_causal' #41

Open zillion-zhao opened 2 weeks ago

zillion-zhao commented 2 weeks ago

Hello!

I ran into a problem when training the model in unified mode.

First, I would like to mention that when I evaluate several models from the released artifacts (for example bbcc-mean, cccc-lasttoken, and cccc-wmean), the same error appears: TypeError: MistralForCausalLM.forward() got an unexpected keyword argument 'is_causal'.

To tackle the problem, my understanding is that the is_causal argument is only meaningful when the model is loaded with the MistralForCausalLM class from modeling_gritlm7b.py. If modeling_gritlm7b.py is not placed in the model directory, the model is loaded as the MistralForCausalLM from the transformers library, whose forward() does not accept "is_causal". Besides, I think the model config file should also be modified by adding:

"auto_map": {
    "AutoModel": "modeling_gritlm7b.MistralModel",
    "AutoModelForCausalLM": "modeling_gritlm7b.MistralForCausalLM",
    "AutoModelForSequenceClassification": "modeling_gritlm7b.MistralForSequenceClassification"
},
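To double-check this, a minimal sketch (my own snippet, assuming the local directory ../models/Mistral-7B contains modeling_gritlm7b.py plus the auto_map entries above) of verifying which class transformers actually resolves:

from transformers import AutoModelForCausalLM

# trust_remote_code=True is needed so the "auto_map" entries resolve to the
# custom classes in modeling_gritlm7b.py instead of the built-in ones.
model = AutoModelForCausalLM.from_pretrained(
    "../models/Mistral-7B",
    trust_remote_code=True,
)
print(type(model))  # expect transformers_modules...modeling_gritlm7b.MistralForCausalLM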

I fixed the issue for evaluation by applying the changes above and it works. However, I run into the same problem when I train the model: I downloaded Mistral-7B, added modeling_gritlm7b.py, and modified the config file, but it still shows TypeError: MistralForCausalLM.forward() got an unexpected keyword argument 'is_causal'.

I guessed that maybe the model is not loaded correctly, so I printed the type of the model in run.py after loading it:

model = GritLMTrainModel(
    model_name_or_path=model_args.model_name_or_path,
    normalized=model_args.normalized,
    pooling_method=model_args.pooling_method,
    negatives_cross_device=training_args.negatives_cross_device,
    temperature=training_args.temperature,
    mode=training_args.mode,
    projection=model_args.projection,
    attn=model_args.attn,
    attn_implementation=model_args.attn_implementation,
    torch_dtype=args_to_dtype(training_args),
    loss_gen_type=training_args.loss_gen_type,
    loss_gen_factor=training_args.loss_gen_factor,
    use_cache=False,
    # Critical to make Mixtral work
    low_cpu_mem_usage=True,
    quantization_config=quantization_config,
    load_in_4bit=load_in_4bit,
)
print(type(model.model))

The result is <class 'transformers_modules.Mistral-7B.modeling_gritlm7b.MistralForCausalLM'>, which is correct. So what is the problem? How can I modify the code to make it work?

The training command:

torchrun --nproc_per_node 1 \
    -m training.run \
    --output_dir output_dir \
    --model_name_or_path ../models/Mistral-7B \
    --train_data ../data/unified_data \
    --learning_rate 1e-5 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 5 \
    --per_device_generative_bs 1 \
    --dataloader_drop_last True \
    --normalized True \
    --temperature 0.02 \
    --query_max_len 32 \
    --passage_max_len 128 \
    --train_group_size 2 \
    --mode unified \
    --max_steps 1253 \
    --attn cccc \
    --overwrite_output_dir \
    --lora

Waiting for your kind reply! :)

Muennighoff commented 2 weeks ago

If you are certain you are using https://github.com/ContextualAI/gritlm/blob/main/scripts/modeling_mistral_gritlm.py or https://huggingface.co/GritLM/GritLM-7B/blob/main/modeling_gritlm7b.py, then I am not sure what the problem is. Maybe run pip show transformers to locate your installation and replace its modeling_mistral.py with one of those files. Otherwise, this seems like a simple issue that can be solved by debugging with print statements.
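For reference, a small sketch (not from the repo) of locating the installed modeling_mistral.py that the pip show transformers suggestion points at:

import os
import transformers

# The file to inspect/replace lives inside the installed transformers package.
print(transformers.__version__)
print(os.path.join(os.path.dirname(transformers.__file__),
                   "models", "mistral", "modeling_mistral.py"))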

zillion-zhao commented 2 weeks ago

Yes, maybe there is some small problem. I printed the type of the model in training/model.py:

def encode(self, features):
    print(type(self.model))

and it shows: <class 'peft.peft_model.PeftModel'>
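(For what it's worth, a hypothetical diagnostic sketch: the PEFT wrapper keeps the original model underneath, so one can check whether the custom class is still the one being wrapped.)

from peft import PeftModel

def show_wrapped_class(model):
    # PeftModel holds the original (LoRA-injected) model underneath.
    print(type(model))
    if isinstance(model, PeftModel):
        print(type(model.get_base_model()))  # should still be the gritlm MistralForCausalLM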

Maybe LoRA changes the model type? I am not sure about it. Do you train the model with full fine-tuning?

zillion-zhao commented 2 weeks ago

When I remove --lora, it shows CUDA out of memory ^ ^. So maybe it really is due to LoRA. I could use more GPUs, but why does LoRA change the model type?

Muennighoff commented 2 weeks ago

I see, yes it could be because of LoRA. I think the PEFT library wraps the transformer model, and this can change which kwargs are passed through. You may need to change something in https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/model.py to pass it through.
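As an illustration only (a sketch, not the GritLM code): one alternative to patching peft would be to unwrap the PeftModel before the call so that extra kwargs such as is_causal reach the custom MistralForCausalLM.forward directly.

from peft import PeftModel

def forward_unwrapped(model, is_causal, **inputs):
    # Bypass the PEFT wrapper's forward so extra kwargs such as `is_causal`
    # are handed straight to the underlying (custom) model.
    base = model.get_base_model() if isinstance(model, PeftModel) else model
    return base(**inputs, is_causal=is_causal)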

We do full fine-tuning; I haven't really tried LoRA with GRIT.

zillion-zhao commented 2 weeks ago

I see. Thank you for your reply!