BangLab-UdeM-Mila / NLP4MatSci-HoneyBee

This repository contains the implementation for our EMNLP 2023 paper: HoneyBee: Progressive Instruction Finetuning of Large Language Models for Materials Science

Unable to finetune Honeybee/Llama2 with matsci-nlp data. #3

Open · wiseyy opened this issue 9 months ago

wiseyy commented 9 months ago

Hi, while finetuning Llama2 with the matsci-nlp data provided in the repo, I ran into the following issues.

I finetuned the model using the code given in uniform_finetune.py.

  1. The default cutoff length for the prompt/generated text is 512, but plenty of examples exceed this limit, so I had to increase it to 1024 or 2048. Was the original model really finetuned with a cutoff of 512? If so, how? (See the tokenization sketch below.)
  2. The finetuning is done in a Seq2Seq manner rather than for text generation. Why is that? Even after finetuning, the output is not easy to evaluate with the scripts given in the repo: it is not a single word (a named entity in the case of NER).
[Screenshot (2024-01-19) of the model output attached]
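To illustrate the cutoff question, here is a minimal sketch of how a cutoff length is typically applied when tokenizing alpaca-style instruction prompts; the `cutoff_len` value, tokenizer name, and helper function are assumptions, not the repo's exact code.

```python
from transformers import AutoTokenizer

# Assumed values for illustration; not necessarily what uniform_finetune.py uses.
cutoff_len = 2048
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def tokenize(prompt: str) -> dict:
    # Anything beyond max_length is silently truncated, which is why a
    # 512-token cutoff would clip the long matsci-nlp examples.
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=cutoff_len,
        padding=False,
    )
    # For causal-LM finetuning the labels usually start as a copy of input_ids.
    result["labels"] = result["input_ids"].copy()
    return result
```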

Can you please clarify the above points, and let us know whether you performed additional post-processing before evaluating HoneyBee's output on the matsci-nlp dataset?

yusonghust commented 9 months ago
  1. No, the original model was finetuned with a cutoff length of 2048, so you can change the hyper-params accordingly.
  2. The finetuning is done for text generation; you can check the code carefully. Llama is a decoder-only language model, so it can only predict the next token based on the previous tokens. (See the generation sketch below.)
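Here is a minimal sketch (not the repo's evaluation code; the model name and prompt template are placeholders) of what decoder-only generation looks like and how the answer can be recovered from the template for evaluation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; the HoneyBee checkpoint would go here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Assumed alpaca-style template; the repo's exact template may differ.
prompt = (
    "Below is an instruction that describes a task.\n\n"
    "### Instruction:\nIdentify the named entity in: LiCoO2 is a cathode material.\n\n"
    "### Response: "
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)

text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
# The decoded text echoes the prompt, so evaluation typically keeps only what
# follows the "### Response:" marker (and may trim it further to a single span).
answer = text.split("### Response:")[-1].strip()
print(answer)
```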
wiseyy commented 9 months ago
[Screenshot (2024-01-21) of the relevant data collator code in uniform_finetune.py attached]
  1. Why do you use the Seq2Seq collator here and not DataCollatorForLanguageModeling? I think it is there to compute the loss only on the answer, given the instruction and the input, and the behaviour should be the same (at least in principle) if you change the attention mask to attend to the tokens that follow "### Response: ". Can you please confirm this? (See the masking sketch after this list.)

  2. Do you also use the prompts given in uniform_finetune.py when finetuning on matsci-nlp?
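For the collator question, here is a sketch of my understanding: the prompt portion gets label -100 so only the response contributes to the loss, and DataCollatorForSeq2Seq simply pads `labels` alongside `input_ids`, whereas DataCollatorForLanguageModeling rebuilds the labels from `input_ids` and would score the prompt tokens too. The helper function and cutoff value below are assumptions, not the repo's code.

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

def build_example(full_prompt: str, prompt_only: str, cutoff_len: int = 2048) -> dict:
    # Tokenize the full text (instruction + input + response) and the prompt prefix.
    full = tokenizer(full_prompt, truncation=True, max_length=cutoff_len)
    prefix_len = len(
        tokenizer(prompt_only, truncation=True, max_length=cutoff_len)["input_ids"]
    )
    # Copy input_ids into labels and mask the prefix with -100 so the loss
    # is computed only on the tokens after "### Response:".
    labels = list(full["input_ids"])
    labels[:prefix_len] = [-100] * prefix_len
    full["labels"] = labels
    return full

# Pads input_ids/attention_mask with the pad token and labels with -100.
collator = DataCollatorForSeq2Seq(
    tokenizer, padding=True, label_pad_token_id=-100, return_tensors="pt"
)
```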

Thank you for your help!

yusonghust commented 9 months ago

  1. Please refer to https://github.com/tloen/alpaca-lora/issues/412
  2. Yes, and it is finetuned under a low-resource setting. (See the LoRA sketch below.)
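For reference, a hedged sketch of a low-resource LoRA setup with PEFT; the rank, target modules, and base model shown here are assumptions, not necessarily the exact HoneyBee configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Assumed base model and LoRA hyper-parameters, for illustration only.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # a common choice for Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# Only the small adapter matrices are trained, which keeps the setup low-resource.
model.print_trainable_parameters()
```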