AGI-Edgerunners / LLM-Adapters

Code for our EMNLP 2023 Paper: "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models"
https://arxiv.org/abs/2304.01933
Apache License 2.0

Full-Parameter Fine-Tuning on commonsense #62

Closed lucasliunju closed 2 months ago

lucasliunju commented 3 months ago

Hi, may I ask how to run the full-parameter fine-tuning experiments on commonsense reasoning with LLM-Adapters?

lucasliunju commented 3 months ago

Hi, I find that full-parameter fine-tuning is not stable with torch.float16. After changing to float32 and tuning the learning rate, the performance is good.

HZQ950419 commented 3 months ago

Hi, for full fine-tuning, changing from fp16 training to bf16 training should also work.
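A minimal sketch of what that change could look like with the Hugging Face Trainer, assuming a standard transformers setup (the base model name and hyper-parameter values below are placeholders, not taken from this repo's scripts):

import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# Load the weights in bfloat16 instead of float16 (assumes an Ampere-or-newer GPU).
model = AutoModelForCausalLM.from_pretrained(
    "yahma/llama-7b-hf",          # placeholder base model
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Enable bf16 mixed-precision training instead of fp16.
training_args = TrainingArguments(
    output_dir="./full-ft-commonsense",
    per_device_train_batch_size=4,
    learning_rate=2e-5,           # full fine-tuning usually wants a smaller lr than LoRA
    bf16=True,                    # was: fp16=True
)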

lucasliunju commented 3 months ago

Hi @HZQ950419

Thanks for your reply. May I ask how to set "adapter_name" for full-parameter fine-tuning?

Thanks

HZQ950419 commented 3 months ago

For full fine-tuning, you may refer to https://github.com/HZQ950419/LLM-Adapters/blob/main/full_finetune.py.

lucasliunju commented 3 months ago

Hi @HZQ950419, thanks for your reply. May I ask another question: I tried changing torch.float16 to torch.float32 in finetune.py:

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=False,
    torch_dtype=torch.float16,  # this is the line I change to torch.float32
    device_map={"": int(os.environ.get("LOCAL_RANK") or 0)},
    trust_remote_code=True,
)

But I find the accuracy is lower than the result with torch.float16. Do you think I still need to tune the hyperparameters?

Best

lucasliunju commented 2 months ago

Hi @HZQ950419 When I try to evaluate on the BoolQ task, I find the prediction is sometimes empty, which makes the test accuracy very low. When I change the code from outputs = [o.split("### Response:")[1].strip() for o in outputs] to outputs = [o.split("### Response:")[-1].strip() for o in outputs], it works well.
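A minimal illustration of why the index matters, assuming the decoded text can contain the "### Response:" marker more than once (the example string below is made up):

# Hypothetical output in which the model repeats the "### Response:" marker.
o = ("Below is an instruction ... ### Response:\n"
     "### Response:\ntrue")

# [1] takes the text between the first and second marker, which is empty here.
print(o.split("### Response:")[1].strip())   # -> ""

# [-1] always takes the text after the last marker.
print(o.split("### Response:")[-1].strip())  # -> "true"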

lucasliunju commented 2 months ago

I find the output can be:

output: ['Below is an instruction that describes a task. Write a response that appropriately completes the request. \n\n ### Instruction:\n Please answer the following question with true or false, question: is the enchanted forest in oregon still open?\n\nAnswer format: true/false\n\n ### Response:\n 20000000000000000000000000000000']

That causes the prediction to be empty.

wutaiqiang commented 2 months ago

> I find the output can be:
>
> output: ['Below is an instruction that describes a task. Write a response that appropriately completes the request. \n\n ### Instruction:\n Please answer the following question with true or false, question: is the enchanted forest in oregon still open?\n\nAnswer format: true/false\n\n ### Response:\n 20000000000000000000000000000000']
>
> That causes the prediction to be empty.

I ran into the same issue, thanks for your kind response~

lucasliunju commented 2 months ago

Hi @wutaiqiang

I find the main reason is that the saved model is not the best model. The code uses the validation loss to pick the checkpoint that is saved for evaluation. But there is a validation-loss spike at around 21k steps, and sometimes the final validation loss is still larger than the loss at 21k. In that case the 21k-step checkpoint is the one used for evaluation, and I think that model is not fully trained. If the final validation loss ends up smaller than the loss at 21k, the fully trained model (around 31k steps) is saved instead, and I find the test accuracy is better.

lucasliunju commented 2 months ago

BTW, I guess that is the main reason.

HZQ950419 commented 2 months ago

> Hi @wutaiqiang
>
> I find the main reason is that the saved model is not the best model. The code uses the validation loss to pick the checkpoint that is saved for evaluation. But there is a validation-loss spike at around 21k steps, and sometimes the final validation loss is still larger than the loss at 21k. In that case the 21k-step checkpoint is the one used for evaluation, and I think that model is not fully trained. If the final validation loss ends up smaller than the loss at 21k, the fully trained model (around 31k steps) is saved instead, and I find the test accuracy is better.

Hi @lucasliunju, to avoid this issue you can set val_set_size to 0. The code will then save the fully trained model.
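For reference, a rough sketch of the logic this suggestion relies on, assuming the training script wires val_set_size into the Hugging Face Trainer roughly the way the alpaca-lora-style scripts do (the argument names below are illustrative, not a verbatim copy of finetune.py):

from transformers import TrainingArguments

val_set_size = 0                 # no validation split
use_validation = val_set_size > 0

training_args = TrainingArguments(
    output_dir="./trained_models",
    evaluation_strategy="steps" if use_validation else "no",
    save_strategy="steps",
    # With val_set_size=0 there is no "best" checkpoint by validation loss,
    # so the last (fully trained) checkpoint is the one that gets kept.
    load_best_model_at_end=use_validation,
)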

lucasliunju commented 2 months ago

Hi @HZQ950419 Thank you very much for your reply. May I ask how to use the commonsense_15k dataset, and whether we need to tune some hyper-parameters?

Best

HZQ950419 commented 2 months ago

> Hi @HZQ950419 Thank you very much for your reply. May I ask how to use the commonsense_15k dataset, and whether we need to tune some hyper-parameters?
>
> Best

The commonsense_15k dataset is a subset of commonsense_170k that was used for debugging. If you want to use it, the procedure is the same as for commonsense_170k, but the performance may not be as good.

lucasliunju commented 2 months ago

Thank you very much for your answer!

lucasliunju commented 2 months ago

> For full fine-tuning, you may refer to https://github.com/HZQ950419/LLM-Adapters/blob/main/full_finetune.py.

Hi @HZQ950419 May I ask which hyper-parameters (such as the Adafactor learning rate) to use when running full fine-tuning with this codebase?

lucasliunju commented 2 months ago

> I find the output can be: output: ['Below is an instruction that describes a task. Write a response that appropriately completes the request. \n\n ### Instruction:\n Please answer the following question with true or false, question: is the enchanted forest in oregon still open?\n\nAnswer format: true/false\n\n ### Response:\n 20000000000000000000000000000000'] That causes the prediction to be empty.
>
> I ran into the same issue, thanks for your kind response~

Hi @wutaiqiang I would like to ask: did you find a way to solve this issue? I find that even if I save the fully trained model (the last checkpoint), the output can still be empty, like the result above, and the test accuracy is only about 0.5.