Hi, I find that full-parameter fine-tuning is not stable with torch.float16. If I change to float32 and tune the learning rate, the performance is good.
Hi, for full fine-tuning, changing fp16 training to bf16 training should also work.
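(For reference, a minimal sketch of that change, assuming finetune.py loads the model and configures the Hugging Face Trainer the way the snippet quoted later in this thread does; names like base_model come from that snippet:)

import os
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

base_model = "path-or-hub-id-of-the-base-model"  # placeholder

# Load the weights in bf16 instead of fp16.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=False,
    torch_dtype=torch.bfloat16,  # was torch.float16
    device_map={"": int(os.environ.get("LOCAL_RANK") or 0)},
    trust_remote_code=True,
)

# In the TrainingArguments, replace fp16=True with bf16=True (assumption: the
# script trains with the Hugging Face Trainer, as alpaca-lora-style scripts do).
training_args = TrainingArguments(output_dir="./output", bf16=True)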
Hi @HZQ950419
Thanks for your reply. May I ask how to set "adapter_name" for full-parameter fine-tuning?
Thanks
For full fine-tuning, you may refer to https://github.com/HZQ950419/LLM-Adapters/blob/main/full_finetune.py.
Hi @HZQ950419 Thanks for your reply. May I ask another question? I tried to change "torch.float16" to "torch.float32" in finetune.py:
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=False,
    torch_dtype=torch.float16,
    device_map={"": int(os.environ.get("LOCAL_RANK") or 0)},
    trust_remote_code=True,
)
But I find the accuracy is lower than the result with torch.float16. Do you think I still need to tune the hyperparameters?
Best
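(For anyone following along, a sketch of the float32 variant being described; it mirrors the snippet above with only the dtype changed, and assumes mixed-precision fp16 is also disabled in the Trainer arguments:)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=False,
    torch_dtype=torch.float32,  # changed from torch.float16
    device_map={"": int(os.environ.get("LOCAL_RANK") or 0)},
    trust_remote_code=True,
)
# Assumption: with float32 weights, fp16=True should also be dropped from the
# TrainingArguments, otherwise the Trainer still runs fp16 mixed precision.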
Hi @HZQ950419 When I evaluate on the boolq task, I find that sometimes the prediction is none, which makes the test accuracy very low. When I change the code from outputs = [o.split("### Response:")[1].strip() for o in outputs]
to outputs = [o.split("### Response:")[-1].strip() for o in outputs]
it works well.
I find the output can be:
output: ['Below is an instruction that describes a task. Write a response that appropriately completes the request. \n\n ### Instruction:\n Please answer the following question with true or false, question: is the enchanted forest in oregon still open?\n\nAnswer format: true/false\n\n ### Response:\n 20000000000000000000000000000000']
That causes the prediction to be none.
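(A small illustrative example, not taken from the repo, of why indexing with [-1] is more robust than [1] when splitting on the response marker:)

# If the model echoes the prompt, "### Response:" can appear more than once;
# if the model never emits it, split() returns a single piece.
good = "### Instruction: ...\n### Response:\n true"
echoed = "### Response:\n ### Instruction: ...\n### Response:\n true"
missing = "the model generated no response marker at all"

print(good.split("### Response:")[1].strip())      # "true"
print(echoed.split("### Response:")[-1].strip())   # "true" ([1] would grab the echoed prompt)
print(missing.split("### Response:")[-1].strip())  # whole string; [1] would raise IndexError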
I met the same issue, thanks for your kind response~
Hi @wutaiqiang
I find the main reason is that the saved model is not the best model. The code uses validation loss to save the best model for evaluation, but there is a validation-loss spike at around 21k steps, and sometimes the final validation loss is still larger than the loss at 21k. That causes the 21k-step checkpoint to be used for evaluation, and I think that model is not fully trained. If the final validation loss is smaller than the loss at 21k, a fully trained model (around 31k steps) is saved instead, and I find the test accuracy is better.
BTW, I guess that is the main reason.
Hi @lucasliunju , to avoid this issue, you can set the val_set_size to 0. The code should save the fully trained model.
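(A rough sketch of why val_set_size=0 has that effect, following the alpaca-lora-style logic that finetune.py appears to use; the exact parameter names are assumptions, so check the script:)

from transformers import TrainingArguments

val_set_size = 0  # assumption: finetune.py exposes this as a parameter

# With no validation split there is no "best" checkpoint to restore, so the
# weights saved at the end are the fully trained ones from the final step.
training_args = TrainingArguments(
    output_dir="./output",
    evaluation_strategy="steps" if val_set_size > 0 else "no",
    load_best_model_at_end=True if val_set_size > 0 else False,
)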
Hi @HZQ950419 Thank you very much for your reply. May I ask how to use the commonsense_15k dataset, and whether we need to tune some hyper-parameters?
Best
The commonsense_15k dataset is a subset of commonsense_170k that was used for debugging. If you want to use it, the procedure is the same as for commonsense_170k, but the performance may not be as good.
Thank you very much for your answer!
Hi @HZQ950419 May I ask what hyper-parameters (such as the learning rate for Adafactor) to use when running this codebase for full fine-tuning?
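(Not an answer from the authors, just a sketch of where those hyper-parameters would go if the script uses the Hugging Face Trainer; the values below are placeholders, not recommendations:)

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    optim="adafactor",   # Adafactor optimizer built into the Trainer
    learning_rate=1e-5,  # placeholder; the thread does not give a known-good value
    warmup_steps=100,    # placeholder
    num_train_epochs=3,  # placeholder
)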
Hi @wutaiqiang I would like to ask: did you find how to solve this issue? Even if I save the fully trained model (the last checkpoint), the output can still be none, like the result above, and the test accuracy is about 0.5.
Hi, may I ask how to conduct the full-parameter fine-tuning experiments on commonsense with LLM-Adapters?