Hi, thank you for your interest in the paper!
There are a few points that might assist you:
1. Have you installed the environment using `install_alma.sh`? Please check the versions of your `transformers`, `deepspeed`, `accelerate`, and `peft` packages (a version-check sketch follows this list).
2. If all packages were installed with `install_alma.sh`, consider reproducing the results with our released model to verify that you can achieve the same outcomes.
3. Ensure that you are evaluating your best model, i.e., the one with the lowest validation loss. If you have set `--save_total_limit 1`, the best saved model will be at `your_out_dir/checkpoint-{some_step}`. Confirm that you are evaluating this checkpoint.
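For the first point, here is a minimal sketch of one way to check the installed versions (only the package names listed above are assumed):

```python
# Minimal sketch: print the installed versions of the packages mentioned above.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["transformers", "deepspeed", "accelerate", "peft"]:
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```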
Let me know if you have further questions!
Thank you!
My environment was set up using `install_alma.sh`; my package versions are as follows:
transformers 4.34.1
deepspeed 0.10.0
accelerate 0.21.0
peft 0.4.0
> Reproduce using our released model

I am running this model now.

> Get the best checkpoint

I think this is the problem: I trained the LoRA model and generated the translations with `runs/parallel_ft_lora.sh`, which may run prediction by loading the last model directly. I am trying to specify `output_dir/checkpoint-{step}` to regenerate the translations.
I also see three `adapter_model.bin` files: one in `output_dir/`, one in `output_dir/checkpoint-{step}/`, and the last in `output_dir/checkpoint-{step}/adapter_model/`. Are they the same?
I will report my results when I finish my experiment. And thanks for answering my question.
You may want to use `output_dir/checkpoint-{step}/adapter_model/` or `output_dir/checkpoint-{step}`.

The `transformers` version could also be a minor issue, since the LLaMA implementation has changed a lot. You may want to roll back to `4.30.0.dev0` via `pip install git+https://github.com/fe1ixxu/ALMA.git@hf-install`. This version can reproduce exactly the same results as we reported in the paper.
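As an illustration, here is a minimal sketch of loading a checkpointed LoRA adapter on top of a base model with `peft`. The base model name and the checkpoint path are placeholders for your own setup, not the exact arguments used by the repo's scripts:

```python
# Minimal sketch: load a base model and apply a saved LoRA adapter with peft.
# The model name and checkpoint path below are placeholders; adjust them to
# your own base model and output directory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "haoranxu/ALMA-7B-Pretrain"              # placeholder base model
adapter_path = "output_dir/checkpoint-1000/adapter_model"  # placeholder checkpoint dir

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float16)

# Wrap the base model with the LoRA weights saved at the checkpoint.
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()
```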
I have changed the `transformers` version to `4.30.0.dev0`, but the same hyperparameters still did not produce a better score. Using the released LoRA model, I can exactly reproduce the scores in the paper.
I tried a set of hyperparameters and finally got the following best average scores:
- xx-en: 33.97 BLEU, 84.11 COMET
- en-xx: 29.59 BLEU, 86.41 COMET
Thanks for answering my question. And great work!
Hi, thanks for this great project. When I tried to reproduce the ALMA-7B-LoRA results from the paper, some problems arose. I used `human_written_data` to fine-tune the pretrained ALMA-7B-Pretrain model with LoRA following the paper's setting (2 epochs): my en-xx average COMET is 81.38 versus 86.37 in the paper, and my en-zh COMET is 62.58 versus 84.87 in the paper.
When I follow this repo's setting (1 epoch), the results look normal but are still about 1 point lower than the paper's BLEU and COMET:

| Direction | Paper BLEU | Paper COMET | My BLEU | My COMET |
|-----------|------------|-------------|---------|----------|
| xx-en     | 34.31      | 84.12       | 33.14   | 83.88    |
| en-xx     | 29.78      | 86.37       | 28.46   | 85.96    |
In my experiment, I changed the following settings to fit my device (a single 40GB A100) while keeping the batch size at 256 (`per_device_batch_size` is 4 by default): `--gradient_accumulation_steps 64` in `parallel_ft_lora.sh`, and `gradient_accumulation_steps: 64`, `num_processes: 1` in `deepspeed_train_config.yaml`. Is there some problem with my settings? (A quick sanity check of the effective batch size is sketched below.)
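For reference, a quick sanity check of the effective batch size implied by these settings, using only the values quoted above:

```python
# Effective batch size = per-device batch size * gradient accumulation steps * number of processes.
per_device_batch_size = 4         # repo default mentioned above
gradient_accumulation_steps = 64  # value set in parallel_ft_lora.sh / deepspeed_train_config.yaml
num_processes = 1                 # a single 40GB A100
print(per_device_batch_size * gradient_accumulation_steps * num_processes)  # -> 256
```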