e-tornike opened this issue 2 months ago
Hi,
Thanks for reaching out. It seems you are not passing in a prompting template, which means the model receives the raw input with no template; that could be a major reason for the discrepancy.
We provide an example template for the Llama 3 series in the latest push, so try again with `--chat_template llama3`.
Hey,
thanks for your response.
As I understand it, the template is actually passed as an argument to the `--include_path` flag (see here). This is what my code above passes directly to `lm_eval`.
Your recent update #3 now adds the `--apply_chat_template` flag.
I see. The `general` template in the repo basically means no template: it directly dumps the dataset input to the model, as shown here. You might want to include a Llama template or try `apply_chat_template`.
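For reference, a minimal sketch of what that could look like with the `lm_eval` CLI, assuming a recent lm-evaluation-harness with the vLLM backend (the task name and template directory below are placeholders, not the repo's exact invocation):

```bash
# Minimal sketch, not the exact command from this repo: the task name and
# the --include_path directory are placeholders.
TASKS="your_sciriff_task_here"           # placeholder task name
TEMPLATE_DIR="path/to/task/yaml/configs" # placeholder config directory

lm_eval \
  --model vllm \
  --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
  --tasks "$TASKS" \
  --include_path "$TEMPLATE_DIR" \
  --apply_chat_template \
  --batch_size auto \
  --output_path results/llama3_chat_template
```

With `--apply_chat_template`, the harness wraps each prompt in the tokenizer's own chat format instead of sending the raw `general`-style input.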
True, but the real question I am asking is whether the numbers in the preprint (https://arxiv.org/abs/2406.07835) used no template (i.e., `general`), as described in the README, or a template specific to the underlying model. Do you know which is the case?
@e-tornike Thanks for pointing this out. To answer your question: we did not use any template for the Llama models in the preprint (which is not correct). Our reported results were misaligned with yours because we did not set the EOS token correctly for non-Tulu finetuned models, so the model would keep generating and repeating itself until hitting the max token limit. Sometimes the model generated several versions of the answer, which mistakenly pushed the score higher after parsing. Your results without a template are correct (I reproduced the same scores without templates). The right way to run these experiments is with templates.
Here are our new scores on Llama2-7B and Llama3-8B with the corresponding templates for reference (the Tulu models in the preprint are fine):

| Task | bioasq | biored | discomat | evidence_inference | multicite | mup | qasper | qasper | scierc | scifact | scifact | Mean | Median |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Metric | f1 | f1 | bleu | f1_overlap | f1 | lm_judge_reference | lm_judge_answer | f1_evidence | f1 | f1_label | f1_evidence_token | | |
| Llama-2-7b-chat-hf | 34.48 | 19.78 | 35.57 | 13.32 | 27.27 | 77.25 | 7.94 | 2.29 | 6.88 | 50.41 | 31.68 | 27.90 | 27.27 |
| Meta-Llama-3-8B-Instruct | 44.28 | 47.06 | 59.47 | 0.15 | 50.08 | 85.50 | 55.14 | 41.19 | 28.76 | 68.36 | 53.36 | 48.49 | 50.08 |
This EOS issue won't happen in the recent push with the new `lm-eval` dependency. In the latest version of the paper we have already remade this table, and we will update the preprint soon.
Thanks again for pointing this out. Let me know if you have further questions!
Hey there,
firstly, thanks for the nice work!
I am attempting to reproduce the results from the paper. I re-ran the experiments with 10 seeds (averaging the results). However, I can only reproduce the numbers for 5 of the 7 tasks that do not require an LLM judge.
My results are the following:
I am uncertain why the reproduced results for SciERC and SciFact differ from the original ones. Do you know what could be the cause?
There is a slight change in the `--model_args` due to memory issues: I added `gpu_memory_utilization` or `max_model_len` and removed `tensor_parallel_size`. I am running the following command:

And I am using the following seeds: 42, 1337, 9876, 12345, 999999, 98765, 5555, 2024, 267, 10.
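For illustration only, a per-seed invocation along these lines matches the setup described above; the task names, paths, and the specific `gpu_memory_utilization`/`max_model_len` values here are placeholders, not the exact command:

```bash
# Sketch only: task names, paths, and the memory settings are placeholders;
# the seeds are the ones listed above.
TASKS="your_sciriff_tasks_here"          # placeholder task list
TEMPLATE_DIR="path/to/task/yaml/configs" # placeholder --include_path dir

for seed in 42 1337 9876 12345 999999 98765 5555 2024 267 10; do
  lm_eval \
    --model vllm \
    --model_args "pretrained=meta-llama/Meta-Llama-3-8B-Instruct,gpu_memory_utilization=0.85,max_model_len=4096" \
    --tasks "$TASKS" \
    --include_path "$TEMPLATE_DIR" \
    --seed "$seed" \
    --batch_size auto \
    --output_path "results/seed_${seed}"
done
```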
I am running experiments on a single RTX A6000 (48 GB) using CUDA 12.4, driver version 550.90.07, and Python 3.10.13 with the following versions of the packages: