allenai / SciRIFF

Dataset and evaluation suite enabling LLM instruction-following for scientific literature understanding.
Apache License 2.0

Reproducing Llama results from the paper #2

Open e-tornike opened 2 months ago

e-tornike commented 2 months ago

Hey there,

Firstly, thanks for the nice work!

I am attempting to reproduce the results from the paper. I re-ran the experiments with 10 seeds and averaged the results. However, I can only reproduce the numbers for 5 of the 7 tasks that do not require an LLM judge.

My results are the following:

| Model | BioASQ F1 | BioRED F1 | DiscMT BLEU | EvInf "Fuzzy" F1 | MultiCite F1 | SciERC F1 | SciFact F1-Label | SciFact F1-Token |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3-8B-Inst. (original) | 43.3 | 40.3 | 37.3 | 13.5 | 37.9 | 25.4 | 42.3 | 40.1 |
| Llama-3-8B-Inst. (reproduced) | 43.1 ± 1.5 | 40.1 ± 1.1 | 36.8 ± 2.6 | 15.2 ± 1.7 | 34.9 ± 8.2 | 13.2 ± 4.6 | 22.0 ± 7.7 | 20.6 ± 7.2 |

I am uncertain why the reproduced results for SciERC and SciFact differ from the original ones. Do you know what could be the cause of this?

There is a slight change in the --model_args due to memory constraints: I added gpu_memory_utilization and max_model_len and removed tensor_parallel_size. I am running the following command:

python -m lm_eval \
  --include_path ./sciriff/eval/eleuther_templates/general \
  --model vllm \
  --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=float16,gpu_memory_utilization=0.85,max_model_len=5120 \
  --gen_kwargs max_gen_toks=1024 \
  --tasks bioasq_list_qa,biored_ner,discomat_te,evidence_inference,multicite_intent_classification,scierc_ner,scifact_entailment \
  --batch_size auto \
  --output_path results/ \
  --seed 42 \
  --predict_only \
  --log_samples

And I am using the following seeds: 42, 1337, 9876, 12345, 999999, 98765, 5555, 2024, 267, 10.
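
Concretely, each per-seed run is just the command above with a different --seed; a rough sketch of the wrapper I use is below (the per-seed output directory layout is only illustrative):

```bash
# Illustrative wrapper around the command above: one run per seed,
# writing each run to its own output directory.
for seed in 42 1337 9876 12345 999999 98765 5555 2024 267 10; do
  python -m lm_eval \
    --include_path ./sciriff/eval/eleuther_templates/general \
    --model vllm \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=float16,gpu_memory_utilization=0.85,max_model_len=5120 \
    --gen_kwargs max_gen_toks=1024 \
    --tasks bioasq_list_qa,biored_ner,discomat_te,evidence_inference,multicite_intent_classification,scierc_ner,scifact_entailment \
    --batch_size auto \
    --output_path "results/seed_${seed}/" \
    --seed "$seed" \
    --predict_only \
    --log_samples
done
```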

I am running the experiments on a single RTX A6000 (48 GB) using CUDA 12.4, driver version 550.90.07, and Python 3.10.13, with the following package versions (a rough pip recreation follows the list):

huggingface-hub==0.24.5
jinja2==3.1.4
jsonschema==4.23.0
lm-eval (installed from https://github.com/EleutherAI/lm-evaluation-harness.git@e74ec966556253fbe3d8ecba9de675c77c075bce)
nltk==3.8.1
openai==1.37.2
pandas==2.2.2
pyyaml==6.0.1
rouge_score==0.1.2
spacy==3.7.5
vllm==0.5.4
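
The environment can be recreated roughly as follows (the exact pip invocation is illustrative):

```bash
# Illustrative: recreating the environment above in a fresh Python 3.10 virtualenv.
pip install \
  huggingface-hub==0.24.5 jinja2==3.1.4 jsonschema==4.23.0 \
  nltk==3.8.1 openai==1.37.2 pandas==2.2.2 pyyaml==6.0.1 \
  rouge_score==0.1.2 spacy==3.7.5 vllm==0.5.4
# lm-evaluation-harness pinned to the commit listed above:
pip install "git+https://github.com/EleutherAI/lm-evaluation-harness.git@e74ec966556253fbe3d8ecba9de675c77c075bce"
```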
lihaoxin2020 commented 1 month ago

Hi,

Thanks for reaching out. It seems you are not passing in a prompt template, which effectively means no template, and that could be a major reason for the discrepancy.

We provide an example template for the Llama 3 series in the latest push, so please try again with --chat_template llama3.

e-tornike commented 1 month ago

Hey,

thanks for your response.

As I understand it, the template is what gets passed via the --include_path flag (see here), and my command above passes that directly to lm_eval.

Your recent update #3 now adds the --apply_chat_template flag.

lihaoxin2020 commented 1 month ago

I see. The general template in the repo basically means no template: it dumps the dataset input directly to the model, as shown here. You might want to include a Llama template or try --apply_chat_template, as in the sketch below.
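
For example, the same command as above with the chat template applied (just a sketch; the exact flags and whether a different template include path is needed depend on the latest push):

```bash
# Sketch only: the reproduction command with --apply_chat_template added (the flag from #3);
# everything else is unchanged from the command posted above.
python -m lm_eval \
  --include_path ./sciriff/eval/eleuther_templates/general \
  --model vllm \
  --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=float16,gpu_memory_utilization=0.85,max_model_len=5120 \
  --gen_kwargs max_gen_toks=1024 \
  --tasks bioasq_list_qa,biored_ner,discomat_te,evidence_inference,multicite_intent_classification,scierc_ner,scifact_entailment \
  --batch_size auto \
  --output_path results/ \
  --seed 42 \
  --apply_chat_template \
  --predict_only \
  --log_samples
```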

e-tornike commented 1 month ago

True, but the real question I am asking is whether the numbers in the preprint (https://arxiv.org/abs/2406.07835) were produced with no template (i.e., general), as described in the README, or with a template specific to the underlying model. Do you know which is the case?

lihaoxin2020 commented 1 month ago

@e-tornike Thanks for pointing this out. To answer your question: we did not use any template for the Llama models in the preprint, which was not correct. Our reported results are misaligned with yours because we did not set the EOS token correctly for the non-Tulu finetuned models, so the model would keep generating and repeating itself until hitting the max token limit. Sometimes the model generates several versions of the answer, which mistakenly pushes the score higher after parsing. Your results without a template are correct (I reproduced the same scores without templates). The right way to run these experiments is with templates.
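
As a quick illustrative sanity check (not part of our pipeline), one can confirm that generation stops on Llama 3's <|eot_id|> turn delimiter rather than only on the tokenizer's default EOS:

```bash
# Illustrative check: Llama-3-Instruct ends assistant turns with <|eot_id|>;
# if generation only stops on the default EOS token, the model can keep going
# until it hits max_gen_toks.
python -c "
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct')
print('eos_token:', tok.eos_token, tok.eos_token_id)
print('<|eot_id|> id:', tok.convert_tokens_to_ids('<|eot_id|>'))
"
```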

Here are our new scores on Llama2-7B and Llama3-8B with the corresponding templates, for reference (the Tulu models in the preprint are fine):

| Model | bioasq f1 | biored f1 | discomat bleu | evidence_inference f1_overlap | multicite f1 | mup lm_judge_reference | qasper lm_judge_answer | qasper f1_evidence | scierc f1 | scifact f1_label | scifact f1_evidence_token | Mean | Median |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7b-chat-hf | 34.48 | 19.78 | 35.57 | 13.32 | 27.27 | 77.25 | 7.94 | 2.29 | 6.88 | 50.41 | 31.68 | 27.90 | 27.27 |
| Meta-Llama-3-8B-Instruct | 44.28 | 47.06 | 59.47 | 0.15 | 50.08 | 85.50 | 55.14 | 41.19 | 28.76 | 68.36 | 53.36 | 48.49 | 50.08 |

This EOS issue no longer occurs in the recent push with the new lm-eval dependency. In the latest version of the paper we have already remade this table, and we will update the preprint soon.

Thanks again for pointing this out. Let me know if you have further questions!