allenai / SciRIFF

Dataset and evaluation suite enabling LLM instruction-following for scientific literature understanding.
Apache License 2.0

Reproducing Llama results from the paper #2

Open e-tornike opened 2 months ago

e-tornike commented 2 months ago

Hey there,

Firstly, thanks for the nice work!

I am attempting to reproduce the results from the paper. I re-ran the experiments with 10 seeds and averaged the results. However, I can only reproduce the numbers for 5 of the 7 tasks that do not require an LLM judge.

My results are the following:

| Model | BioASQ F1 | BioRED F1 | DiscMT BLEU | EvInf "Fuzzy" F1 | MultiCite F1 | SciERC F1 | SciFact F1-Label | SciFact F1-Token |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3-8B-Inst. (original) | 43.3 | 40.3 | 37.3 | 13.5 | 37.9 | 25.4 | 42.3 | 40.1 |
| Llama-3-8B-Inst. (reproduced) | 43.1 ± 1.5 | 40.1 ± 1.1 | 36.8 ± 2.6 | 15.2 ± 1.7 | 34.9 ± 8.2 | 13.2 ± 4.6 | 22.0 ± 7.7 | 20.6 ± 7.2 |

I am uncertain why the reproduced results for SciERC and SciFact differ from the original ones. Do you know what could be the cause of this?

There is a slight change in the --model_args due to memory constraints: I added gpu_memory_utilization and max_model_len and removed tensor_parallel_size. I am running the following command:

python -m lm_eval \
  --include_path ./sciriff/eval/eleuther_templates/general \
  --model vllm \
  --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=float16,gpu_memory_utilization=0.85,max_model_len=5120 \
  --gen_kwargs max_gen_toks=1024 \
  --tasks bioasq_list_qa,biored_ner,discomat_te,evidence_inference,multicite_intent_classification,scierc_ner,scifact_entailment \
  --batch_size auto \
  --output_path results/ \
  --seed 42 \
  --predict_only \
  --log_samples

And I am using the following seeds: 42, 1337, 9876, 12345, 999999, 98765, 5555, 2024, 267, 10.
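
Concretely, each per-seed run is just the command above with a different --seed; a rough sketch of the wrapper I use is below (the per-seed output directory layout is only illustrative):

```bash
# Illustrative wrapper around the command above: one run per seed,
# writing each run to its own output directory.
for seed in 42 1337 9876 12345 999999 98765 5555 2024 267 10; do
  python -m lm_eval \
    --include_path ./sciriff/eval/eleuther_templates/general \
    --model vllm \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=float16,gpu_memory_utilization=0.85,max_model_len=5120 \
    --gen_kwargs max_gen_toks=1024 \
    --tasks bioasq_list_qa,biored_ner,discomat_te,evidence_inference,multicite_intent_classification,scierc_ner,scifact_entailment \
    --batch_size auto \
    --output_path "results/seed_${seed}/" \
    --seed "$seed" \
    --predict_only \
    --log_samples
done
```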

I am running the experiments on a single RTX A6000 (48 GB) using CUDA 12.4, driver version 550.90.07, and Python 3.10.13, with the following package versions (a rough pip recreation follows the list):

huggingface-hub==0.24.5
jinja2==3.1.4
jsonschema==4.23.0
lm-eval (installed from https://github.com/EleutherAI/lm-evaluation-harness.git@e74ec966556253fbe3d8ecba9de675c77c075bce)
nltk==3.8.1
openai==1.37.2
pandas==2.2.2
pyyaml==6.0.1
rouge_score==0.1.2
spacy==3.7.5
vllm==0.5.4
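
The environment can be recreated roughly as follows (the exact pip invocation is illustrative):

```bash
# Illustrative: recreating the environment above in a fresh Python 3.10 virtualenv.
pip install \
  huggingface-hub==0.24.5 jinja2==3.1.4 jsonschema==4.23.0 \
  nltk==3.8.1 openai==1.37.2 pandas==2.2.2 pyyaml==6.0.1 \
  rouge_score==0.1.2 spacy==3.7.5 vllm==0.5.4
# lm-evaluation-harness pinned to the commit listed above:
pip install "git+https://github.com/EleutherAI/lm-evaluation-harness.git@e74ec966556253fbe3d8ecba9de675c77c075bce"
```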
lihaoxin2020 commented 1 month ago

Hi,

Thanks for reaching out. It seems you are not passing in a prompt template, which effectively means no template, and that could be a major reason for the discrepancy.

We provide an example template for the Llama 3 series in the latest push, so please try again with --chat_template llama3.

e-tornike commented 1 month ago

Hey,

thanks for your response.

As I understand it, the template is what gets passed via the --include_path flag (see here), and my command above passes that directly to lm_eval.

Your recent update #3 now adds the --apply_chat_template flag.

lihaoxin2020 commented 1 month ago

I see. The general template in the repo basically means no template: it dumps the dataset input directly to the model, as shown here. You might want to include a Llama template or try --apply_chat_template, as in the sketch below.
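
For example, the same command as above with the chat template applied (just a sketch; the exact flags and whether a different template include path is needed depend on the latest push):

```bash
# Sketch only: the reproduction command with --apply_chat_template added (the flag from #3);
# everything else is unchanged from the command posted above.
python -m lm_eval \
  --include_path ./sciriff/eval/eleuther_templates/general \
  --model vllm \
  --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=float16,gpu_memory_utilization=0.85,max_model_len=5120 \
  --gen_kwargs max_gen_toks=1024 \
  --tasks bioasq_list_qa,biored_ner,discomat_te,evidence_inference,multicite_intent_classification,scierc_ner,scifact_entailment \
  --batch_size auto \
  --output_path results/ \
  --seed 42 \
  --apply_chat_template \
  --predict_only \
  --log_samples
```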

e-tornike commented 1 month ago

True, but the real question I am asking is whether the numbers in the preprint (https://arxiv.org/abs/2406.07835) were produced with no template (i.e., general), as described in the README, or with a template specific to the underlying model. Do you know which is the case?

lihaoxin2020 commented 1 month ago

@e-tornike Thanks for pointing this out. To answer your question: we did not use any template for the Llama models in the preprint, which was not correct. Our reported results are misaligned with yours because we did not set the EOS token correctly for the non-Tulu finetuned models, so the model would keep generating and repeating itself until hitting the max token limit. Sometimes the model generates several versions of the answer, which mistakenly pushes the score higher after parsing. Your results without a template are correct (I reproduced the same scores without templates). The right way to run these experiments is with templates.
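
As a quick illustrative sanity check (not part of our pipeline), one can confirm that generation stops on Llama 3's <|eot_id|> turn delimiter rather than only on the tokenizer's default EOS:

```bash
# Illustrative check: Llama-3-Instruct ends assistant turns with <|eot_id|>;
# if generation only stops on the default EOS token, the model can keep going
# until it hits max_gen_toks.
python -c "
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct')
print('eos_token:', tok.eos_token, tok.eos_token_id)
print('<|eot_id|> id:', tok.convert_tokens_to_ids('<|eot_id|>'))
"
```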

Here are our new scores on Llama2-7B and Llama3-8B with the corresponding templates, for reference (the Tulu models in the preprint are fine):

| Model | bioasq f1 | biored f1 | discomat bleu | evidence_inference f1_overlap | multicite f1 | mup lm_judge_reference | qasper lm_judge_answer | qasper f1_evidence | scierc f1 | scifact f1_label | scifact f1_evidence_token | Mean | Median |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7b-chat-hf | 34.48 | 19.78 | 35.57 | 13.32 | 27.27 | 77.25 | 7.94 | 2.29 | 6.88 | 50.41 | 31.68 | 27.90 | 27.27 |
| Meta-Llama-3-8B-Instruct | 44.28 | 47.06 | 59.47 | 0.15 | 50.08 | 85.50 | 55.14 | 41.19 | 28.76 | 68.36 | 53.36 | 48.49 | 50.08 |

This EOS issue no longer occurs in the recent push with the new lm-eval dependency. In the latest version of the paper we have already remade this table, and we will update the preprint soon.

Thanks again for pointing this out. Let me know if you have further questions!