bigscience-workshop / multilingual-modeling

BLOOM+1: Adapting BLOOM model to support a new unseen language
https://arxiv.org/abs/2212.09535
Apache License 2.0

Add XLSum evaluation / unify eval script #12

Open haileyschoelkopf opened 2 years ago

haileyschoelkopf commented 2 years ago

Submitting a PR from a fork because I may not have edit access to this repo.

In this PR: added adapters_eval.py, a script that can be used to evaluate on either XLSum or XNLI based on the 'dataset' flag. I'm also working on adding DeepSpeed compatibility via the Hugging Face Trainer / command line.

TODO/needs checking:

yongzx commented 2 years ago

Thanks Hailey!

(Referring to #11) Will resolve this PR once Vassilina and I have finalized our XNLI evaluation script. Apologies for the delay.

haileyschoelkopf commented 2 years ago

The remaining TODOs for this script are:

yongzx commented 2 years ago

Apologies for reviewing this PR late. I have made some comments, but in the end I think I will create another PR based on your committed files and request Vassilina's and your review again.

Please don't push any changes if that's okay.

Edit: Commenting on this PR for the to-dos of integration:

yongzx commented 2 years ago

Which ROUGE metrics should we report? I'm currently reporting the F-measure for all ROUGE metrics, but this can easily be changed if precision and recall are desired. Also, what should we set for max_generation_length in prediction and max_length in tokenization?
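For concreteness, here is a minimal sketch of what "reporting the F-measure" means for ROUGE-1. This is illustrative only: the script presumably uses a ROUGE package (e.g. `rouge_score`, whose score objects expose `precision`, `recall`, and `fmeasure`), not this hand-rolled function.

```python
# Illustrative sketch, not the script's actual implementation:
# unigram-overlap (ROUGE-1) F-measure between a reference and a prediction.
from collections import Counter


def rouge1_fmeasure(reference: str, prediction: str) -> float:
    """F1 over unigram overlap between reference and generated summary."""
    ref_counts = Counter(reference.split())
    pred_counts = Counter(prediction.split())
    # Clipped overlap: each shared token counts at most min(ref, pred) times.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Switching to precision or recall would just mean returning those intermediate values instead of the harmonic mean.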

From the paper, "Due to computational constraints, we used the base model (600M parameters) and had to truncate the inputs to 512 tokens and the outputs to 64 tokens. We used the ROUGE-1, ROUGE-2 and ROUGE-L (Lin, 2004) scores for automatic evaluation. For inference, we used beam search with beam size 4 and length penalty of α = 0.6 (Wu et al., 2016)."
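If we follow the paper's setup, the settings would translate to something like the following. The keyword names below follow the Hugging Face tokenizer and `generate` APIs; the actual flag names in adapters_eval.py may differ, so treat this as a sketch of the values, not of the script.

```python
# Settings implied by the quoted paper setup (names assume the
# Hugging Face tokenizer / model.generate APIs; adjust to the script's
# actual argument names).
TOKENIZER_KWARGS = dict(
    max_length=512,      # inputs truncated to 512 tokens
    truncation=True,
)
GENERATION_KWARGS = dict(
    max_new_tokens=64,   # outputs truncated to 64 tokens
    num_beams=4,         # beam search with beam size 4
    length_penalty=0.6,  # alpha = 0.6 (Wu et al., 2016)
)
```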

yongzx commented 2 years ago

@haileyschoelkopf Can you help review b0a23c5? Thank you! I've tested it, and training and evaluation (on the baseline BLOOM and GPT-2 models) are working. The only minor issue is that the evaluation using model.generate takes quite a long time (even for num_beams = 1).
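One common first step to speed this up is to batch examples before calling model.generate, so the model runs once per batch instead of once per example. A hypothetical helper (the batch size here is an assumption, not a value from this PR):

```python
# Hypothetical batching helper: group examples so model.generate is called
# once per batch rather than once per example. Batch size is an assumption.
from typing import Iterator, List


def batched(examples: List[str], batch_size: int = 8) -> Iterator[List[str]]:
    """Yield fixed-size chunks of the evaluation set."""
    for i in range(0, len(examples), batch_size):
        yield examples[i:i + batch_size]
```

Each chunk would then be tokenized with padding and passed to model.generate in a single call.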

haileyschoelkopf commented 2 years ago

Yes, I can! I might only get to it tomorrow, though.