bigscience-workshop / multilingual-modeling

BLOOM+1: Adapting BLOOM model to support a new unseen language
https://arxiv.org/abs/2212.09535
Apache License 2.0

Add XLSum evaluation / unify eval script #12

Open haileyschoelkopf opened 2 years ago

haileyschoelkopf commented 2 years ago

Submitting a PR from a fork because I may not have edit access to this repo.

In this PR: added adapters_eval.py, a script that can be used to evaluate on either XLSum or XNLI based on the 'dataset' flag. I'm also working on adding DeepSpeed compatibility via the Hugging Face Trainer / command line.

TODO/needs checking:

yongzx commented 2 years ago

Thanks Hailey!

(Referring to #11) Will resolve this PR once Vassilina and I have finalized our XNLI evaluation script. Apologies for the delay.

haileyschoelkopf commented 2 years ago

The remaining TODOs for this script are:

yongzx commented 2 years ago

Apologies for reviewing this PR late. I have made some comments, but in the end I think I will create another PR based on your committed files and request Vassilina's and your review again.

Please don't push any changes if that's okay.

Edit: Commenting on this PR for the to-dos of integration:

yongzx commented 2 years ago

Which ROUGE metrics should we report? I'm currently reporting the F-measure for all ROUGE metrics, but this can easily be changed if precision and recall are desired. Also, what should we set for max_generation_length in prediction and max_length in tokenization?
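For concreteness, here is a minimal sketch of what "reporting the F-measure" means for ROUGE-1. This is illustrative only: the script presumably uses a ROUGE package (e.g. `rouge_score`, whose score objects expose `precision`, `recall`, and `fmeasure`), not this hand-rolled function.

```python
# Illustrative sketch, not the script's actual implementation:
# unigram-overlap (ROUGE-1) F-measure between a reference and a prediction.
from collections import Counter


def rouge1_fmeasure(reference: str, prediction: str) -> float:
    """F1 over unigram overlap between reference and generated summary."""
    ref_counts = Counter(reference.split())
    pred_counts = Counter(prediction.split())
    # Clipped overlap: each shared token counts at most min(ref, pred) times.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Switching to precision or recall would just mean returning those intermediate values instead of the harmonic mean.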

From the paper, "Due to computational constraints, we used the base model (600M parameters) and had to truncate the inputs to 512 tokens and the outputs to 64 tokens. We used the ROUGE-1, ROUGE-2 and ROUGE-L (Lin, 2004) scores for automatic evaluation. For inference, we used beam search with beam size 4 and length penalty of α = 0.6 (Wu et al., 2016)."
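If we follow the paper's setup, the settings would translate to something like the following. The keyword names below follow the Hugging Face tokenizer and `generate` APIs; the actual flag names in adapters_eval.py may differ, so treat this as a sketch of the values, not of the script.

```python
# Settings implied by the quoted paper setup (names assume the
# Hugging Face tokenizer / model.generate APIs; adjust to the script's
# actual argument names).
TOKENIZER_KWARGS = dict(
    max_length=512,      # inputs truncated to 512 tokens
    truncation=True,
)
GENERATION_KWARGS = dict(
    max_new_tokens=64,   # outputs truncated to 64 tokens
    num_beams=4,         # beam search with beam size 4
    length_penalty=0.6,  # alpha = 0.6 (Wu et al., 2016)
)
```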

yongzx commented 2 years ago

@haileyschoelkopf Can you help review b0a23c5? Thank you! I've tested it, and training and evaluation (on the baseline BLOOM and GPT-2 models) are working. The only minor issue is that the evaluation using model.generate takes quite a long time (even for num_beams = 1).
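One common first step to speed this up is to batch examples before calling model.generate, so the model runs once per batch instead of once per example. A hypothetical helper (the batch size here is an assumption, not a value from this PR):

```python
# Hypothetical batching helper: group examples so model.generate is called
# once per batch rather than once per example. Batch size is an assumption.
from typing import Iterator, List


def batched(examples: List[str], batch_size: int = 8) -> Iterator[List[str]]:
    """Yield fixed-size chunks of the evaluation set."""
    for i in range(0, len(examples), batch_size):
        yield examples[i:i + batch_size]
```

Each chunk would then be tokenized with padding and passed to model.generate in a single call.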

haileyschoelkopf commented 2 years ago

Yes, I can! I might only get to it tomorrow, though.