bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

[WIP] Add CodeXGLUE-text-to-text benchmark for documentation translation #20

Closed infinitylogesh closed 1 year ago

infinitylogesh commented 1 year ago

This is a PR to add few-shot examples and fine-tuning code for CodeXGLUE-text-to-text (as per #3).

Few-shot examples were created from the test set of the dataset. Please suggest any changes to the JSON key naming convention or the number of examples included.

Tasks:

cc: @loubnabnl

loubnabnl commented 1 year ago

Thanks for working on this! I think we should take the examples from the training set rather than the test set; otherwise we get data leakage when we evaluate on the same test samples we used for few-shot.

Also, since the benchmark includes translation between 4 languages and English, we could have a dictionary with 4 keys (one per language), where each key maps to a dictionary of two examples from that language to English:

few_shot_examples = {"danish": {"source1": ..., "source2":..., "target1":..., "target2":...}, 
                     "chinese": {"source1": ..., "source2":..., "target1":..., "target2":...},
                     "norwegian": {"source1": ..., "source2":..., "target1":..., "target2":...}, 
                     "latvian": {"source1": ..., "source2":..., "target1":..., "target2":...}}
infinitylogesh commented 1 year ago

Thank you for the feedback @loubnabnl. I have made the suggested changes to the few-shot examples and also added the task to the evaluation (following the guide).

Summary of the changes I made to add the CodeXGLUE text-to-text task to the evaluation code.

I also tested the task with codegen-350M-mono.

Please let me know if I missed anything; any feedback is welcome. Thanks in advance.

loubnabnl commented 1 year ago

The codebase just went through some refactoring; I'm happy to help adapt this PR to the new format, or to take it from here if you prefer. Most of the work is already done; it's just a matter of placing everything in one file instead of separate files.
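
For anyone following along, the single-file layout looks roughly like this; a minimal sketch assuming the harness's Task interface (get_dataset, get_prompt, get_reference, postprocess_generation, process_results), where the import path, dataset id, column names, and prompt wording are assumptions rather than the merged code:

import evaluate
from datasets import load_dataset

from lm_eval.base import Task  # assumed module path in the refactored layout

# Two training-set examples per language pair, as discussed above
# (placeholder strings stand in for the real train samples).
FEW_SHOT = {
    "da_en": {"source1": "<da doc 1>", "target1": "<en doc 1>",
              "source2": "<da doc 2>", "target2": "<en doc 2>"},
    # ... and likewise for zh_en, lv_en, no_en
}

class CodeXGLUETextToText(Task):
    DATASET_PATH = "code_x_glue_tt_text_to_text"  # HF Hub id (assumption)

    def __init__(self, language="da_en"):
        self.language = language
        self.dataset = load_dataset(self.DATASET_PATH, language)
        super().__init__(stop_words=["\n"], requires_execution=False)

    def get_dataset(self):
        # Evaluate on the test split; the few-shot examples come from train.
        return self.dataset["test"]

    def get_prompt(self, doc):
        # Prepend the two train examples, then the sample to translate.
        ex = FEW_SHOT[self.language]
        shots = "".join(
            f"source: {ex[f'source{i}']}\ntranslation: {ex[f'target{i}']}\n\n"
            for i in (1, 2)
        )
        return shots + f"source: {doc['source']}\ntranslation:"

    def get_reference(self, doc):
        return doc["target"]

    def postprocess_generation(self, generation, idx):
        # Drop the prompt and keep only the first generated line.
        prompt = self.get_prompt(self.get_dataset()[idx])
        return generation[len(prompt):].strip().split("\n")[0]

    def process_results(self, generations, references):
        # generations is a list of lists, one inner list per test sample.
        bleu = evaluate.load("bleu")
        return bleu.compute(
            predictions=[gens[0] for gens in generations],
            references=[[ref] for ref in references],
        )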

infinitylogesh commented 1 year ago

@loubnabnl, I have now updated the task code based on the refactoring (thank you for the very clear guide, it helped). I tested it with codegen-350M-mono on the task. Please let me know if I missed anything; any feedback is welcome.

Please find the answers to the questions you had previously asked below.

  • All the few-shot examples are from the training set, right?

Yes, the examples were taken from the training set.

  • The Latvian examples seem to have the same odd pattern with English/config text at the beginning. I saw that it is common for some samples in this language, so maybe we can keep one for generalization and replace the other with a cleaner example, what do you think? We can compare the evaluation scores in the two settings and see if it helps.

Thank you for the suggestion, it makes sense. I have updated the few-shot example.

  • Do you know if some model was evaluated on this benchmark (even with fine-tuning) so we can use it as a reference?

I am not aware of any code generation models evaluated on this benchmark, but the paper associated with the dataset reports results for multilingual encoder-decoder models: an NMT baseline and a variant of NMT with the encoder initialized from XLM-RoBERTa (linked here).

Sorry I was not able to work on this in the past weeks and respond sooner.

infinitylogesh commented 1 year ago

Thank you for the suggestions. I have updated the code. The results I got from codegen-350M-mono are below:

Task   BLEU
zh_en  4.38%
lv_en  2.4%
da_en  0.69%
no_en  0.475%

Generation params used: temperature 0.6, max_length 320.
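
For anyone who wants to reproduce this setup, a minimal sketch using transformers and the evaluate package (the Danish prompt and English reference below are illustrative only, and the harness's exact decoding and BLEU implementation may differ):

import evaluate
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

# Illustrative prompt; in practice this would be the two-shot prompt.
prompt = "danish documentation: Returnerer en liste af tokens.\nenglish translation:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.6,
    max_length=320,  # as in the run above; counts prompt plus generation
    pad_token_id=tokenizer.eos_token_id,
)

# Approximate prompt stripping; tokenization round-trips can shift whitespace.
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
prediction = text[len(prompt):].strip().split("\n")[0]

bleu = evaluate.load("bleu")
score = bleu.compute(predictions=[prediction],
                     references=[["Returns a list of tokens."]])
print(f"BLEU: {score['bleu'] * 100:.2f}%")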

loubnabnl commented 1 year ago

Great, thanks for working on this!