Evaluation on Code-to-text CodeXGLUE task: references preprocessing

loubnabnl commented 1 year ago

Hi, I was wondering if you could share the preprocessing script of the reference comments/docstrings in the code-to-text task from CodeXGLUE to remove the extra context.

Also, the reference solution is sometimes long, spanning many lines, while the candidate solution has only one. Do you keep only one line for the references too?

Thanks in advance.

dpfried commented 1 year ago

Hi Loubna,

Apologies that I haven't gotten this integrated into this repo yet, but I have an implementation in my fork of lm-evaluation-harness.

The reference processing is in https://github.com/dpfried/lm-evaluation-harness/blob/5d9a6aaaaa929bcad95bb73d85e78fe75eb64b4e/lm_eval/tasks/codexglue_summarization.py#L146

and the model prompt creation is in https://github.com/dpfried/lm-evaluation-harness/blob/5d9a6aaaaa929bcad95bb73d85e78fe75eb64b4e/lm_eval/tasks/codexglue_summarization.py#L71

Yes, we kept only one line for the references, and also tokenized them (following email correspondence one of our authors had with the CodeXGLUE authors). Let me know if you have any questions or run into issues!
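
For a rough idea of what "keep one line and tokenize" can look like, here is a minimal sketch. It is not the code at the links above: the function name `preprocess_reference` and the regex-based tokenizer are assumptions made for illustration, and the linked implementation may split tokens differently.

```python
import re


def preprocess_reference(docstring: str) -> str:
    """Hypothetical helper: keep the first non-empty line of a reference
    docstring and return it as space-separated tokens."""
    # Drop blank lines and surrounding whitespace, then keep the first line only.
    lines = [line.strip() for line in docstring.strip().splitlines() if line.strip()]
    first_line = lines[0] if lines else ""
    # Assumed tokenization: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", first_line)
    return " ".join(tokens)


if __name__ == "__main__":
    reference = "Returns the sum of a and b.\n\n:param a: first operand"
    print(preprocess_reference(reference))
```

Applying the same function to the model output as well should keep candidates and references comparable when computing the metric.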

Daniel

loubnabnl commented 1 year ago

Thanks a lot for your reply, Daniel! This is helpful.
