bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

[WIP; Input Requested] Integrated DS1000 task #39

Closed. benlipkin closed this 1 year ago

benlipkin commented 1 year ago

This PR addresses the Add DS-1000 Milestone.

I have included an end-to-end task implementation.

Notes & Remaining Qs:

Please share other comments/edits. Thanks.

loubnabnl commented 1 year ago

Thanks for the integration Ben, the code looks good! I can do a more detailed review after we address some points. The data download approach you used also looks good to me and solves the upload issue.

For the dependencies, a better approach would be to add a setup.py file to the evaluation harness and list them as extras so the user installs them from the beginning. I can add this file or you can open a PR if you like; either way, you can assume they are installed for this task.

Regarding the prompt format in get_prompt, I think we need to adapt it. The benchmark has two generation modes: Insertion (default, i.e. infilling) and Completion (left-to-right). So we should probably have two tasks per library, e.g. numpy-insertion and numpy-completion. The first mode seems to give better results; for Codex they build a prefix and suffix here, and we can do the same, building the prompt based on a --fim_tokens argument we can add to the evaluation harness. For example, SantaCoder takes as FIM input:

input_text_fim = "<fim-prefix>prefix<fim-suffix>suffix<fim-middle>"

Let's also check what the infilling format for InCoder is and try to make the prompt generic (these are the only code models that support infilling now AFAIK). This also requires changing the post-processing of the generations.
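Something like the following sketch is roughly what I have in mind; the fim_tokens mapping and the helper names are just placeholders (not an API that exists in the harness yet), and the InCoder sentinels are my assumption based on its public infilling demo:

```python
# Sketch only: fim_tokens and these helpers are hypothetical, not part of the harness yet.
# SantaCoder tokens follow the example above; InCoder sentinels are assumed from its infilling demo.
SANTACODER_FIM = {"prefix": "<fim-prefix>", "suffix": "<fim-suffix>", "middle": "<fim-middle>"}
INCODER_FIM = {"prefix": "", "suffix": "<|mask:0|>", "middle": "<|mask:0|>"}

def build_insertion_prompt(prefix: str, suffix: str, fim_tokens: dict) -> str:
    """Insertion mode: wrap a (prefix, suffix) pair into a single infilling prompt."""
    return fim_tokens["prefix"] + prefix + fim_tokens["suffix"] + suffix + fim_tokens["middle"]

def build_completion_prompt(prefix: str) -> str:
    """Completion mode: left-to-right generation from the prefix only."""
    return prefix
```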

For the other mode (Completion), it seems they only feed the prefix (see), but let's double-check that.

When we finish, we can evaluate InCoder and compare results in both modes to the paper. We have some authors of the benchmark in BigCode Slack that we can consult if there's a discrepancy.

cc @Muennighoff if you have any comments

benlipkin commented 1 year ago

Hi team, new update.

As suggested, I've shifted dependency management from runtime to setup. You can find setup.py here, and I've noted this requirement in the README here.
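For reference, the extras end up looking roughly like this (sketch only; the extra name and package list below are illustrative, not the exact contents of the file):

```python
# setup.py sketch: the "ds1000" extra and its package list are illustrative assumptions.
from setuptools import setup, find_packages

setup(
    name="bigcode-evaluation-harness",
    packages=find_packages(),
    extras_require={
        # installed with: pip install -e ".[ds1000]"
        "ds1000": ["numpy", "pandas", "scipy", "matplotlib", "scikit-learn"],
    },
)
```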

As for the infilling integration, it has proved a bit more complex than initially anticipated. The Task edits are minimal, but the prompt restructuring and post-processing require access to the tokenizer, so they cannot be completed entirely within the task; I've instead addressed those requirements in the TokenizedDataset object and the complete_code function in utils.py here. I've worked off the short script Loubna sent me over Slack and Arjun's infilling demos here, which seem to have been built with guidance from Carolyn and Daniel. I've managed to get things working end-to-end for single samples, e.g.,

accelerate launch main.py --model facebook/incoder-1B --tasks ds1000-numpy-insertion --limit 1 --allow_code_execution=True

accelerate launch main.py --model bigcode/santacoder --tasks ds1000-numpy-insertion --limit 1 --allow_code_execution=True --revision dedup-alt-comments --trust_remote_code True

but I'm seeing weird tokenization patterns when limit > 1 invokes batching. The text appears fine up until the first special token and then gets mangled afterwards. I'm unsure how to proceed and might request a second set of eyes on this if anyone on Daniel Fried's team has bandwidth.

Let me know what you think, thanks.

benlipkin commented 1 year ago

Thanks, have addressed all minor comments and suggestions in the relevant conversations above.

In terms of FIM, I've fixed the bug that appeared when tokenizing multiple samples. In the complete_code function, input_ids are selected during generation as input_ids=batch["ids"][:, : batch["input_len"]]. With left padding, that slice throws out the prompt and generation starts from all padding, producing nonsense output. Reverting to right padding, as in the DS-1000 script, fixes this. Left padding was required in Arjun's code because multiple prompts of different lengths were evaluated in parallel; in the DS-1000 script and here, each task is evaluated separately and batching is only used for multiple generations from the same prompt. In summary, FIM is now working for me and passing all my tests.
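A toy example of the failure mode (simplified, not the actual harness code):

```python
# Toy illustration only: with left padding, slicing the first input_len columns keeps the
# pad tokens and drops the prompt; with right padding the same slice recovers the prompt.
import torch

pad = 0
prompt = [11, 12, 13]                   # toy prompt token ids
input_len = len(prompt)

left_padded = torch.tensor([[pad, pad] + prompt])
right_padded = torch.tensor([[*prompt, pad, pad]])

print(left_padded[:, :input_len])       # tensor([[ 0,  0, 11]])  -> prompt lost, mostly padding
print(right_padded[:, :input_len])      # tensor([[11, 12, 13]])  -> prompt preserved
```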

I've also compared prompts with the DS-1000 eval script for the first few samples (after adjusting the linebreaks), and everything lines up.

loubnabnl commented 1 year ago

Perfect, great work! I think we should also aim to have this work in batch mode, since that will make the evaluation much faster. But maybe we can merge this PR first and then make the necessary changes in the other PR, wdyt?

benlipkin commented 1 year ago

Hi Loubna, in introducing batching more generally, Logesh has already covered this use case: padding is shifted to the left here and the right-truncation is disabled here when batch.dim[0] > 1, so this should work right out of the box with his PR.
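As I understand it, the behavior amounts to roughly the following (a paraphrase with made-up names, not the actual code in his PR):

```python
# Rough paraphrase of the batching behaviour described above; names and structure are guesses.
def select_input_ids(batch):
    if batch["ids"].shape[0] > 1:
        # batched prompts are left-padded, so keep the full padded sequence
        return batch["ids"]
    # single prompt per batch: right padding, so truncate to the prompt length as before
    return batch["ids"][:, : batch["input_len"]]
```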

benlipkin commented 1 year ago

Awesome, thanks Loubna! I've posted the requested edits as well as updates to the documentation. Let me know if any final adjustments are needed on these last few commits.