Closed: benlipkin closed this 1 year ago
Thanks for the integration Ben, the code looks good! I can do a more detailed review after we address some points. The data download approach you used also looks good to me and solves the upload issue.
For the dependencies, a better approach would be to add a `setup.py` file to the evaluation harness and list them in extras, so the user installs them from the beginning. I can add this file, or you can open a PR if you like, but you can assume they are installed for this task.
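The extras approach suggested above could look roughly like the sketch below. The package name, version, and dependency list are illustrative assumptions, not the harness's actual configuration; the kwargs are shown as a dict so the extras are easy to inspect.

```python
from setuptools import find_packages

# Keyword arguments a setup.py could pass to setuptools.setup().
# Package name and dependency pins are illustrative, not the real config.
SETUP_KWARGS = {
    "name": "bigcode-evaluation-harness",  # assumed package name
    "version": "0.1.0",
    "packages": find_packages(),
    "install_requires": ["datasets", "transformers", "accelerate"],
    "extras_require": {
        # DS-1000 task dependencies (illustrative list)
        "ds1000": ["numpy", "pandas", "scipy", "matplotlib", "scikit-learn"],
    },
}
```

A real `setup.py` would end with `setup(**SETUP_KWARGS)`, and users would then install everything up front with something like `pip install -e ".[ds1000]"`.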
Regarding the prompt format in `get_prompt`, I think we need to adapt it. The benchmark has two generation modes: Insertion (the default, i.e. infilling) and Completion (left-to-right). So we should probably have two tasks for each library, e.g. `numpy-insertion` and `numpy-completion`. The first mode seems to give better results; for Codex they build a prefix and suffix here, and we can do the same, building the prompt from a `--fim_tokens` argument we can add to the evaluation harness. For example, SantaCoder takes as FIM input:
input_text_fim = "<fim-prefix>prefix<fim-suffix>suffix<fim-middle>"
Let's also check what the infilling format for InCoder is and try to make the prompt generic (these are the only code models that support infilling now AFAIK). This also requires changing the post-processing of the generations.
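A generic prompt builder along these lines could be sketched as below. The SantaCoder sentinels come from the example in this thread; the InCoder mapping is my understanding of its mask-sentinel scheme and should be verified against the official infilling demo before use.

```python
def build_infill_prompt(prefix: str, suffix: str, fim_tokens: dict) -> str:
    """Assemble a fill-in-the-middle prompt from model-specific sentinels."""
    return (
        fim_tokens["prefix"] + prefix
        + fim_tokens["suffix"] + suffix
        + fim_tokens["middle"]
    )

# SantaCoder-style sentinels (from the example above).
SANTACODER_FIM = {
    "prefix": "<fim-prefix>",
    "suffix": "<fim-suffix>",
    "middle": "<fim-middle>",
}

# InCoder uses mask sentinels; mapping them onto the same three slots is an
# assumption here -- double-check against InCoder's infilling demo.
INCODER_FIM = {
    "prefix": "",
    "suffix": "<|mask:0|>",
    "middle": "<|mask:0|>",
}

print(build_infill_prompt("def add(a, b):\n    ", "\n    return result", SANTACODER_FIM))
```

Keeping the sentinels in a per-model dict like this would let the proposed `--fim_tokens` argument select them without touching the task code.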
For the other mode (Completion), it seems they only feed the prefix (see), but let's double-check that.
When we finish, we can evaluate InCoder and compare results in both modes to the paper. We have some authors of the benchmark in BigCode Slack that we can consult if there's a discrepancy.
cc @Muennighoff if you have any comments
Hi team, new update.
As suggested, I've shifted dependency management from runtime to setup. You can find `setup.py` here, and I've noted this requirement in the README here.
As for the infilling integration, it has proved a bit more complex than initially anticipated. The `Task` edits are minimal, but the prompt restructuring and post-processing require access to the tokenizer, so they cannot be completed entirely within the task; I've instead addressed those requirements in the `TokenizedDataset` object and the `complete_code` function in `utils.py` here. I've worked off the short script Loubna sent me over Slack and Arjun's infilling demos here, which seem to have been built with guidance from Carolyn and Daniel. I've managed to get things working end-to-end for single samples, e.g.,
accelerate launch main.py --model facebook/incoder-1B --tasks ds1000-numpy-insertion --limit 1 --allow_code_execution=True
accelerate launch main.py --model bigcode/santacoder --tasks ds1000-numpy-insertion --limit 1 --allow_code_execution=True --revision dedup-alt-comments --trust_remote_code True
but I'm seeing weird tokenization patterns when `--limit > 1` triggers batching. The text appears fine up until the first special token, then gets mangled afterwards. I'm unsure how to proceed and might request a second set of eyes on this if anyone on Daniel Fried's team has bandwidth.
Let me know what you think, thanks.
Thanks, have addressed all minor comments and suggestions in the relevant conversations above.
In terms of FIM, I've fixed the bug from tokenizing multiple samples. In the `complete_code` function, the input ids are selected during generation as `input_ids=batch["ids"][:, : batch["input_len"]]`. So, with left padding, the prompt gets thrown out and generation starts from all padding, producing nonsense output. Reverting to right padding, as in the DS-1000 script, fixes this. Left padding was required in Arjun's code, since multiple prompts of different lengths were evaluated in parallel; in the DS-1000 script and here, each task is evaluated separately and batching is only used for multiple generations from the same prompt. In summary, FIM is now working for me and passing all my tests.
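The padding bug described above can be illustrated with a minimal sketch: with left padding, slicing the first `input_len` positions returns pad tokens instead of the prompt, so generation starts from padding. The token ids here are made up.

```python
PAD = 0
prompt = [5, 6, 7]          # tokenized prompt, input_len = 3
input_len = len(prompt)
max_len = 6

left_padded = [PAD] * (max_len - input_len) + prompt    # [0, 0, 0, 5, 6, 7]
right_padded = prompt + [PAD] * (max_len - input_len)   # [5, 6, 7, 0, 0, 0]

# complete_code selects input_ids = batch["ids"][:, :input_len]:
print(left_padded[:input_len])    # the prompt is thrown away, only padding remains
print(right_padded[:input_len])   # the prompt is preserved
```

The same slicing applied to a right-padded batch recovers the prompt exactly, which is why reverting the padding side fixes the mangled generations.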
I've also compared prompts with the DS-1000 eval script for the first few samples (after adjusting the linebreaks), and everything lines up.
Perfect, great work! I think we should also aim to have it work in batch mode before merging, since it will make the evaluation much faster. But maybe we can merge this PR first and then make the necessary changes in another PR, wdyt?
Awesome, thanks Loubna! I've posted the requested edits as well as updates to the documentation. Let me know if any final adjustments are needed on these last few commits.
This PR addresses the Add DS-1000 Milestone.
I have included an end-to-end task implementation.
Notes & Remaining Qs:
Note 🎯: We only provide a zip file for the complete dataset to prevent possible web crawlers from crawling the files as training corpora. It's a precaution for future models' memorization. Please DO NOT upload the unzipped data files online.
As such, I have added an option for any `Task` to handle its own dataset download here, and have automated this particular task to download its own external files from its constructor here.
Remaining Q ❓: should dependency installation be gated behind the existing `--allow_code_execution=False`, or behind a new `--allow_dependency_installation=False` flag, with the base case raising an informative error message and proceeding only if set to `True`? I would lean towards the latter, to avoid requiring user input.
Please share other comments/edits. Thanks.
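The constructor-driven download and the flag-gated behavior discussed above could be sketched as below. All names here (the class, the flag, the URL) are illustrative, not the actual harness API.

```python
import os
import urllib.request
import zipfile


class DS1000TaskSketch:
    """Sketch of a Task that fetches its own zipped dataset in its constructor."""

    DATA_URL = "https://example.com/ds1000_data.zip"  # placeholder URL

    def __init__(self, data_dir="ds1000_data", allow_dependency_installation=False):
        # Base case: refuse with an informative error unless explicitly allowed.
        if not allow_dependency_installation:
            raise RuntimeError(
                "DS-1000 requires extra dependencies and a dataset download; "
                "re-run with allow_dependency_installation=True to proceed."
            )
        if not os.path.isdir(data_dir):
            archive = data_dir + ".zip"
            urllib.request.urlretrieve(self.DATA_URL, archive)
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(data_dir)


try:
    DS1000TaskSketch()  # default refuses and explains why
except RuntimeError as err:
    print(err)
```

Only the refusal path runs in this sketch; the download branch is never exercised against the placeholder URL.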