MichiganDataScienceTeam / F24-mini-copilot

Building and deploying a lightweight code autocompletion tool, from GPT-2 weights to a working VSCode extension.
MIT License

Feat/data preprocessing #28

Closed: USSiamaboat closed 3 weeks ago

USSiamaboat commented 3 weeks ago

Replacement for https://github.com/MichiganDataScienceTeam/F24-mini-copilot/pull/27

Original note: Call get_data() from preprocess.py to get an IterableDataset of the Hugging Face data with cleaning and filtering applied. Cleaning removes line comments, docstrings, trailing whitespace, and extra newlines. The logic that removes docstrings while leaving other multiline strings alone is somewhat sketchy: I replace the """ delimiters of multiline strings with a flag value, run a regex over all docstrings, then replace the flag value with """ again. To test, run main to look at the first 100 example inputs; the output looks correct. There are also some tests for the clean_comments function itself. The filtering is pretty rudimentary: content is included only if it imports any popular data science package. Pushing to main because retrain-tokenizer and chunking need this feature.
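For reviewers who want to see the flag-value idea in isolation, here is a minimal sketch of the cleaning and filtering steps described above. It is not the code in preprocess.py. The names `clean_source`, `uses_data_science_package`, `FLAG`, and `PACKAGES` are hypothetical, the package list is only illustrative, and the rule "a triple-quoted string with code before it on the same line is a value, not a docstring" is an assumed heuristic. It only handles `"""` delimiters, removes only whole-line `#` comments, and would also drop a line starting with `#` inside a protected multiline string.

```python
import re

# Hypothetical sentinel; assumed never to appear in real source files.
FLAG = "\x00TQ\x00"

# Any triple-double-quoted string, possibly spanning several lines.
TRIPLE = re.compile(r'"""(.*?)"""', re.DOTALL)
# A triple-quoted string that is the first thing on its line (treated as a docstring).
DOCSTRING = re.compile(r'^[ \t]*"""(.*?)"""[ \t]*\n?', re.DOTALL | re.MULTILINE)
# A line that contains nothing but a comment.
LINE_COMMENT = re.compile(r"^[ \t]*#.*\n?", re.MULTILINE)

# Illustrative list only; the real filter may use different packages.
PACKAGES = ("numpy", "pandas", "sklearn", "torch", "matplotlib")
IMPORT = re.compile(
    r"^[ \t]*(?:import|from)[ \t]+(?:" + "|".join(PACKAGES) + r")\b",
    re.MULTILINE,
)


def clean_source(source: str) -> str:
    """Strip docstrings, whole-line comments, trailing whitespace, extra newlines."""

    def protect(match: re.Match) -> str:
        # Text between the start of the line and the opening delimiter.
        line_start = source.rfind("\n", 0, match.start()) + 1
        if source[line_start:match.start()].strip():
            # Code precedes the string (e.g. `x = """..."""`): hide its delimiters.
            return FLAG + match.group(1) + FLAG
        return match.group(0)

    text = TRIPLE.sub(protect, source)      # 1. flag multiline string values
    text = DOCSTRING.sub("", text)          # 2. drop the remaining docstrings
    text = text.replace(FLAG, '"""')        # 3. restore the flagged delimiters
    text = LINE_COMMENT.sub("", text)       # remove whole-line comments
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)  # trailing whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
    return text.strip("\n") + "\n"


def uses_data_science_package(source: str) -> bool:
    """Rudimentary filter: keep a file only if it imports a listed package."""
    return IMPORT.search(source) is not None
```

A stricter version could use the standard library `tokenize` or `ast` modules instead of regexes, at the cost of failing on files that do not parse.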

USSiamaboat commented 3 weeks ago

Fixes merge conflicts: https://github.com/MichiganDataScienceTeam/F24-mini-copilot/pull/28/commits/bae964d5828cbf00bfaadc77299563a43ef46aa5