Original note:
Call get_data() from preprocess.py to get an IterableDataset over the Hugging Face data with cleaning and filtering applied. We remove line comments, docstrings, and trailing whitespace / extra newlines. The logic that removes docstrings while leaving other multiline strings alone is somewhat fragile: I replace the `"""` delimiters of non-docstring multiline strings with a flag value, regex away all docstrings, then swap the flag value back to `"""`. You can inspect the first 100 cleaned examples by running main; the output looks correct. There are also tests for the clean_comments function itself. The filtering is still rudimentary: a file is kept only if it imports a popular data science package. Pushing to main because retrain-tokenizer and chunking need this feature.
Replacement for https://github.com/MichiganDataScienceTeam/F24-mini-copilot/pull/27
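The cleaning and filtering described above can be sketched roughly as follows. This is a minimal illustration of the sentinel trick and the import filter, not the actual preprocess.py code: the regexes, the `SENTINEL` value, the `DS_PACKAGES` list, and the `keep_example` name are all assumptions for the sake of the example.

```python
import re

# Flag value assumed not to occur in real source files (illustrative).
SENTINEL = "\x00TRIPLE_QUOTE\x00"

def clean_comments(code: str) -> str:
    """Sketch of the sentinel trick: protect non-docstring multiline
    strings, regex away docstrings, then restore the delimiters."""
    # Protect triple-quoted strings used as values (preceded by '=', '(' or
    # ','), so the docstring pass below leaves them alone.
    code = re.sub(
        r'(?<=[=(,])(\s*)"""(.*?)"""',
        lambda m: m.group(1) + SENTINEL + m.group(2) + SENTINEL,
        code,
        flags=re.DOTALL,
    )
    # Remove the remaining triple-quoted blocks (treated as docstrings).
    code = re.sub(r'""".*?"""', "", code, flags=re.DOTALL)
    # Restore the protected multiline strings.
    code = code.replace(SENTINEL, '"""')
    # Strip line comments (naive: ignores '#' inside string literals).
    code = re.sub(r"#[^\n]*", "", code)
    # Drop trailing whitespace and collapse runs of blank lines.
    code = re.sub(r"[ \t]+\n", "\n", code)
    return re.sub(r"\n{3,}", "\n\n", code)

# Hypothetical package list for the import filter; the real list may differ.
DS_PACKAGES = ("numpy", "pandas", "sklearn", "scipy", "matplotlib", "torch")
_IMPORT_RE = re.compile(
    r"^\s*(?:import|from)\s+(?:" + "|".join(DS_PACKAGES) + r")\b",
    re.MULTILINE,
)

def keep_example(content: str) -> bool:
    """Keep a file only if it imports a popular data science package."""
    return bool(_IMPORT_RE.search(content))
```

The sentinel approach trades robustness for simplicity: it avoids a full parser, but a tokenize-based pass would handle edge cases (nested quotes, `#` inside strings) more reliably.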