CarperAI / Code-Pile

This repository contains all the code for collecting large scale amounts of code from GitHub.

Data Processing #37

Open ncoop57 opened 1 year ago

ncoop57 commented 1 year ago

We should follow a process similar to the BigScience workshop's dataset processing. They include many tools ready for us to use, such as data deduplication (both exact-match and near-dedup), filtering of low-information-content examples, removal of potentially hateful documents, and removal of PII.

They have all their tools available and discussions of them here: https://github.com/bigscience-workshop/data_tooling
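For reference, the exact-match side of that dedup can be sketched in a few lines on top of Hugging Face datasets. This is a minimal illustration only, not the BigScience code; the parquet path and the `text` column name are assumptions:

```python
# Minimal sketch of exact-match dedup on a Hugging Face dataset.
# Assumes a "text" column and a local parquet file; not the BigScience tooling itself.
import hashlib
from datasets import load_dataset

ds = load_dataset("parquet", data_files="data.parquet", split="train")

def add_hash(example):
    # Normalize whitespace and case so trivially different copies collapse to one key.
    normalized = " ".join(example["text"].split()).lower()
    return {"hash": hashlib.md5(normalized.encode("utf-8")).hexdigest()}

ds = ds.map(add_hash)

seen = set()
def first_occurrence(example):
    # Keep only the first document seen for each hash (single-process only).
    if example["hash"] in seen:
        return False
    seen.add(example["hash"])
    return True

deduped = ds.filter(first_occurrence)
```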

Here is an initial set of tasks to perform:

PhungVanDuy commented 1 year ago

New repo: https://github.com/bigscience-workshop/data-preparation/tree/main/preprocessing/training

PhungVanDuy commented 1 year ago

@ncoop57
Filtering: https://github.com/bigscience-workshop/data-preparation/blob/main/preprocessing/training/01b_oscar_cleaning_and_filtering/filtering.py
Deduplication: https://github.com/bigscience-workshop/data-preparation/blob/main/preprocessing/training/01b_oscar_cleaning_and_filtering/deduplicate/self_deduplicate.py
PII: https://github.com/bigscience-workshop/data-preparation/blob/main/preprocessing/training/01b_oscar_cleaning_and_filtering/anonymization.py

Everyone can use these for filtering and deduplication; I am writing a light version of them. My dataset is quite clean in its original form, so some of the functions are not necessary for me, but other people may find useful functions there.
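A minimal sketch of what such a light filtering pass could look like, assuming a `text` column and placeholder word-count/flagged-word values rather than the real ones:

```python
# Sketch of a light filtering pass: drop very short documents and documents
# containing flagged words. Column name, threshold, and word list are placeholders.
from datasets import load_dataset

FLAGGED_WORDS = {"badword1", "badword2"}  # placeholder list
MIN_WORDS = 20                            # placeholder threshold

def keep_document(example):
    words = example["text"].split()
    if len(words) < MIN_WORDS:
        return False
    lowered = {w.lower() for w in words}
    return not (lowered & FLAGGED_WORDS)

ds = load_dataset("parquet", data_files="enwikibooks.parquet", split="train")
filtered = ds.filter(keep_document)
```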

PhungVanDuy commented 1 year ago

@ncoop57 I created a simple script that adapts the code from the sources above for the Wikibooks dataset; it just needs the parquet file and then runs filtering and dedup. I pushed it here: https://github.com/PhungVanDuy/Code-Pile/tree/books_dataset/codepile/enwikibooks/data_process_pipeline.

Please review the thresholds if you want to use it for your dataset.
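One way to keep those thresholds easy to review and tune per dataset is to collect them in a single config; the names and values below are illustrative placeholders, not the ones in the script:

```python
# Illustrative threshold config; names and values are placeholders to be tuned
# per dataset, not the ones used in the Wikibooks script.
FILTER_THRESHOLDS = {
    "min_word_count": 20,        # drop documents shorter than this
    "max_char_repetition": 0.2,  # drop documents dominated by repeated characters
    "dedup_jaccard": 0.85,       # similarity above which two documents count as near-duplicates
}

def passes_length_filter(example, thresholds=FILTER_THRESHOLDS):
    return len(example["text"].split()) >= thresholds["min_word_count"]
```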

PhungVanDuy commented 1 year ago

I propose a workflow like this: convert the data to a HuggingFace Dataset object, and have our pipeline process that format to:

  1. remove small documents/code (by word count)
  2. remove documents containing flagged words
  3. remove PII
  4. deduplicate (near-deduplicate) both code and documents (see the sketch below)

I just found that BigCode also has implementations for PII and dedup: https://github.com/bigcode-project/bigcode-analysis/tree/main/data_analysis
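For the near-deduplication step, one common approach is MinHash + LSH. Here is a self-contained sketch using the datasketch library; the threshold, num_perm, and whitespace tokenization are placeholder choices, and this is not the BigScience or BigCode implementation:

```python
# Sketch of near-deduplication with MinHash + LSH via the datasketch library.
# Threshold, num_perm, and tokenization are placeholder choices; this illustrates
# the idea only, not the BigScience/BigCode code.
from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

def near_dedup(documents, threshold=0.85, num_perm=128):
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for idx, doc in enumerate(documents):
        m = minhash_of(doc, num_perm)
        if lsh.query(m):          # a sufficiently similar document was already kept
            continue
        lsh.insert(str(idx), m)
        kept.append(doc)
    return kept

# Example usage on the strings pulled from a Dataset's "text" column:
# kept_docs = near_dedup(ds["text"])
```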

ncoop57 commented 1 year ago

We will not dedup the following:

  1. The Stack - since it has already had dedup run on it
  2. GitHub Diffs - as it would remove too many instances due to their short length and high overlap

ncoop57 commented 1 year ago

We will use this lib: https://github.com/CarperAI/squeakily to manage the different filtering and cleaning steps on a per-dataset basis. There will be a global set of filters and cleaners applied to every dataset, such as flagged-word removal, and a local set of filters and cleaners specific to each data source.
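Conceptually, the global/local split could be organized like this. This is a plain-Python sketch of the idea over Hugging Face datasets, not squeakily's actual API; the filter names and source keys are illustrative:

```python
# Sketch of the global vs. local filter split. Structure only; not squeakily's API,
# and the filter names and source keys below are placeholders.
from datasets import Dataset

def no_flagged_words(example):
    return "badword" not in example["text"].lower()

def long_enough(example):
    return len(example["text"].split()) >= 20

def looks_like_prose(example):
    # Example of a source-specific filter, e.g. for Wikibooks-style documents.
    return example["text"].count("\n") < 1000

GLOBAL_FILTERS = [no_flagged_words]                  # applied to every dataset
LOCAL_FILTERS = {
    "enwikibooks": [long_enough, looks_like_prose],  # per-source filters
    "github_diffs": [],                              # e.g. no length filter here
}

def process(name, dataset):
    # Run the global filters first, then whatever is registered for this source.
    for f in GLOBAL_FILTERS + LOCAL_FILTERS.get(name, []):
        dataset = dataset.filter(f)
    return dataset

ds = Dataset.from_dict({"text": ["some example document " * 10, "badword here"]})
cleaned = process("enwikibooks", ds)
```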