CarperAI / Code-Pile

This repository contains all the code for collecting code from GitHub at large scale.
MIT License

Data pipeline: Filtering, PII, Deduplicate - Wikibooks, AI4Code, Competitive Programming, ... #41

Closed PhungVanDuy closed 1 year ago

ncoop57 commented 1 year ago

Hey @PhungVanDuy, it looks good! One thing: could you prepend the licensing information to the files from the bigcode repository so people can easily identify them as theirs?

ncoop57 commented 1 year ago

Do you think it is possible to easily run the near dedup across data sources, or is it better to just run it on each source on its own?
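For illustration, here is a minimal sketch of what pooling the sources for near dedup could look like. It is not the code in this PR: the function names, the 5-token shingles, the 64-hash MinHash signature, and the 0.8 threshold are all assumptions, and it compares every pair of documents, which a real pipeline would probably replace with LSH banding.

```python
import hashlib
from itertools import combinations

NUM_PERM = 64  # assumed number of hash functions per MinHash signature


def shingles(text, k=5):
    """Return the set of k-token shingles of a document (k is an assumption)."""
    tokens = text.split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash(text, num_perm=NUM_PERM):
    """Build a MinHash signature: the minimum salted hash per hash function."""
    doc_shingles = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in doc_shingles
        ))
    return signature


def similarity(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def cross_source_near_duplicates(docs, threshold=0.8):
    """Find near-duplicate pairs whose two documents come from different sources.

    docs is a list of (source, doc_id, text) tuples pooled from every source.
    This is a brute-force O(n^2) comparison, fine for a sketch only.
    """
    signed = [(source, doc_id, minhash(text)) for source, doc_id, text in docs]
    pairs = []
    for (src_a, id_a, sig_a), (src_b, id_b, sig_b) in combinations(signed, 2):
        if src_a == src_b:
            continue  # same-source pairs are left to per-source dedup
        if similarity(sig_a, sig_b) >= threshold:
            pairs.append(((src_a, id_a), (src_b, id_b)))
    return pairs
```

Tagging each document with its source keeps both options open: the same signatures can be compared within one source or across all of them, so the choice between the two is mostly a question of how many comparisons the pipeline can afford.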

ncoop57 commented 1 year ago

Gonna close this since we are going in a different direction.