GAIR-NLP / ProX

Official Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale"
https://gair-nlp.github.io/ProX/
Apache License 2.0
145 stars 7 forks

Code Dataset #5

Open mtasic85 opened 3 days ago

mtasic85 commented 3 days ago

Hi there, great work!

Do you have plans for a code dataset, and if so, when can we expect it?

koalazf99 commented 1 day ago

Hi @mtasic85, thank you for your interest in ProX! We will try it on code data in the coming days; however, I can't confirm the exact timeline yet. Unlike our web and math data, which largely come from web documents, code data mainly comes from GitHub, which requires a different approach to downloading, organizing, and training. We're still figuring out how to conduct some initial experiments.