GAIR-NLP / ProX

Official Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale"
https://gair-nlp.github.io/ProX/
Apache License 2.0
195 stars 15 forks

Multi-language support? #3

Open DumoeDss opened 2 months ago

DumoeDss commented 2 months ago

Great and insightful work! If the refinement model does not support multilingualism, will it work for multilingual datasets?

koalazf99 commented 2 months ago

Thank you for your kind words! 😄 Since we haven't conducted experiments on multilingual data, I don't have a definitive answer, but I think ProX could work better given proper SFT data. Please note that we provide fairly detailed prompts in the appendix (around pages 21-24), and I believe applying these prompts directly to data in other languages to generate SFT data might be a viable approach.
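
For readers who want to try this, here is a minimal sketch of what "applying the prompts to data in other languages to generate SFT data" could look like. It is not code from the ProX repo. `PROMPT_TEMPLATE`, `build_sft_records`, and the `annotate` callable are hypothetical names: the template is a placeholder for the prompt text in the paper's appendix, and `annotate` stands in for whatever strong multilingual LLM you call to produce the refinement output.

```python
from typing import Callable, Dict, Iterable, List

# Placeholder only: paste the actual prompt from the paper's appendix here.
# "{document}" marks where the raw text is substituted (an assumption of this sketch).
PROMPT_TEMPLATE = "<refinement prompt from the paper's appendix>\n\n{document}"


def build_sft_records(
    documents: Iterable[str],
    annotate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each non-English document with an annotator response.

    `annotate` is a hypothetical callable wrapping any LLM you trust on the
    target language; it receives the filled prompt and returns the response.
    """
    records = []
    for doc in documents:
        # str.replace avoids str.format choking on braces inside the prompt or document.
        prompt = PROMPT_TEMPLATE.replace("{document}", doc)
        response = annotate(prompt)
        # Skip empty responses so they do not end up in the SFT set.
        if response.strip():
            records.append({"prompt": prompt, "response": response})
    return records
```

The resulting prompt/response pairs would then be used to fine-tune a small refining model for the target language. Whether this transfers well is untested, as noted above.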