EleutherAI / the-pile

(Natural) Languages in The PILE #98

Open suzyahyah opened 2 years ago

suzyahyah commented 2 years ago

Hi,

Has any language identification been run on the sentences in the Pile, and have the proportions of each language been quantified? We can get an idea from Europarl, but it is less clear with Common Crawl in the mix.

I have not seen this in any of the official documentation or the paper. If I missed something please let me know.

Thanks!

dboggs95 commented 1 year ago

@suzyahyah See page 9 of their paper: https://arxiv.org/abs/2101.00027 (PDF: https://arxiv.org/pdf/2101.00027.pdf)

A fully multilingual expansion of the Pile is in their future plans. I don't know whether that includes being able to tell apart the languages it contains.
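
In the meantime, a rough per-language breakdown could be estimated on a sample of documents. Below is a minimal sketch, not anything from the Pile codebase: `toy_detect` and `language_proportions` are hypothetical names, and the stopword lookup is only a stand-in for a real language-identification model, which you would pass in as `detect`.

```python
from collections import Counter
from typing import Callable, Dict, Iterable

# Toy stopword lists: an illustrative stand-in for a real language-ID model.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "und", "ist", "das", "nicht"},
    "fr": {"le", "et", "est", "les", "des"},
}


def toy_detect(text: str) -> str:
    """Hypothetical detector: pick the language whose stopwords occur most often."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # If no stopword matched at all, refuse to guess.
    return best if scores[best] > 0 else "unknown"


def language_proportions(
    docs: Iterable[str],
    detect: Callable[[str], str] = toy_detect,
) -> Dict[str, float]:
    """Return the fraction of documents assigned to each language label."""
    counts = Counter(detect(doc) for doc in docs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.items()}


# Made-up sample text, not taken from the Pile.
sample = [
    "The cat is on the mat and the dog is outside",
    "Der Hund ist nicht im Haus und das ist gut",
    "Le chat est sur le tapis et les chiens dorment",
    "The paper is part of the documentation",
]
print(language_proportions(sample))
```

On these four strings, two are labelled `en` and one each `de` and `fr`. The stopword heuristic is far too crude for web text, so for anything real you would replace `toy_detect` with a proper language-ID model and keep only the counting step.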