Open: suzyahyah opened 2 years ago
@suzyahyah Read their paper, page 9. https://arxiv.org/abs/2101.00027 https://arxiv.org/pdf/2101.00027.pdf
A fully multilingual expansion of the Pile is in their future plans. I don't know whether that will include a way to distinguish between the languages it contains.
Hi,
Has anyone run language ID on the sentences in the Pile and quantified the proportion of each language? We can get an idea from Europarl, but it is less clear with Common Crawl in the mix.
I have not seen this in any of the official documentation or the paper. If I missed something, please let me know.
Thanks!
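
For anyone who wants a rough estimate in the meantime, here is a minimal sketch of the tallying step: label each sampled text with a detector function and count the labels. The names `language_proportions` and `toy_detect` are hypothetical, and `toy_detect` is only a stand-in that looks at Unicode script (it is not real language ID and does not come from the Pile authors). In practice you would pass a proper language-ID model as `detect`.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def language_proportions(
    texts: Iterable[str], detect: Callable[[str], str]
) -> Dict[str, float]:
    """Return the fraction of texts assigned to each label by `detect`."""
    counts = Counter(detect(text) for text in texts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def toy_detect(text: str) -> str:
    """Hypothetical stand-in: label by the first Han or Cyrillic character.

    This identifies a script, not a language. Swap in a real
    language-ID model for any serious measurement.
    """
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
            return "han"
        if "\u0400" <= ch <= "\u04ff":  # Cyrillic block
            return "cyrillic"
    return "other"


# A tiny made-up sample, standing in for sentences drawn from the dataset.
sample = [
    "The Pile is a large text dataset.",
    "Это предложение на русском.",
    "这是一个中文句子。",
    "Another English sentence.",
]

props = language_proportions(sample, toy_detect)
for label, share in sorted(props.items()):
    print(f"{label}: {share:.2%}")
```

Taking the detector as a parameter keeps the counting logic independent of whichever language-ID tool you end up using, so the same function works on a random sample from each Pile component.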