EleutherAI / the-pile

MIT License

Appending data to the Pile. #99

Open shankerabhigyan opened 2 years ago

shankerabhigyan commented 2 years ago

Hi,

I wanted to know whether the Pile plans to integrate multilingual data anytime soon. Some organisations in India hold archives of scholarly articles and research work that haven't received the exposure they deserve because of language barriers in international research.

I'd also like some clarity on the key steps that follow once the data has been converted to the jsonlines format. It's also been mentioned that new data has to follow the lm_dataset format before it can be appended. Could you please explain the key attributes of that format, and how and at what point in the overall process it feeds into the creation of GPT-J? Thank you.

dboggs95 commented 1 year ago

@shankerabhigyan Read their paper, page 9. https://arxiv.org/abs/2101.00027 https://arxiv.org/pdf/2101.00027.pdf

A fully multi-lingual expansion of the Pile is in their future plans.