EleutherAI / the-pile

MIT License

Europarl #25

StellaAthena closed this issue 3 years ago

StellaAthena commented 3 years ago

Transcripts from EU Parliament meetings from 1996 to 2011. Contains approximately 4.5 GB of text.

Languages: French, Italian, Spanish, Portuguese, Romanian, English, Dutch, German, Danish, Swedish, Bulgarian, Czech, Polish, Slovak, Slovene, Finnish, Hungarian, Estonian, Latvian, Lithuanian, and Greek.

Link: www.statmt.org/europarl/

StellaAthena commented 3 years ago

Temporarily closing while we finish version 1.

thoppe commented 3 years ago

I could pull this, clean it up, and look at how it's organized if we are still interested. The parallel texts in many languages are interesting too. For v1, though, do we still want to keep all of the languages?

StellaAthena commented 3 years ago

We are pretty much done with V1 (8 GiB short) so I went ahead and removed the “deferred” label from all of the V2 datasets. You’re welcome to do this, but it’s going in V2 not V1.

On Sun, Sep 20, 2020 at 10:59 PM Travis Hoppe notifications@github.com wrote:

I could pull this, clean it up, and look at how it's organized if we are still interested. The parallel texts in many languages are interesting too. For v1, though, do we still want to keep all of the languages?

thoppe commented 3 years ago

Starting the processing on this. For reference, the data file is 1.5 GB, but it takes over 14 hours to download from the main site.
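
For a rough sense of scale, the two figures above imply an average transfer rate of under 30 kB/s. The snippet below is only a back-of-the-envelope sketch: it assumes "1.5GB" means decimal gigabytes and treats "over 14 hours" as exactly 14 hours, so the real rate was somewhat lower. The function name is illustrative and not part of any project code.

```python
def average_throughput(total_bytes, seconds):
    """Return the average transfer rate in bytes per second."""
    return total_bytes / seconds


# Assumption: "1.5GB" is read as 1.5 * 10**9 bytes (decimal gigabytes).
size_bytes = 1.5e9
# "Over 14 hours" is taken as exactly 14 hours, so this is an upper bound on the rate.
duration_seconds = 14 * 3600

rate = average_throughput(size_bytes, duration_seconds)
print(f"Average rate: at most {rate / 1000:.1f} kB/s")
```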

thoppe commented 3 years ago

This is complete. The processing code is here:

https://github.com/thoppe/The-Pile-EuroParl

A temporary download link is here: https://drive.google.com/file/d/15kQ6jAGHsI3ZrA0ibXGuTmzGdib9NA63/view?usp=sharing

  ✔ Saved to EuroParliamentProceedings_1996_2011.jsonl
  ℹ Saved 187,072 articles
  ℹ Uncompressed filesize   4,941,430,389
  ℹ Compressed filesize     1,475,803,930
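
Anyone who downloads the file may want to check it against the figures above. The following is a minimal sketch, not the code from the processing repository: it assumes only that the `.jsonl` file holds one JSON object per line, and makes no assumption about the fields inside each record, since the thread does not describe them. The helper names `summarize_jsonl` and `compression_ratio` are hypothetical.

```python
import json
import os


def summarize_jsonl(path):
    """Count the JSON records in a JSON Lines file and report its size in bytes."""
    records = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate stray blank lines
            json.loads(line)  # raises ValueError if the line is not valid JSON
            records += 1
    return {"records": records, "bytes": os.path.getsize(path)}


def compression_ratio(uncompressed_bytes, compressed_bytes):
    """Return how many times smaller the compressed file is."""
    return uncompressed_bytes / compressed_bytes


if __name__ == "__main__":
    # Figures copied from the log output above.
    articles = 187_072
    uncompressed = 4_941_430_389
    compressed = 1_475_803_930

    ratio = compression_ratio(uncompressed, compressed)
    print(f"Compression ratio: {ratio:.2f}x")
    print(f"Mean article size: {uncompressed / articles:,.0f} bytes")
```

Running `summarize_jsonl("EuroParliamentProceedings_1996_2011.jsonl")` on the decompressed download should report 187,072 records and 4,941,430,389 bytes if the file matches the log. The arithmetic in the `__main__` block gives a compression ratio of roughly 3.35.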

Once incorporated, this issue can be closed and moved to the completed section.