EleutherAI / the-pile

MIT License

United Nations Publications #38

Closed cfoster0 closed 3 years ago

cfoster0 commented 3 years ago

Languages: English, French, Spanish, Arabic, Russian, Chinese (should have translations for all of these)
Date ranges: 1946-2020
Size: 700,000 publications
Link to UN digital library.

Outstanding questions:

cfoster0 commented 3 years ago

Splitting out the speeches into a separate issue, #39.

cfoster0 commented 3 years ago

The PDF translations of a given document in the library are listed at https://digitallibrary.un.org/record/[NUMBER]/files/

I'm not entirely sure yet how they're ordered.
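
For anyone picking this up, here is a minimal sketch of building that listing URL from a record number. Only the URL pattern comes from the comment above; the helper name `files_url`, the example record number, and the assumption that record numbers are positive integers are all hypothetical. It does not fetch or parse the listing, since the ordering of the files is still unknown.

```python
# URL pattern taken from the comment above; [NUMBER] is the record number.
FILES_URL_TEMPLATE = "https://digitallibrary.un.org/record/{number}/files/"


def files_url(record_number: int) -> str:
    """Return the file-listing URL for one UN Digital Library record.

    Assumption: record numbers are positive integers.
    """
    if record_number <= 0:
        raise ValueError("record number must be a positive integer")
    return FILES_URL_TEMPLATE.format(number=record_number)


if __name__ == "__main__":
    # 12345 is a made-up record number, used only for illustration.
    print(files_url(12345))
```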

cfoster0 commented 3 years ago

@StellaAthena Feel free to assign this to me.

cfoster0 commented 3 years ago

I've completed the URL-collection portion of this. There are a bit over 1.8M downloadable PDFs in the database, spread fairly evenly across the 6 official languages.

StellaAthena commented 3 years ago

Awesome!

StellaAthena commented 3 years ago

@cfoster0 What ended up happening with this?

cfoster0 commented 3 years ago

Nothing new. If someone is interested in downloading the docs and/or converting them to text, I'd be happy to share the URLs I collected. Otherwise, I was waiting for the v1 work to finish.