EleutherAI / the-pile


Screenplays (Subtitles don't contain info about who says & does what) #49

Closed: christophschuhmann closed this issue 3 years ago

christophschuhmann commented 3 years ago

We should consider adding a screenplay dataset in addition to OpenSubtitles, because subtitles don't contain contextual information like where the actors go, what they do, how they behave, ...

This would contain valuable info about social interactions and situations.


I could write a scraper for several screenplay sites, if that would be welcome. :)
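For illustration, a minimal scraping sketch in Python. The site URL and CSS selector are placeholders, not real sources; each screenplay site would need its own parsing logic, plus rate limiting and a robots.txt check.

```python
# Minimal scraping sketch (hypothetical site and selectors; every screenplay
# site needs its own parsing logic and politeness rules).
import time
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example-screenplay-site.com"  # placeholder, not a real source

def list_script_urls(index_path="/all-scripts"):
    """Collect links to individual script pages from a (hypothetical) index page."""
    html = requests.get(BASE_URL + index_path, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.select("a.script-link"):  # selector is made up
        href = a["href"]
        links.append(href if href.startswith("http") else BASE_URL + href)
    return links

def fetch_script_text(url):
    """Download one script page and return its visible text."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text("\n")

if __name__ == "__main__":
    for url in list_script_urls():
        text = fetch_script_text(url)
        fname = url.rstrip("/").split("/")[-1] + ".txt"
        with open(fname, "w", encoding="utf-8") as f:
            f.write(text)
        time.sleep(1)  # be polite to the server
```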

StellaAthena commented 3 years ago

That sounds really interesting. I believe that Project Gutenberg contains many screenplays, though, at least pre-1919 ones, and Bibliotik does too. I worry about the non-duplicative text : work ratio, but if you want to scrape some sites and see what you come away with, by all means go ahead!

Christoph-Schuhmann commented 3 years ago

Great, I will look into it! :)

xylankant commented 3 years ago

Maybe check out ScriptBase for this, which came out of my PhD thesis: https://github.com/EdinburghNLP/scriptbase. Not that large, but largely cleaned up.

StellaAthena commented 3 years ago

@xylankant What is the total size of the text?

xylankant commented 3 years ago

@StellaAthena Quite small, in the grand scheme of things. Scripts were mostly crawled from imsdb.com, with a few other sources thrown in.

scriptbase_alpha (~1200 scripts, not cleaned up) totals around 235MB, scriptbase_j (~900 scripts, manually corrected for inconsistencies) is around 175MB.

I've just noticed that a few of the .tar.gz files in the corpus seem to be corrupted... I'll try and fix that when I can.

Christoph-Schuhmann commented 3 years ago

So far I've scraped ~12,000 scripts. But there are some duplicates in it, and also some non-searchable PDFs I need to convert with OCR or similar.

Working on it.

xylankant commented 3 years ago

Amazing! I'll leave it to you then 👍

edit: by the way, where are you scraping them from? When I compiled the above corpus (granted, that was some years ago), there were very few actual scripts available (lots were "transcripts").

Christoph-Schuhmann commented 3 years ago

You can find my sources here: https://docs.google.com/document/d/1D3i3zG18fk4Az0L1tMv5NyzawRdAgUQbhDnF3nq8bHs/edit?usp=sharing

Where I pasted a Colab link, I have already scraped the scripts.

@xylankant If you'd like to help with the OCR etc., I'd be very open :D

StellaAthena commented 3 years ago

> scriptbase_alpha (~1200 scripts, not cleaned up) totals around 235MB, scriptbase_j (~900 scripts, manually corrected for inconsistencies) is around 175MB.

> So far I've scraped ~12,000 scripts. But there are some duplicates in it, and also some non-searchable PDFs I need to convert with OCR or similar.

Obviously the bigger the better, but we need this to be measured in gigabytes at a minimum for it to be worthwhile, ideally tens of gigabytes. There is still plenty of low-hanging fruit where we can download tens or hundreds of gigabytes that has already been scraped. For example, there is 122 GB of Wikipedia we still haven't added. Our goal is 10 TiB (for version 2; version 1 is coming out in about a month and we've already frozen its contents), so if we are working with datasets that are in the 100 MB range, it'll take forever. The work:reward ratio just isn't there.

Christoph-Schuhmann commented 3 years ago

A script is about one hundred KB, so with ~12,000 scripts I already have around one gigabyte. :)

But getting much more will be difficult, because I have already covered the biggest sites.

In the mid-term future it would be interesting to automatically create pseudo-scripts from videos on Netflix, Amazon Prime, YouTube, ... by transcribing which speaker says what, which actions are performed in the current footage, and which objects are visible in the current frame. That wouldn't be a screenplay, but it would provide the algorithm with additional information about the current scene.

StellaAthena commented 3 years ago

30 pages per script? Hmmm.... for some reason that sounds low to me. I would have expected more, given that movies are several hours long.

What you’ve already collected sounds awesome to me. If you’d like to put in the work to try to push those numbers up a bit you can, but it’s already worth including, and like I said, there are other lower-hanging, bigger datasets.

Let me or @leogao2 know if you need help with data processing. We’re using a custom data format (see the README), but adding new data is pretty straightforward: fork the repo, write a corresponding class in the-pile/datasets.py, and open a pull request. Please make sure to open the pull request against the Version2 branch and not the master branch.
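For reference, a hypothetical sketch of what such a class might look like. The method names (name, documents, size, num_docs) and the local directory layout are assumptions based on the description above and the README; the real interface on the Version2 branch may differ.

```python
# Hypothetical sketch of a new dataset class for the-pile/datasets.py.
# Method names and the data directory are assumptions; match whatever
# interface the Version2 branch actually defines.
import os

class ScreenplaysDataset:
    DATA_DIR = "data/screenplays"  # placeholder location of the scraped .txt files

    def name(self):
        return "Screenplays"

    def documents(self):
        # Yield one screenplay per document.
        for fname in sorted(os.listdir(self.DATA_DIR)):
            if fname.endswith(".txt"):
                with open(os.path.join(self.DATA_DIR, fname), encoding="utf-8") as f:
                    yield f.read()

    def size(self):
        # Total size of the raw text in bytes.
        return sum(os.path.getsize(os.path.join(self.DATA_DIR, f))
                   for f in os.listdir(self.DATA_DIR) if f.endswith(".txt"))

    def num_docs(self):
        return sum(1 for f in os.listdir(self.DATA_DIR) if f.endswith(".txt"))
```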

Christoph-Schuhmann commented 3 years ago

pdftotext actually works! :)


https://colab.research.google.com/drive/1jmyEvJHirF4Wapg1AiJ5ewqCsxioY06v?usp=sharing
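The Colab itself isn't reproduced here. As an illustration, batch conversion of text-based PDFs could look like the sketch below, which shells out to poppler's pdftotext CLI (an assumption about the tooling; the notebook may use a different library). Paths are placeholders.

```python
# Batch-convert text-based PDFs to .txt with poppler's pdftotext CLI
# (assumes poppler-utils is installed; directory names are placeholders).
import pathlib
import subprocess

src_dir = pathlib.Path("scripts_pdf")
out_dir = pathlib.Path("scripts_txt")
out_dir.mkdir(exist_ok=True)

for pdf in sorted(src_dir.glob("*.pdf")):
    out = out_dir / (pdf.stem + ".txt")
    # -layout keeps the rough page layout, which helps preserve screenplay formatting
    result = subprocess.run(["pdftotext", "-layout", str(pdf), str(out)])
    if result.returncode != 0:
        print(f"failed (probably an image-only PDF, needs OCR): {pdf.name}")
```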

Christoph-Schuhmann commented 3 years ago

I converted the PDFs, DOCs, etc. to TXT and arrived at 9,319 files, roughly 1 GB in total.


No dedupe done yet; I will also try to find some more, but the big sites with many free scripts are already included.
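A minimal sketch of how an exact-duplicate pass could work, hashing file contents with SHA-256 (paths are placeholders). This only catches byte-identical files; the same script scraped from two sites with different formatting would need fuzzy matching such as MinHash/shingling on top.

```python
# Exact-duplicate removal by content hash (placeholder directory name).
import hashlib
import pathlib

txt_dir = pathlib.Path("scripts_txt")
seen = {}        # digest -> first file with that content
duplicates = []  # (duplicate, original) pairs

for path in sorted(txt_dir.glob("*.txt")):
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest in seen:
        duplicates.append((path, seen[digest]))
    else:
        seen[digest] = path

print(f"{len(seen)} unique files, {len(duplicates)} exact duplicates")
for dup, original in duplicates:
    dup.unlink()  # drop the duplicate copy, keep the first occurrence
```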

Christoph-Schuhmann commented 3 years ago

I have used textract (https://textract.readthedocs.io/en/stable/) to convert image-based PDFs of screenplays to txt files, and the results are pretty good (see the usage sketch after the screenshots):

Original:

[screenshot: page of the original image-based PDF]

Extraction:

[screenshot: the extracted text]
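For reference, a minimal sketch of how textract can be driven for this kind of OCR conversion, assuming the tesseract engine is installed on the system; the directory names are placeholders and the exact options used in the Colab may differ.

```python
# OCR image-based screenplay PDFs to .txt with textract
# (requires the tesseract OCR engine to be installed; paths are placeholders).
import pathlib
import textract

src_dir = pathlib.Path("scripts_pdf_scanned")
out_dir = pathlib.Path("scripts_txt_ocr")
out_dir.mkdir(exist_ok=True)

for pdf in sorted(src_dir.glob("*.pdf")):
    # method="tesseract" forces OCR instead of the default text-layer extraction
    raw = textract.process(str(pdf), method="tesseract")
    text = raw.decode("utf-8", errors="replace")
    (out_dir / (pdf.stem + ".txt")).write_text(text, encoding="utf-8")
```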