
RichardLitt / language-niche-research

Scientific repository for fiddling around with linguistic data

MIT License


Subtitle corpora #1

Open RichardLitt opened 9 years ago

RichardLitt commented 9 years ago

To do:

[x] Get relevant n-grams of the corpora.
[ ] Compare different n-grams for co-occurrence in both English and US corpora.
[ ] Check out the surprisal tool that used to be in NLTK. Find out why it was removed and where it can be used now. Get the package from the Piantadosi paper.
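
For the first two items, a minimal sketch of counting n-grams and finding the ones attested in both corpora might look like the following. It uses only the standard library, tokenizes on whitespace, and the two corpus strings are made-up stand-ins rather than real subtitle data; the helper names `ngrams` and `ngram_counts` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams (as tuples) of a token sequence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n):
    """Count n-grams in lowercased, whitespace-tokenized text."""
    return Counter(ngrams(text.lower().split(), n))


# Toy stand-ins for the two subtitle corpora (assumed data, not the real corpora).
corpus_a = "i have got a flat in the city"
corpus_b = "i have an apartment in the city"

counts_a = ngram_counts(corpus_a, 2)
counts_b = ngram_counts(corpus_b, 2)

# Bigrams that occur in both corpora.
shared = sorted(set(counts_a) & set(counts_b))
print(shared)
```

Whitespace tokenization is the simplest possible choice here; real subtitle text would probably need punctuation handling before the counts mean much.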

RichardLitt commented 9 years ago

Look at:

ngrampy
This discussion of calculating perplexity in NLTK: link
This code -- potentially fixed -- but still not bundled with NLTK: link
More python n-gram stuff here
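
Until one of the tools above is sorted out, surprisal and perplexity can be approximated by hand. The sketch below is a standard-library bigram model with add-one smoothing. It is not the code that was removed from NLTK, nor the package from the Piantadosi paper, and the function names `train_bigram`, `surprisal`, and `perplexity` are hypothetical.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Collect the counts needed for an add-one smoothed bigram model."""
    contexts = Counter(tokens[:-1])            # how often each word is followed by something
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(set(tokens))
    return contexts, bigrams, vocab_size


def surprisal(prev, word, model):
    """Surprisal in bits: -log2 P(word | prev), with add-one smoothing."""
    contexts, bigrams, vocab_size = model
    prob = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
    return -math.log2(prob)


def perplexity(tokens, model):
    """Perplexity = 2 ** (mean per-word surprisal) over the token sequence."""
    bits = [surprisal(p, w, model) for p, w in zip(tokens, tokens[1:])]
    return 2 ** (sum(bits) / len(bits))


# Toy training text (assumed data).
train = "the cat sat on the mat".split()
model = train_bigram(train)

print(surprisal("the", "cat", model))   # seen bigram: lower surprisal
print(surprisal("the", "dog", model))   # unseen bigram: higher surprisal
print(perplexity(train, model))
```

Add-one smoothing is crude, and words outside the training vocabulary are not modeled properly here, so this is only good for sanity-checking numbers from a real tool.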

RichardLitt commented 9 years ago

Catting all the files together will result in strange behavior at the edges of the files. It might be best to separate the files with a newline when catting them, and then parse them. Probably not a big deal.

glupyan commented 9 years ago

What kind of strange behavior? Just because it's a newline?

RichardLitt commented 9 years ago

Nah, I was worried that any n-grams that go over the divide between files wouldn't be useful, as they would be from different speakers. I just checked again, though, and the corpus as a whole doesn't differentiate between speakers, so this is really moot. Catting it all is fine.
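
If the file-boundary question ever does matter, a rough way to see how many n-grams are affected is to compare n-grams from the concatenated text against n-grams collected per file. This is only a sketch with made-up file contents; the `ngrams` helper is hypothetical.

```python
def ngrams(tokens, n):
    """Return the n-grams (as tuples) of a token sequence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical contents of two subtitle files.
files = ["see you tomorrow", "good morning everyone"]

# Bigrams from everything catted together.
concatenated = " ".join(files).split()
all_bigrams = set(ngrams(concatenated, 2))

# Bigrams collected one file at a time, so none can span a boundary.
per_file = set()
for text in files:
    per_file.update(ngrams(text.split(), 2))

# Bigrams that exist only because two files were joined.
crossing = all_bigrams - per_file
print(crossing)
```

The set difference misses a boundary-crossing n-gram that also happens to occur inside a file, so it gives a lower bound. At most n-1 n-grams can cross each boundary, which is why catting is fine for a large corpus.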