RichardLitt / language-niche-research

Scientific repository for fiddling around with linguistic data
MIT License

Subtitle corpora #1

Open RichardLitt opened 9 years ago

RichardLitt commented 9 years ago

To do:

RichardLitt commented 9 years ago

Look at:

RichardLitt commented 9 years ago

`cat`ing all the files together will result in strange behavior at the file boundaries. It would probably be best to put each file's contents on its own line in the combined file, and then parse that, maybe. Probably not a big deal.
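A minimal sketch of the "each file starts on a new line" idea, in case it is useful. It assumes the subtitle files are plain UTF-8 text; `concat_with_newlines` is a hypothetical helper name, not something that exists in this repo.

```python
from pathlib import Path


def concat_with_newlines(paths):
    """Join text files so that every file's content ends with a newline.

    Plain `cat` glues the last line of one file onto the first line of
    the next when a file lacks a trailing newline; this avoids that.
    Assumes UTF-8 plain-text input (an assumption, not checked here).
    """
    parts = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        # Add a newline only when the file is non-empty and lacks one.
        if text and not text.endswith("\n"):
            text += "\n"
        parts.append(text)
    return "".join(parts)
```

This only keeps lines from different files from being fused together. A line-by-line parser would then never build an n-gram across two files.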

glupyan commented 9 years ago

What kind of strange behavior? Just because it's a newline?

RichardLitt commented 9 years ago

Nah, I was worried that any n-grams that cross a file boundary wouldn't be useful, as they would come from different speakers. Just checked again, though, and the corpus as a whole doesn't differentiate between speakers, so this is really moot. `cat`ing it all is fine.
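For reference, a rough sketch of how one could list the n-grams that only exist because two files were joined. It assumes each file has already been split into a list of tokens (for example by whitespace); `ngrams` and `boundary_ngrams` are hypothetical names for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def boundary_ngrams(files_tokens, n):
    """Return the n-grams that span the boundary between adjacent files.

    files_tokens is a list of token lists, one per file, in the order
    the files would be concatenated. Simplification: n-grams that would
    span three or more very short files are not considered.
    """
    if n < 2:
        return []  # a unigram can never cross a boundary
    spanning = []
    for left, right in zip(files_tokens, files_tokens[1:]):
        # At most n-1 tokens from each side, so every n-gram in this
        # window necessarily contains tokens from both files.
        window = left[-(n - 1):] + right[:n - 1]
        spanning.extend(ngrams(window, n))
    return spanning


files = [["see", "you", "later"], ["good", "morning", "everyone"]]
print(boundary_ngrams(files, 2))
```

With these two toy files, the only bigram that crosses the divide is `("later", "good")`, so each pair of adjacent files adds at most `n - 1` extra n-grams.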