EmilHvitfeldt / textdata

Download, parse, store, and load text datasets instead of storing them in packages
https://emilhvitfeldt.github.io/textdata/
Other

tidytext data #6

EmilHvitfeldt closed 5 years ago

EmilHvitfeldt commented 5 years ago

@juliasilge now is the time to suggest more datasets if you want 😄 I know there has been interest earlier.

juliasilge commented 5 years ago

TBH I think NRC is going to have to be out-of-scope. I am still waiting on an email but the creator really does sound like he does not want this data redistributed at all.

I also think that stop words do not need to be in scope because of the excellent stopwords package, which tidytext depends on.

I am not sure the parts of speech dataset is worth spending time on because using this kind of unigram, tidy data approach performs quite poorly for POS tagging. You really do need a deep learning or otherwise more complex approach, such as that implemented in cleanNLP. I don't hear about anybody using this dataset really; I may just deprecate it, although I don't see a significant problem with the license either.

EmilHvitfeldt commented 5 years ago

Perfect, everything should be in order now.