English-only docs in JSTOR corpus

cmu-lib / text_explorer

Shiny app for exploring multiple corpora with text vector models including topic models, word2vec, doc2vec, keyness


English-only docs in JSTOR corpus #6

Closed mdlincoln closed 4 years ago

mdlincoln commented 4 years ago

The JSTOR "language" metadata is relatively unreliable: even though I can filter out known non-English documents, about 40% of the documents in the corpus have no language tag at all. Most of these are English, but there are quite a few in Spanish and German - enough that the topic modeling picks up on those subsets as a "topic".

mdlincoln commented 4 years ago

Turns out the metadata is not going to help us here - JSTOR doesn't make the language tag consistently available.
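
Since the tags can't be relied on, one possible fallback is to guess the language of untagged documents from their text. The thread doesn't say which approach was eventually used, so what follows is only a minimal sketch under stated assumptions: the `language` and `text` field names, the `"en"` tag value, and the tiny stopword lists are all hypothetical, and a real corpus would likely need a proper language-detection library instead of this heuristic.

```python
import re

# Tiny hand-picked stopword lists (illustrative assumption, not exhaustive).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "for", "with", "was"},
    "es": {"el", "la", "los", "las", "de", "que", "y", "en", "por", "con"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "mit", "von", "zu", "ein"},
}


def guess_language(text):
    """Return the language code whose stopwords occur most often, or None."""
    # Split into lowercase runs of letters (Unicode-aware, so accents survive).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    # No stopword from any list was seen: refuse to guess.
    return best if counts[best] > 0 else None


def keep_english(docs):
    """Keep docs tagged English, plus untagged docs that look English."""
    kept = []
    for doc in docs:
        tag = doc.get("language")  # hypothetical metadata field
        if tag == "en" or (tag is None and guess_language(doc["text"]) == "en"):
            kept.append(doc)
    return kept
```

Documents with an explicit non-English tag are still dropped on the tag alone; the text-based guess only applies to the untagged share, which is where the Spanish and German subsets would otherwise slip through into the topic model.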