The JSTOR "language" metadata is fairly unreliable: I can filter out documents that are tagged as non-English, but about 40% of the corpus has no language tag at all. Most of the untagged documents are English, but there are quite a few in Spanish and German, enough that the topic modeling picks up on those subsets as "topics" of their own.
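One possible way to deal with the untagged documents is to guess their language from the text itself before topic modeling. The following is only a minimal sketch of that idea, not something taken from JSTOR or from any particular library: the `guess_language` function and the tiny `STOPWORDS` lists are hypothetical, and a real corpus would probably need fuller stopword lists or a dedicated language-identification tool.

```python
import re

# Hypothetical, deliberately tiny stopword lists (an assumption for this sketch).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "for", "with", "as"},
    "es": {"el", "la", "los", "las", "de", "que", "y", "en", "un", "una", "por", "con"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "mit", "von", "zu"},
}


def guess_language(text):
    """Return the language code whose stopwords occur most often, or None."""
    # Lowercase and keep runs of letters only (Unicode-aware, so "erzählen" stays whole).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return None
    # Count how many tokens fall into each language's stopword list.
    counts = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    # If no stopword from any list was seen, make no guess.
    return best if counts[best] > 0 else None
```

Under these assumptions, documents without a language tag could be kept only when `guess_language(text) == "en"`, so that the Spanish and German subsets are dropped before the topic model is fit. Very short texts give unreliable guesses with a heuristic like this.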