DARIAH-DE / TopicsExplorer

Explore your own text collection with a topic model – without prior knowledge.
https://dariah-de.github.io/TopicsExplorer
Apache License 2.0

Number of text corpuses #125

Open Glorifier85 opened 3 years ago

Glorifier85 commented 3 years ago

Hi there,

first off, great application! Intuitive and easy to use - exactly what I needed. The question I have is: is there a reason why the minimum number of texts to be chosen is ten? I am sure there is, but can we change that somehow? What if I just wanted to tokenize and compare two corpora?

Thanks! Glorifier

severinsimmler commented 3 years ago

Hi @Glorifier85,

thank you for the positive feedback, we are very happy that the application is useful for people.

Topic modeling is a technique that works well with a large number of documents. I think it makes no theoretical or practical sense to topic model fewer than 10 documents (although the threshold of 10 is actually more or less arbitrary). Please refer, for example, to Tang et al.:

The number of documents plays perhaps the most important role; it is theoretically impossible to guarantee identification of topics from a small number of documents, no matter how long.

The length of the documents also plays an important role. Maybe you should consider segmenting the documents of your small corpus – topic modeling works quite well even with tweets (i.e. 280 characters), see e.g. Ordun et al.

Glorifier85 commented 3 years ago

Hi @severinsimmler,

thanks for your response, much appreciated! Understood regarding the number of corpora. Frankly, I can see why the length of documents plays a role, but I don't quite understand why the sheer number of documents would be so important. I'll have a close look at the papers you've linked, though.

Speaking of document length, is there an optimal length in terms of word count? Törnberg (2016), see link below, for example mentions splitting documents into chunks of 1,000 words. Is this something you can confirm? https://www.sciencedirect.com/science/article/pii/S2211695816300290

Many thanks!

severinsimmler commented 3 years ago

but I don't quite understand why the sheer number of documents would be so important

Most natural language processing algorithms are designed to extract information from extensive data sets. In general, one could say: the more, the better. Always. But I think this is also a question of methodology. If I only have two documents, why would I need a quantitative method at all? I could evaluate the texts with qualitative methods (e.g. close reading) and probably gain more valuable insights.

they split documents in chunks of 1000-word texts. Is this something you can confirm?

Yes, 1000 words per document is a good starting point. I don't know how your texts are structured, but you could also segment by paragraph or chapter.
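Such a pre-processing step is easy to script yourself. Here is a rough, hypothetical sketch in Python (the paths, file names and chunk size are just placeholders, not part of TopicsExplorer) that splits one large plain-text file into ~1,000-word chunks, each of which can then be loaded as a separate document:

```python
# Hypothetical pre-processing sketch (not part of TopicsExplorer):
# split one large plain-text file into ~1,000-word chunks so that each
# chunk can be used as a separate document in the topic model.
from pathlib import Path

CHUNK_SIZE = 1000  # words per pseudo-document

source = Path("my_corpus/comments.txt")   # placeholder input file
out_dir = Path("my_corpus/chunks")
out_dir.mkdir(parents=True, exist_ok=True)

words = source.read_text(encoding="utf-8").split()

for i in range(0, len(words), CHUNK_SIZE):
    chunk = " ".join(words[i:i + CHUNK_SIZE])
    # writes e.g. comments_0001.txt, comments_0002.txt, ...
    (out_dir / f"{source.stem}_{i // CHUNK_SIZE + 1:04d}.txt").write_text(
        chunk, encoding="utf-8"
    )
```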

Glorifier85 commented 3 years ago

Thanks again! To your knowledge, is there a maximum number of words per document that should not be exceeded, like a hard cap? I am planning to model social media comments from news outlets over a certain period of time (3-5 years), so I might end up with >500k words per document.

Thanks!