ariddell / tatom

Quantitative Text Analysis for the digitale Geisteswissenschaften
https://de.dariah.eu/tatom/

Topic modeling_Split the novels #13

Closed dkltimon closed 9 years ago

dkltimon commented 9 years ago

Hi Allen,

Sorry to bother you again, I have a dumb question:

https://de.dariah.eu/tatom/topic_model_mallet.html

Here you wrote "Because these are lengthy texts, the novels are split up into smaller sections—a preprocessing step which improves results considerably."

My question is: are there any rules about the length (or size) of the smaller sections? One paragraph as a section? One chapter of the novel? Or maybe the length of the smaller sections is not important, since we will combine the results of topic modelling in the end anyway.

I've noticed that almost all of your data files are about 6 or 7 kB. I assume this is the right size to aim for?

Thanks a lot!

ariddell commented 9 years ago

If you have reliable information about paragraphs, there's no reason not to model paragraphs other than the increase in computation time. Chapters are great as well. Otherwise, splitting every 1000 words would work -- but the choice is arbitrary.
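
For illustration, the fixed-length option could be sketched roughly as follows. This is only a minimal sketch, not the code used for the tutorial: the function name `split_every_n_words` and its `chunk_size` parameter are hypothetical, and words are assumed to be whatever `str.split()` separates on whitespace.

```python
def split_every_n_words(text, chunk_size=1000):
    """Split text into consecutive chunks of at most chunk_size words.

    Words are whitespace-delimited tokens (an assumption of this sketch);
    the last chunk may be shorter than chunk_size.
    """
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


if __name__ == "__main__":
    # Tiny demonstration with chunk_size=2 instead of 1000.
    for chunk in split_every_n_words("a b c d e", chunk_size=2):
        print(chunk)
```

Each chunk would then be written to its own file and treated as one document by the topic model.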

christofs commented 9 years ago

All valid choices, I guess. Ideally, I think we would split on borders between "scenes", the assumption(s) being that scenes form a meaningful unit, that they may have just the right size (although they may be very unequal in length), and that it makes sense to keep such units intact for best results from topic modeling. However, we usually don't have any information about scene boundaries. So the next larger unit we tend to have information about is the chapter, and the next smaller unit we tend to have information about is the paragraph. However, paragraphs in novels are sometimes extremely short if you define their border by a newline. For example, in an extended dialogue, each statement by a person will be one paragraph. So I think the best solution we currently have is to use something like "around n words (maybe 1000, or 2000), but cutting only on paragraph boundaries". What would be really interesting is a study comparing results of topic modeling for the same texts but using different splitting strategies, just to see whether it even makes a big difference.
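
The "around n words, but cutting only on paragraph boundaries" idea could be sketched like this. It is a rough sketch under stated assumptions, not an established implementation: the name `split_on_paragraphs` and the `target_words` parameter are hypothetical, and paragraphs are assumed to be separated by blank lines (adjust the regular expression if a single newline marks a paragraph in your texts).

```python
import re


def split_on_paragraphs(text, target_words=1000):
    """Group whole paragraphs into sections of roughly target_words words.

    A section is closed as soon as its word count reaches target_words,
    so sections can overshoot the target but never cut a paragraph.
    Assumes paragraphs are separated by one or more blank lines.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    sections, current, count = [], [], 0
    for paragraph in paragraphs:
        current.append(paragraph)
        count += len(paragraph.split())
        if count >= target_words:
            sections.append("\n\n".join(current))
            current, count = [], 0
    if current:
        # Remaining paragraphs form a final, possibly short, section.
        sections.append("\n\n".join(current))
    return sections
```

With this greedy grouping, short dialogue paragraphs get merged with their neighbours, while a single very long paragraph simply becomes its own oversized section.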