ffftzh / BTM-Java

A Java implementation of the Biterm Topic Model
Maybe I found a bug #1

Open ShanceWang opened 8 years ago

ShanceWang commented 8 years ago

If a document in the docs contains only one word, then after running the program its result in "model-final.theta" is all zeros.

ffftzh commented 8 years ago

@ShanceWang BTM learns topics from word co-occurrence. Each document is treated as a mixture of co-occurring word pairs (biterms), so a document that contains only one word has no word pairs and is meaningless to the model.
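
To illustrate the point above, here is a minimal sketch of biterm extraction. It is not taken from the BTM-Java source: the class name `BitermSketch` and the method `extractBiterms` are hypothetical, and it assumes every unordered pair of word positions in a document forms a biterm (no window limit, repeated words kept). Under those assumptions a one-word document yields no biterms, and since the document's topic proportions are inferred by summing over its biterms, an empty sum would leave every entry at zero.

```java
import java.util.ArrayList;
import java.util.List;

public class BitermSketch {
    /**
     * Hypothetical helper: returns every unordered pair of word positions
     * in one document. Assumes no co-occurrence window and keeps pairs of
     * identical words; the real implementation may differ.
     */
    static List<String[]> extractBiterms(String[] words) {
        List<String[]> biterms = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            for (int j = i + 1; j < words.length; j++) {
                biterms.add(new String[] {words[i], words[j]});
            }
        }
        return biterms;
    }

    public static void main(String[] args) {
        // One word: no pairs can be formed.
        System.out.println(extractBiterms("day".split("\\s+")).size());      // prints 0
        // Two words (even identical ones): exactly one pair.
        System.out.println(extractBiterms("day day".split("\\s+")).size());  // prints 1
        // Three words: three pairs.
        System.out.println(extractBiterms("a b c".split("\\s+")).size());    // prints 3
    }
}
```

A practical consequence, if this assumption holds, is that one-word documents could be filtered out before training, or their all-zero rows ignored afterwards.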

ShanceWang commented 8 years ago

Thanks for your reply. Uh, another problem: sometimes a space is recognized as a word, so it appears in the wordmap with a label. (I'm sure the preprocessing of the docs is good.) From what I've observed, it may be caused by documents that contain only the same word twice, such as "day day" or "danger danger". Can it be explained by the same reason? Thanks again for your kind help!

ffftzh commented 8 years ago

Maybe there are some empty lines in the dataset, or it is a special character that just looks like a space.
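
One way to check for both causes is to scan the input before training. The sketch below is only an illustration: `WhitespaceCheck` and `findSuspiciousLines` are hypothetical names, and the check for leading or doubled spaces assumes the tokenizer splits on a single space, which has not been confirmed against the BTM-Java source.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceCheck {
    /**
     * Hypothetical helper: reports blank lines, space-like characters other
     * than the plain ASCII space, and leading or doubled spaces (which would
     * produce an empty token if the tokenizer splits on a single space).
     */
    static List<String> findSuspiciousLines(List<String> lines) {
        List<String> report = new ArrayList<>();
        for (int n = 0; n < lines.size(); n++) {
            String line = lines.get(n);
            if (line.trim().isEmpty()) {
                report.add((n + 1) + ": empty or blank line");
                continue;
            }
            if (line.startsWith(" ") || line.contains("  ")) {
                report.add((n + 1) + ": leading or doubled space");
            }
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                // isSpaceChar also covers non-breaking spaces such as U+00A0,
                // which isWhitespace deliberately excludes.
                boolean spaceLike = Character.isWhitespace(c) || Character.isSpaceChar(c);
                if (spaceLike && c != ' ') {
                    report.add(String.format("%d: U+%04X at column %d", n + 1, (int) c, i + 1));
                }
            }
        }
        return report;
    }
}
```

Passing it the lines of the training file (for example from `java.nio.file.Files.readAllLines`) gives a list of line numbers to inspect; an empty list suggests the cause lies elsewhere.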