dice-group / Palmetto

Palmetto is a quality measuring tool for topics
GNU Affero General Public License v3.0

All values NaN's #1

Closed: ramkikannan closed this issue 8 years ago

ramkikannan commented 8 years ago

For the Enron email dataset, I built a topic file using NMF. For the same text dataset, I built the index using the CreateIndexForPalmetto Java file. With the attached topic file, Palmetto generates NaN for every score I run.

When I debugged the code, the count under getProbabilities in the function org.aksw.palmetto.DirectConfirmationBasedCoherence.calculateCoherences is always zero. Can you please let me know what the mistake is?

For your reference, you can download the input text file (where every line is a document), the source code, the indexes and the topic files as a zip file from this Dropbox link: https://dl.dropboxusercontent.com/u/45630765/palmettodebug.zip

Appreciate your help.

Ramki

MichaelRoeder commented 8 years ago

Hi Ramki,

The problem is caused by the format of the file containing your topics. Your file is comma-separated, but Palmetto expects the words to be space-separated.

Thus, for a topic containing the words "power", "project", "company", "india" and "energy", your file should contain the line: power project company india energy
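The conversion described above can be sketched in a few lines of Python. This is only an illustration, not part of Palmetto: the function names and the file paths passed to them are hypothetical, and it assumes one topic per line with words separated by commas.

```python
def to_space_separated(line: str) -> str:
    """Turn one comma-separated topic line into a space-separated one."""
    # Split on commas, trim surrounding whitespace, and drop empty entries
    # (for example those produced by a trailing comma or newline).
    words = [w.strip() for w in line.split(",")]
    return " ".join(w for w in words if w)


def convert_topic_file(src_path: str, dst_path: str) -> None:
    """Rewrite a comma-separated topic file with one space-separated topic per line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            converted = to_space_separated(line)
            if converted:  # skip blank lines
                dst.write(converted + "\n")
```

For example, to_space_separated("power,project,company,india,energy") returns the line shown above.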

Please let me know whether this solved your problem.

Cheers, Michael

ramkikannan commented 8 years ago

Thanks a lot for your prompt response. When I use the wikipedia_bd file, I now get results with the attached topic files. The PositionStoring Index is working fine with my dataset.

But the simple boolean index is not working. NPMI, UCI and C_P all give scores of zero, while C_A and C_V all give scores of one. Is this expected?

Also, if I want to generate topic similarities as the average pairwise word similarity between topics, is there a way to do that now?

Kindly provide the citation information for this implementation.

MichaelRoeder commented 8 years ago

Hi Ramki,

The simple boolean index is not working with these coherences because these are window based coherences (as described here) that need the word positions (as described here). You will have to use the position storing index for these coherences.
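To illustrate why a window-based count needs word positions, here is a minimal Python sketch. It is not Palmetto's implementation; the function names and the example sentence are made up for illustration. A boolean view of a document can only say whether two words both occur somewhere in it, while a sliding window has to know where each token stands.

```python
def boolean_cooccurrence(tokens, word_a, word_b):
    """Boolean-document view: do both words appear anywhere in the document?"""
    present = set(tokens)
    return word_a in present and word_b in present


def window_cooccurrence(tokens, word_a, word_b, window_size):
    """Sliding-window view: count the windows that contain both words.

    This needs the token order, which a purely boolean index does not keep.
    """
    hits = 0
    # A document shorter than the window is treated as a single window.
    last_start = max(len(tokens) - window_size, 0)
    for start in range(last_start + 1):
        window = set(tokens[start:start + window_size])
        if word_a in window and word_b in window:
            hits += 1
    return hits


tokens = "power project in india needs energy from the company".split()
both_present = boolean_cooccurrence(tokens, "power", "company")
near_each_other = window_cooccurrence(tokens, "power", "company", 3)
```

Here both_present is True, because both words occur in the document, but near_each_other is 0, because they never fall inside the same three-token window.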

I am not quite sure whether I got your idea right. Can you send me a mail with a more detailed description? You can find my mail address on the project website: http://palmetto.aksw.org

If you are using Palmetto for an experiment or something similar that leads to a publication, please cite the paper "Exploring the Space of Topic Coherence Measures" that you can find on the project website. A link to the project website is welcome as well :)