dice-group / Palmetto

Palmetto is a quality measuring tool for topics
GNU Affero General Public License v3.0

Unable to replicate coherence scores from paper #13

Open NadineH1990 opened 7 years ago

NadineH1990 commented 7 years ago

Dear all,

For my research I want to compare a new semantic coherence measure against the ones available in Palmetto, especially C_V and C_A. I'm trying to replicate some of the results described in your paper ("Exploring the Space of Topic Coherence Measures") using the topics and human ratings that you have published. However, I'm not able to reproduce the Pearson correlation scores from Table 2. Taking a closer look, I found that I'm also not able to replicate the coherence scores shown in Table 8 of the paper. That table lists the following coherence scores for the measure C_V:

0.94 company sell corporation own acquire purchase buy business sale owner
0.91 age population household female family census live average median income
0.86 jewish israel jew israeli jerusalem rabbi hebrew palestinian palestine holocaust

Running Palmetto as a jar, using the wikipedia_bd index downloaded from the link on GitHub and the above topics in a .txt file, I get the following scores:

0.52072 company sell corporation own acquire purchase buy business sale owner
0.75174 age population household female family census live average median income
0.73356 jewish israel jew israeli jerusalem rabbi hebrew palestinian palestine holocaust

Using the web service I also get different scores:

http://palmetto.aksw.org/palmetto-webapp/service/cv?words=company%20sell%20corporation%20own%20acquire%20purchase%20buy%20business%20sale%20owner

http://palmetto.aksw.org/palmetto-webapp/service/cv?words=age%20population%20household%20female%20family%20census%20live%20average%20median%20income

http://palmetto.aksw.org/palmetto-webapp/service/cv?words=jewish%20israel%20jew%20israeli%20jerusalem%20rabbi%20hebrew%20palestinian%20palestine%20holocaust

Am I making a mistake somewhere? How can the scores here be different from the C_V scores displayed in the paper?

MichaelRoeder commented 7 years ago

Thanks for using Palmetto and for pointing out this important difference. Sorry that I cannot give a direct answer right now, but I hope to have some time during the weekend to search for the reason for this difference.

NadineH1990 commented 7 years ago

Great, I will be patiently waiting! :)

MichaelRoeder commented 7 years ago

I can confirm that I am encountering the same problem; I couldn't replicate the results from Table 8 either. I also tried to reproduce the correlation values for the NYT topics with the C_V coherence and Wikipedia as the reference corpus, but I got 0.781 instead of 0.803. So far, I couldn't find a reason why the results are different.

Sorry that I couldn't shed more light on this problem so far.

On which OS are you running Palmetto?

NadineH1990 commented 7 years ago

I'm using Windows 7 (64 bit). I got much lower values when trying to reproduce the correlation values for the NYT topics, but I probably did something wrong there myself.

MichaelRoeder commented 7 years ago

Feel free to reuse/check my correlation implementations. You can find them at https://github.com/AKSW/Palmetto/tree/master/palmetto/src/main/java/org/aksw/palmetto/evaluate/correlation
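If it helps with debugging, the correlation step can also be cross-checked against a generic, self-contained implementation and the output compared with what Palmetto's classes produce. Here is a minimal sketch of the textbook sample Pearson correlation; this is not Palmetto's code, and the human ratings in the example are made up for illustration:

```java
// Generic sample Pearson correlation for cross-checking results independently;
// this is a plain reference implementation, not Palmetto's own class.
public class PearsonCheck {

    public static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("Arrays must have the same non-zero length");
        }
        double meanX = 0, meanY = 0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= x.length;

        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // r = cov(X, Y) / (stddev(X) * stddev(Y))
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Toy usage with the three coherence scores from this thread and
        // made-up human ratings (illustration only, not the published data).
        double[] coherence = {0.52072, 0.75174, 0.73356};
        double[] ratings = {2.1, 2.9, 2.7};
        System.out.println("Pearson r = " + pearson(coherence, ratings));
    }
}
```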

ghost commented 6 years ago

Is there any more information available on what is causing the difference between the values in the paper and the ones calculated using the provided source?

I get similar but not exactly the same values as the ones reported in the original issue. This is using the Wikipedia index provided at http://139.18.2.164/mroeder/palmetto/ and compiling the library locally as a jar (I get these values using the provided jar as well). I am on a Linux system with Fedora 27. The values I get are:

0.51207 company, sell, corporation, own, acquire, purchase, buy, business, sale, owner
0.75174 age, population, household, female, family, census, live, average, median, income
0.73356 jewish, israel, jew, israeli, jerusalem, rabbi, hebrew, palestinian, palestine, holocaust

These are the same values as the ones returned by the web service; see the original issue.

I also checked out an older version (aa8b650be8b1470a183c34614a06a6c7c106309b) and compiled it. Using that version, I got exactly the same values as with the current version.

Is there a known commit of the library that returns the same values as the paper? That would greatly help in troubleshooting what is causing these differences.

Or could the difference depend on the version of the Wikipedia index? The version provided is dated May 2014, so it should not have changed, right?

MichaelRoeder commented 6 years ago

Sorry, I still couldn't figure out where the difference comes from. The implementation itself does not seem to cause the problem. I also made sure that, for the examples posted above, there is no influence from a lemmatizer in the preprocessing.

So there are two sources left:

1. The Wikipedia index. As far as I remember, the index that is online should be the one we used for the calculations. I know that we exchanged the index during the experiments: the old index was slightly larger because it contained the content of tables. We decided to remove tables from the Wikipedia documents because the words in single cells might not have a direct relation to each other. However, I think that happened before calculating the final numbers, so it shouldn't have any influence.
2. For the calculation of the numbers, Palmetto was used as a library by a small program that is not part of the GitHub repository. This program was more heavily optimized for the empirical exploration of the space of possible coherences, i.e., it tried to reuse intermediate results to reduce the overall runtime. It is possible that this program created a side effect that influenced the final results (a purely hypothetical sketch of such a side effect is shown below). Although I am not convinced that this happened, it is something I can check again.
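Purely as an illustration of what such a side effect could look like (this is invented code, not the actual experiment program): if intermediate statistics are cached in state that is not reset between topics, the computed values become dependent on the order in which the topics are evaluated.

```java
import java.util.HashMap;
import java.util.Map;

// Invented illustration of the suspected kind of side effect: an "optimized"
// runner that reuses intermediate results across topic evaluations.
public class LeakyRunnerSketch {

    // Cache of intermediate word statistics, shared across all topics.
    private final Map<String, Double> cache = new HashMap<>();

    // Mutable state that grows with every evaluated topic.
    private double totalWindows = 0;

    public double wordProbability(String word, double windowsWithWord, double windowsInTopicRun) {
        totalWindows += windowsInTopicRun;
        // BUG: the cached value depends on totalWindows at the time of the
        // first lookup, so later topics silently reuse stale probabilities.
        return cache.computeIfAbsent(word, w -> windowsWithWord / totalWindows);
    }
}
```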

In general, it seems like C_V also has a drawback, described in #12: it does not behave well when used on randomly generated word sets.

So, in the end, I would suggest using C_P, NPMI, or UCI for evaluating topics.
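For completeness, those measures can be queried through the same web service used earlier in this thread. Below is a small client sketch (Java 11+); note that the endpoint names `cp`, `npmi`, and `uci` are my assumption, by analogy with the `cv` endpoint shown in the original issue:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Queries the Palmetto web service for a coherence measure. The endpoint
// names other than "cv" are assumed to follow the same naming pattern.
public class PalmettoServiceClient {

    public static void main(String[] args) throws Exception {
        String topic = "company sell corporation own acquire purchase buy business sale owner";
        String measure = "npmi"; // assumed endpoint name, analogous to "cv"
        String url = "http://palmetto.aksw.org/palmetto-webapp/service/" + measure
                + "?words=" + URLEncoder.encode(topic, StandardCharsets.UTF_8);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // Print the raw response body returned by the service.
        System.out.println(measure + ": " + response.body());
    }
}
```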

Lehas-sudo commented 1 year ago

So is the stance that we should not use C_V to evaluate LDA topic models? Or only if the corpus size is small?

MichaelRoeder commented 1 year ago

The main issue with respect to our implementation is fixed with #81. It turned out that a parameter was implemented incorrectly and, hence, C_V showed strange behavior. After the fix, tests showed that C_V works as it should: although the exact values described in this issue still cannot be reproduced, the Pearson correlation values of C_V fit the values reported in the paper.

Does that answer your question?