dice-group / Palmetto

Palmetto is a quality measuring tool for topics
GNU Affero General Public License v3.0

Default Gamma set as 2 (should be 1) #81

Closed by jararap 1 year ago

jararap commented 1 year ago

Dear Michael, thank you for this wonderful library :thumbsup:. I was investigating the issue where C_v is negatively correlated with other metrics, and realised that the hyper-parameter ɣ is set to 2 rather than the 1 suggested in your paper. I believe this is the cause of #76 and #12.

Examples:

T1 = fan player game playoff play score team hit win season; C_v(ɣ=1): 0.8394, C_v(ɣ=2): 0.5629

T2 = accumulation fossil hydrocarbon methane mud oil petroleum sand sediment shell; C_v(ɣ=1): 0.7681, C_v(ɣ=2): 0.482, c_npmi: 0.1977

T3 = ambitious amnesty definite desirable distraction entail entrant funding unauthorized undermine; C_v(ɣ=1): 0.3822, C_v(ɣ=2): 0.6244, c_npmi: -0.3231

T2 and T3 are interesting because C_v(ɣ=1) and C_v(ɣ=2) contradict each other: ɣ=1 ranks T2 above T3, in agreement with c_npmi, while ɣ=2 ranks T3 above T2. As this is ongoing research, I am unable to show my data here, but I can confirm that, correlation-wise for Palmetto, C_v(ɣ=1) is positively correlated with the other metrics (except C_A, which I did not test) while C_v(ɣ=2) is negatively correlated with them.
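One possible reading of why ɣ matters so much, not confirmed anywhere in this thread: assuming C_v builds its context vectors from NPMI values raised to the power ɣ, as the paper describes, ɣ=2 squares each value and discards its sign, so negative associations start to look like positive ones. The sketch below is a toy model only. The NPMI matrices are made up, the names `context_vector`, `cosine` and `c_v_like` are hypothetical, and it skips Palmetto's probability estimation entirely.

```python
import math

def context_vector(npmi, word_indices, gamma):
    """For every word w_j, sum NPMI(w_i, w_j) ** gamma over the given words w_i."""
    n = len(npmi)
    return [sum(npmi[i][j] ** gamma for i in word_indices) for j in range(n)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def c_v_like(npmi, gamma):
    """Mean cosine between each single-word vector and the whole-topic vector."""
    n = len(npmi)
    topic_vec = context_vector(npmi, range(n), gamma)
    sims = [cosine(context_vector(npmi, [i], gamma), topic_vec) for i in range(n)]
    return sum(sims) / n

# Made-up NPMI matrices (assumption): diagonal is 1, off-diagonal values are
# positive for a coherent topic and negative for an incoherent one.
related = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
unrelated = [[1.0, -0.3, -0.3], [-0.3, 1.0, -0.3], [-0.3, -0.3, 1.0]]

for name, matrix in (("related", related), ("unrelated", unrelated)):
    for gamma in (1, 2):
        print(name, "gamma =", gamma, "score =", round(c_v_like(matrix, gamma), 4))
```

In this toy setting the incoherent topic scores higher with ɣ=2 than with ɣ=1, while the coherent topic scores lower, which is the same direction as the T2/T3 contradiction above.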

Unfortunately :weary:, it seems that ɣ has been set to 2 on GitHub since at least 2016. I am unsure whether the same was true of the endpoint when it was up. The Docker container provided also has ɣ set to 2.

Silver lining :smiley:: your paper is correct! C_v(ɣ=1) is usable and can be recommended again! I will still recommend Palmetto over gensim 👍

Cheers, JP

MichaelRoeder commented 1 year ago

Thank you for your interest in Palmetto and for pointing this out. It seems you are right: this looks like a bug :bug:. I will look into it.

Notes for myself:

MichaelRoeder commented 1 year ago

The new version is released and deployed. Thanks again for opening the ticket and reporting the issue. :+1: