bootphon / pygamma-agreement

Gamma Agreement in Python
MIT License

Reproducing results from the java implementation #19

Closed: azehe closed this issue 3 years ago

azehe commented 3 years ago

Thanks for the package!

Using the test data from test_data/aps.csv, the java web app gives me γ = 0.451034437799.

Using the same data in your implementation gives a different value for gamma:

from pyannote.core import Segment
from pygamma_agreement import Continuum, CombinedCategoricalDissimilarity

continuum = Continuum.from_csv("test_data/aps.csv")
dissim = CombinedCategoricalDissimilarity(list(continuum.categories), alpha=1, beta=1)
gamma_results = continuum.compute_gamma(dissim, precision_level=0.01)

print(f"The gamma for that annotation is {gamma_results.gamma}")

The gamma for that annotation is 0.5044930080375523

I've also tried my own data, where the results sometimes differ even more (negative vs. positive).

I'm using alpha=1 and beta=1, which, as I understand from the Gamma paper, seem to be the default values. However, I'm not sure whether these are the values used in the java implementation and haven't managed to find out.

Is there any parameter I'm missing or setting to a wrong value?

PS: Curiously, I just noticed that I'm also getting different results from the java web app and the java offline app, which gives me gamma=0.55

Rachine commented 3 years ago

Hi, thank you for the issue!

The software from Mathet et al. (https://gamma.greyc.fr/) does not use the parameters mentioned in the paper. We found the same discrepancy as you when we did the re-implementation.
The software uses alpha=1 and beta=3; these values were confirmed to us by the original authors. Besides, there may be slight differences (±0.01) from their implementation. We have one main suspect: the shuffling/sampling methodology they mention in their paper is hard to replicate without access to their code, so we made the most rational choices.
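For reference, here is the snippet from the first comment rerun with beta=3. This is a minimal sketch that mirrors the constructor call above; the exact signature may differ in other versions of the package:

from pygamma_agreement import Continuum, CombinedCategoricalDissimilarity

continuum = Continuum.from_csv("test_data/aps.csv")
# alpha=1, beta=3 are the values used by the gamma.greyc.fr software
dissim = CombinedCategoricalDissimilarity(list(continuum.categories), alpha=1, beta=3)
gamma_results = continuum.compute_gamma(dissim, precision_level=0.01)

print(f"The gamma for that annotation is {gamma_results.gamma}")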

azehe commented 3 years ago

Hi, thanks for the quick reply! With these parameters, I'm getting gamma=0.405, which is still a good bit off from the java result. With my own data, the difference is much larger: gamma=-0.07 (-0.13 <= gamma <= -0.02) for the java version and gamma=0.122 for the python version. I can upload a sample of my data if that helps. Any suggestions on how I could debug this further? Is that in the range that you would expect from the shuffling method?

Rachine commented 3 years ago

Yes, a sample of your data might indeed help us understand how you get this discrepancy!

> Hi, thanks for the quick reply! With these parameters, I'm getting gamma=0.405, which is still a good bit off from the java result. With my own data, the difference is much larger: gamma=-0.07 (-0.13 <= gamma <= -0.02) for the java version and gamma=0.122 for the python version. I can upload a sample of my data if that helps.

If I understood correctly, you obtained the negative gamma value of -0.07 with the java version?

> Any suggestions on how I could debug this further? Is that in the range that you would expect from the shuffling method?

What range are you referring to?

azehe commented 3 years ago

> Yes, a sample of your data might indeed help us understand how you get this discrepancy!

This is the data I'm currently using: continuum.csv.gz. Note that I'm just experimenting with this data, comparing a simple baseline method to manual annotations, so the agreement is expected to be low.

> If I understood correctly, you obtained the negative gamma value of -0.07 with the java version?

Exactly. It also gives a range (probably a kind of confidence interval), which is the -0.13 <= gamma <= -0.02 that I reported.

> What range are you referring to?

You said there could be slight differences from the original implementation; that's what I meant by "range".

Rachine commented 3 years ago

Hi,

After investigation, your low value of gamma might come from the fact that there are many splits in your pred timeline (a segment t1-tn transformed into t1-t2, t2-t3, ..., t(n-1)-tn). This type of error is heavily penalized by gamma, as it looks for an alignment. Besides, the gamma agreement was designed to measure agreement between two annotators, not exactly as a metric for ML systems. Therefore, we think that the chance estimation should not depend on the pred timeline, so that all systems can be compared. We implemented this option, ground_truth_annotators, but it is not documented (see the sketch below): https://github.com/bootphon/pygamma-agreement/blob/master/pygamma_agreement/continuum.py#L173
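A minimal sketch of how that option could be used, assuming ground_truth_annotators is a parameter of compute_gamma in the version linked above and accepts the names of the reference annotators ("annotator_ref" below is a hypothetical name standing in for your manual annotator):

from pygamma_agreement import Continuum, CombinedCategoricalDissimilarity

continuum = Continuum.from_csv("continuum.csv")
dissim = CombinedCategoricalDissimilarity(list(continuum.categories), alpha=1, beta=3)

# Only the manual annotator(s) are used for the chance estimation, so the
# expected disagreement does not depend on the pred timeline.
gamma_results = continuum.compute_gamma(
    dissim,
    precision_level=0.01,
    ground_truth_annotators=["annotator_ref"],  # hypothetical annotator name
)
print(gamma_results.gamma)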

As mentioned in this other issue (https://github.com/bootphon/pygamma-agreement/issues/16), we think the use of gamma as a metric remains an open research question.

ghost commented 3 years ago

Since the v0.2.0 update fixes the problems regarding differences between the java implementation and ours, and those differences are explained in the new "Issues" section of the documentation, this issue is outdated.