audeering / audpsychometric

Analyse rater annotations

Fix agreement_numerical() for multiple stimuli #18

Closed hagenw closed 2 months ago

hagenw commented 2 months ago

Addresses the problem of wrong output shape of audpsychometric.agreement_numerical() as described in https://github.com/audeering/audpsychometric/pull/13#discussion_r1734784060.


hagenw commented 2 months ago

> I still think that these are drop-in replacements for single-item (reliability) coefficients proper, and I am still not convinced that they are optimally located in the gold standard module.

I'm not sure about this: we use these functions to add a confidence/agreement value to the gold standard (e.g. mean/median) that we calculate for each sample, whereas the reliability coefficients return a single value over all stimuli and raters. Or would it make sense to call them on a per-stimulus basis as well?
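The distinction can be sketched with plain numpy. This is a toy illustration only; the actual formula used by audpsychometric.agreement_numerical() may differ:

```python
import numpy as np

# Toy ratings matrix: rows = stimuli, columns = raters.
ratings = np.array(
    [
        [0.0, 0.0, 0.2, 0.4],
        [0.8, 0.9, 1.0, 0.7],
        [0.1, 0.9, 0.5, 0.3],
    ]
)

# Per-stimulus agreement, as used for the gold standard:
# one confidence value per stimulus, sketched here as
# 1 - population std along the rater axis.
agreement = 1 - np.nanstd(ratings, axis=1, ddof=0)
print(agreement.shape)  # one value per stimulus, i.e. (3,)

# A reliability coefficient proper would instead yield
# a single scalar over all stimuli and raters.
```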

hagenw commented 2 months ago

> The only thing that I am still wondering about for now is the calculation of the standard deviation: if we are calculating the sample standard deviation, then we divide by n-1, one less than the number of data values. The current implementation calls np.nanstd with ddof=0, i.e. the population standard deviation.
>
> I wonder about the motivation for this? My intuition would be that we cannot generalize to the entire population of raters here, and I would have guessed that the sample std would be appropriate. Should this motivation be documented, or should np.nanstd be parametrizable when it is called?

The original discussion is at https://gitlab.audeering.com/data/msppodcast/-/merge_requests/23#note_196359.

The reasoning goes like this: when a model is supposed to learn the confidence for a rating, it should depend only on the audio signal and not on the number of raters that judged the sample, and we have:

>>> import numpy as np
>>> # sample standard deviation
>>> np.std([0, 0, 0.2, 0.4], ddof=1)
0.19148542155126763
>>> np.std([0, 0, 0, 0, 0.2, 0.2, 0.4, 0.4], ddof=1)
0.17728105208558367
>>> # population standard deviation
>>> np.std([0, 0, 0.2, 0.4], ddof=0)
0.16583123951777
>>> np.std([0, 0, 0, 0, 0.2, 0.2, 0.4, 0.4], ddof=0)
0.16583123951777
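The invariance can also be checked directly, a minimal sketch using the values from the example:

```python
import numpy as np

# The population std (ddof=0) is unchanged when every rating is
# duplicated, while the sample std (ddof=1) shrinks as the number
# of raters grows.
ratings = [0, 0, 0.2, 0.4]
doubled = ratings * 2  # every rater's judgment counted twice

assert np.isclose(np.std(ratings, ddof=0), np.std(doubled, ddof=0))
assert np.std(doubled, ddof=1) < np.std(ratings, ddof=1)
```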
hagenw commented 2 months ago

To be in line with audpsychometric.agreement_categorical(), I added a note to the docstring that nan values are ignored.
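For illustration, np.nanstd simply drops missing ratings, so a stimulus judged by fewer raters still receives an agreement value (a minimal sketch, independent of the audpsychometric implementation):

```python
import numpy as np

# A NaN marks a rater who did not judge this stimulus;
# np.nanstd computes the std over the remaining ratings only.
with_nan = [0.0, 0.2, np.nan, 0.4]
complete = [0.0, 0.2, 0.4]

assert np.isclose(np.nanstd(with_nan, ddof=0), np.std(complete, ddof=0))
```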

hagenw commented 2 months ago

> I still think that these are drop-in replacements for single-item (reliability) coefficients proper, and I am still not convinced that they are optimally located in the gold standard module. I realize that this is a stab into a hornets' nest, but it might be worthwhile to discuss it - later.

Feel free to open an issue for that.

hagenw commented 2 months ago

> The only thing that I am still wondering about for now is the calculation of the standard deviation: if we are calculating the sample standard deviation, then we divide by n-1, one less than the number of data values. The current implementation calls np.nanstd with ddof=0, i.e. the population standard deviation. I wonder about the motivation for this? My intuition would be that we cannot generalize to the entire population of raters here, and I would have guessed that the sample std would be appropriate. Should this motivation be documented, or should np.nanstd be parametrizable when it is called?

The original discussion is at https://gitlab.audeering.com/data/msppodcast/-/merge_requests/23#note_196359.

The reasoning goes like this: when a model is supposed to learn the confidence for a rating, it should depend only on the audio signal and not on the number of raters that judged the sample, and we have:

>>> import numpy as np
>>> # sample standard deviation
>>> np.std([0, 0, 0.2, 0.4], ddof=1)
0.19148542155126763
>>> np.std([0, 0, 0, 0, 0.2, 0.2, 0.4, 0.4], ddof=1)
0.17728105208558367
>>> # population standard deviation
>>> np.std([0, 0, 0.2, 0.4], ddof=0)
0.16583123951777
>>> np.std([0, 0, 0, 0, 0.2, 0.2, 0.4, 0.4], ddof=0)
0.16583123951777

@ChristianGeng if you think this is not the correct approach, please open an issue, and we can discuss it there. In this pull request, I propose to focus on fixing the shape bug.

ChristianGeng commented 2 months ago

> @ChristianGeng if you think this is not the correct approach, please open an issue, and we can discuss it there. In this pull request, I propose to focus on fixing the shape bug.

As you say: major discussions are beyond the scope of the fix here. I will happily open a new, more conceptual issue once my thought process converges into something meaningful - if it ever does ;-)
I will check whether this is already approved, and otherwise do so.