OML-Team / open-metric-learning

Metric learning and retrieval pipelines, models and zoo.
https://open-metric-learning.readthedocs.io/en/latest/index.html
Apache License 2.0
880 stars 61 forks

Refactoring for retrieval metrics #240

Closed AlekseySh closed 1 year ago

AlekseySh commented 1 year ago

The scope of work:

dapladoc commented 1 year ago

@AlekseySh

What do you think about the following docstring format? Is it ok, or is it too much?

[Screenshot: Screenshot_20221126_160110 — proposed docstring format]

AlekseySh commented 1 year ago

@dapladoc Hey!

We have thought about putting formulas into docstrings. It seems like a trade-off: readability for developers vs. readability for users. @DaloroAT, what do you think?

As for code examples, I like the idea, but only if we have a way to test them automatically. Perhaps there are some existing tools for this.

PS. In this particular case, I have a feeling that the math overcomplicates the understanding of the metric. What about this: "The metric calculates the percentage of positive distances higher than a given q-th percentile of negative distances"?
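To make that one-sentence description concrete, here is a toy sketch (illustrative values, not OML code; note that `torch.quantile` takes the percentile as a fraction):

```python
import torch

pos_dist = torch.tensor([0.0, 1.0, 5.0, 9.0])  # distances within the same class
neg_dist = torch.tensor([3.0, 4.0, 6.0, 7.0])  # distances between different classes

q = 0.5  # the given q-th percentile, as a fraction for torch.quantile
threshold = torch.quantile(neg_dist, q)             # 50th percentile of negatives: 5.0
metric = (pos_dist > threshold).float().mean()      # share of positives above it
print(metric.item())  # 0.25: one of four positive distances exceeds 5.0
```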

AlekseySh commented 1 year ago

Off-topic: But these formulas were useful at least for the following. One of the goals of OML is to perform research and probably publish some results. I am working on experiments for a surrogate precision loss now. The idea is that you can express precision@k using the Heaviside function, then you replace it with sigmoid with temperature and you can differentiate it. I am thinking now about creating a differentiable version of FNMR@FMR thanks to your formulas.
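The relaxation described above can be sketched as follows (a toy illustration of the idea, not OML code; the function name and temperature value are arbitrary):

```python
import torch

def soft_fnmr(pos_dist: torch.Tensor, threshold: float,
              temperature: float = 0.5) -> torch.Tensor:
    # FNMR(T) = mean_i H(u_i - T). Replacing the Heaviside step H with a
    # temperature-scaled sigmoid makes the metric differentiable w.r.t. distances;
    # sigmoid(x / t) -> H(x) as t -> 0.
    return torch.sigmoid((pos_dist - threshold) / temperature).mean()

pos_dist = torch.tensor([0.0, 1.0, 5.0, 9.0], requires_grad=True)
loss = soft_fnmr(pos_dist, threshold=3.0)
loss.backward()        # gradients flow, unlike with a hard comparison
print(loss.item())     # close to the hard FNMR of 0.5 at this temperature
```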

DaloroAT commented 1 year ago

For me, there is no difference between developers and users in terms of OML. Everyone in OML is the same (in contrast to, say, torch developers and users). I like @dapladoc's suggestion; we also thought about it previously, but we had higher-priority tasks before the first release.

We probably don't need such detail for every object in the library (utils, patching, etc.), but for hyperparameters it's a good idea.

Also, if we have a long description for a function, we can use the following pattern:

def crazy_function(a, b):
    ...
crazy_function.__doc__ = crazy_function_doc

And then we can test crazy_function_doc in the same way as we test README snippets.
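One stdlib option for such testing (a sketch of the idea, not a committed approach) is `doctest`, which can collect and run the examples embedded in a `__doc__` attached this way:

```python
import doctest

def crazy_function(a, b):
    return a + b

# the doc is kept next to the function and then attached, as suggested above
crazy_function_doc = """
Example:
    >>> crazy_function(2, 3)
    5
"""
crazy_function.__doc__ = crazy_function_doc

# collect the examples from __doc__ and execute them
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner()
for test in finder.find(crazy_function, module=False,
                        globs={"crazy_function": crazy_function}):
    runner.run(test)
assert runner.failures == 0 and runner.tries > 0
```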

DaloroAT commented 1 year ago

Or even better.

def crazy_function(a, b):
    """
    some actions ...

    Args:
        a: param1
        b: param2
    """
    ...

crazy_function.__doc__ = crazy_function.__doc__ + extra_doc_with_examples

Then we test extra_doc_with_examples. It's fine for developers to see a small snippet with the args, and nice for users to see a tested example.

dapladoc commented 1 year ago

@DaloroAT And where should extra_doc_with_examples be located? Somewhere in docs/readme/? Or is it a string right in the same module as crazy_function?

AlekseySh commented 1 year ago

@dapladoc As far as I remember @DaloroAT's code from another project, he put it next to the function

AlekseySh commented 1 year ago

@dapladoc @DaloroAT If you want to continue the discussion about how we check code examples, let's do it in a dedicated issue: https://github.com/OML-Team/open-metric-learning/issues/241. If we agree on an idea, we can implement it in a dedicated PR.

As for the current scope of work, let's leave the example as @dapladoc suggested here: https://github.com/OML-Team/open-metric-learning/issues/240#issuecomment-1328026114 , but without testing for now

DaloroAT commented 1 year ago

For me, the following options are good: 1) Keep it right below the function:

def func1(a, b):
    """
    Obligatory description for func1...
    Args:
        a: param1
        b: param2
    """
    ...

extra_docs_func1 = \
"""
Formulae for func1...
"""
func1.__doc__ = func1.__doc__ + extra_docs_func1

def func2(c):
    """
    Obligatory description for func2...
    Args:
        c: param1
    """
    ...

extra_docs_func2 = \
"""
Extra comments with snippets for func2...
"""
func2.__doc__ = func2.__doc__ + extra_docs_func2

2) Move all extra_docs of each function to a separate file docs.py at the same level, then import them.

The first approach keeps the function's context in the same place, while the second separates computational code from documentation and keeps the files small. For me, option 1 is better.

AlekseySh commented 1 year ago

@DaloroAT , what do you think about latex formulas for the metric?

As one of the solutions, I can suggest keeping both: the formulas and the text description from https://github.com/OML-Team/open-metric-learning/issues/240#issuecomment-1328036668 (at the top of the docs). The formulas may be placed nearby via __doc__, as @DaloroAT suggested.

DaloroAT commented 1 year ago

LaTeX formulae are the best option. Of course, we can use a small description (docs in the body of the function) and a detailed description (extra_docs) at the same time.

DaloroAT commented 1 year ago

I hope readthedocs allows us to do these tricks :D

AlekseySh commented 1 year ago

@dapladoc could you show the source of the docstring above? I want to better understand what it looks like, to decide whether we need to split it as @DaloroAT suggested

dapladoc commented 1 year ago

@AlekseySh

The docstring:

```python
"""
Function to compute the False Non Match Rate (FNMR) value when the False Match
Rate (FMR) value is equal to ``fmr_vals``.

The metric calculates the percentage of positive distances higher than a given
:math:`q`-th percentile of negative distances.

Args:
    pos_dist: distances between samples from the same class
    neg_dist: distances between samples from different classes
    fmr_vals: Values of ``fmr`` (measured in percents) for which we compute the
        corresponding ``fnmr``. For example, if ``fmr_vals`` is (20, 40), we
        will calculate ``fnmr@fmr=20`` and ``fnmr@fmr=40``

Returns:
    Tensor of ``fnmr@fmr`` values.

Given a vector of :math:`N` distances between samples from the same classes,
:math:`u`, the false non-match rate (:math:`\\textrm{FNMR}`) is computed as the
proportion below some threshold, :math:`T`:

.. math::

    \\textrm{FNMR}(T) = \\frac{1}{N}\\sum\\limits_{i = 1}^{N}H\\left(u_i - T\\right) =
    1 - \\frac{1}{N}\\sum\\limits_{i = 1}^{N}H\\left(T - u_i\\right)

where :math:`H(x)` is the unit step function, and :math:`H(0)` is taken to be
:math:`1`.

Similarly, given a vector of :math:`N` distances between samples from different
classes, :math:`v`, the false match rate (:math:`\\textrm{FMR}`) is computed as
the proportion above :math:`T`:

.. math::

    \\textrm{FMR}(T) = 1 - \\frac{1}{N}\\sum\\limits_{i = 1}^{N}H\\left(v_i - T\\right) =
    \\frac{1}{N}\\sum\\limits_{i = 1}^{N}H\\left(T - v_i\\right)

Given some interesting false match rate values :math:`\\textrm{FMR}_k`, one can
find thresholds :math:`T_k` corresponding to :math:`\\textrm{FMR}` measurements

.. math::

    T_k = Q_v\\left(\\textrm{FMR}_k\\right)

where :math:`Q` is the quantile function, and evaluate the corresponding values
of :math:`\\textrm{FNMR}@\\textrm{FMR}\\left(T_k\\right) \\stackrel{\\text{def}}{=}
\\textrm{FNMR}\\left(T_k\\right)`.

See:

    `Biometrics Performance`_.
    `BIOMETRIC RECOGNITION: A MODERN ERA FOR SECURITY`_.

.. _Biometrics Performance:
    https://en.wikipedia.org/wiki/Biometrics#Performance
.. _`BIOMETRIC RECOGNITION: A MODERN ERA FOR SECURITY`:
    https://www.researchgate.net/publication/50315614_BIOMETRIC_RECOGNITION_A_MODERN_ERA_FOR_SECURITY

Example:
    >>> pos_dist = torch.tensor([0, 0, 1, 1, 2, 2, 5, 5, 9, 9])
    >>> neg_dist = torch.tensor([3, 3, 4, 4, 6, 6, 7, 7, 8, 8])
    >>> fmr_vals = (10, 50)
    >>> calc_fnmr_at_fmr(pos_dist, neg_dist, fmr_vals)
    tensor([0.4000, 0.2000])
"""
```
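For readers following the thread, here is a minimal sketch consistent with the formulas in the docstring above (not necessarily the exact OML implementation; the linear quantile interpolation is an assumption):

```python
import torch

def calc_fnmr_at_fmr(pos_dist: torch.Tensor, neg_dist: torch.Tensor,
                     fmr_vals=(1,)) -> torch.Tensor:
    # T_k = Q_v(FMR_k): thresholds are the fmr_vals-th percentiles of the
    # negative distances (torch.quantile with linear interpolation assumed)
    q = torch.tensor(fmr_vals, dtype=torch.float64) / 100.0
    thresholds = torch.quantile(neg_dist.double(), q)
    # FNMR(T_k) = mean_i H(u_i - T_k), with H(0) = 1, hence ">="
    fnmr = (pos_dist.double()[None, :] >= thresholds[:, None]).double().mean(dim=1)
    return fnmr

pos_dist = torch.tensor([0, 0, 1, 1, 2, 2, 5, 5, 9, 9])
neg_dist = torch.tensor([3, 3, 4, 4, 6, 6, 7, 7, 8, 8])
print(calc_fnmr_at_fmr(pos_dist, neg_dist, (10, 50)))
# tensor([0.4000, 0.2000], dtype=torch.float64)
```

At fmr=10 the threshold is 3.0, and 4 of 10 positive distances are at or above it; at fmr=50 the threshold is 6.0, leaving 2 of 10.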
dapladoc commented 1 year ago

Here are some references that might be helpful:
https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html
https://pytorch.org/docs/stable/generated/torch.pca_lowrank.html
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html?highlight=pca#sklearn.decomposition.PCA

AlekseySh commented 1 year ago

@dapladoc thanks for sharing. Yes, it's a bit tricky to read, but it's the same in other libraries. Thus, I am not sure we can benefit from splitting the docs into several parts via __doc__. @DaloroAT, what would you say?

DaloroAT commented 1 year ago

Ok, let's keep full docs in functions

AlekseySh commented 1 year ago

Okay then, it's time to create a PR and document the rest of the functions in the same format as @dapladoc did for FNMR