IBM / Hestia-OOD

Independent evaluation set construction for trustworthy ML models in biochemistry
https://ibm.github.io/Hestia-OOD/
MIT License

AU-GOOD for target distribution #33

Closed RaulFD-creator closed 3 days ago

RaulFD-creator commented 1 month ago

Motivation

The final step of the pipeline takes a set of results expressed as a dictionary of {threshold: metrics_dict}, a target value to calculate the AU-GOOD with, a target dataset, and the set of similarity metrics used to calculate the partitions. It then calculates the similarity between the original data and the target dataset.

Possible implementation

  1. Calculate similarities between query and target distributions, using already implemented functions.
  2. Create a histogram of those similarities with the same minimum and maximum values as the partition thresholds, using the threshold step as the bin width.
  3. Normalise the histogram (counts / counts.sum())
  4. To get the AU-GOOD, take the dot product between the normalised counts and the metric values at each threshold. This is equivalent to sum(a*b), which is the discrete form of the AU-GOOD integral.
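
Steps 2 to 4 could be sketched roughly as follows. This is only an illustration, not the Hestia implementation: the function name `au_good`, its arguments, and the choice of treating each threshold as the left edge of its bin are assumptions. The similarities from step 1 are taken as an already computed 1-D array.

```python
import numpy as np


def au_good(results, similarities, metric):
    """Weight per-threshold metric values by the target similarity histogram.

    results:      {threshold: metrics_dict}, thresholds evenly spaced
    similarities: one similarity value per target entity (hypothetical input)
    metric:       key to read from every metrics_dict
    """
    thresholds = sorted(results)
    if len(thresholds) < 2:
        raise ValueError("need at least two thresholds to infer the step")
    values = np.array([results[t][metric] for t in thresholds], dtype=float)

    # Assumption: each threshold is the left edge of its bin, so one extra
    # edge is appended to obtain exactly one bin per threshold.
    step = thresholds[1] - thresholds[0]
    edges = np.append(thresholds, thresholds[-1] + step)

    # Similarities outside [edges[0], edges[-1]] are ignored by np.histogram.
    counts, _ = np.histogram(similarities, bins=edges)
    if counts.sum() == 0:
        raise ValueError("no target similarities fall inside the threshold range")

    weights = counts / counts.sum()          # step 3: normalise
    return float(np.dot(weights, values))    # step 4: sum(a * b)
```

For example, with `results = {0.0: {"mcc": 0.5}, 0.5: {"mcc": 1.0}}` and target similarities `[0.1, 0.2, 0.6, 0.7]`, each bin receives half of the weight and the function returns 0.75.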

Alternatives

4b. Calculate a*b and sum(a*b) separately, so that users can access the GOOD curve and plot it if they wish.
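
A minimal sketch of this alternative, assuming the normalised histogram counts and the per-threshold metric values are already available as aligned arrays. The name `good_curve` is hypothetical.

```python
import numpy as np


def good_curve(weights, values):
    """Return the per-threshold GOOD curve (a * b) and its sum (AU-GOOD)."""
    weights = np.asarray(weights, dtype=float)
    values = np.asarray(values, dtype=float)
    if weights.shape != values.shape:
        raise ValueError("weights and values must have the same shape")
    curve = weights * values          # element-wise a * b, one point per threshold
    return curve, float(curve.sum())  # the sum equals the dot product of a and b
```

Returning both lets a caller plot `curve` against the thresholds while still reporting the scalar AU-GOOD.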