OML-Team / open-metric-learning

Metric learning and retrieval pipelines, models and zoo.
https://open-metric-learning.readthedocs.io/en/latest/index.html
Apache License 2.0

Decrease memory footprint of computing FMR metric #251

Closed AlekseySh closed 2 months ago

AlekseySh commented 1 year ago

Initial discussion is in https://github.com/OML-Team/open-metric-learning/pull/250

AlekseySh commented 1 year ago

from @dapladoc:

I thought about algorithms for finding quantiles online, in a batch mode. I found that there is the P^2 algorithm (link), and here is a discussion about it (link). Then I found more articles about quantile estimation: see the references to this article (link). It looks like all these algorithms work only element-by-element, so there is no way to parallelize them. I'm not a specialist in this, though. Maybe it could be done efficiently in Cython. I'll try to implement P^2 (or some other algorithm) in Cython.
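For reference, here is a minimal pure-Python sketch of the P^2 estimator mentioned above (Jain & Chlamtac, 1985). This is an illustrative sketch, not OML code, and as noted it consumes observations one at a time:

```python
import numpy as np

class P2Quantile:
    """Streaming quantile estimator (P^2 algorithm, Jain & Chlamtac, 1985).

    Keeps only 5 markers instead of storing all observations.
    """

    def __init__(self, p: float):
        self.p = p                                   # target quantile, e.g. 0.001
        self.q = []                                  # marker heights
        self.n = [0, 1, 2, 3, 4]                     # actual marker positions (0-indexed)
        self.np_ = [0, 2 * p, 4 * p, 2 + 2 * p, 4]   # desired positions for n = 5
        self.dn = [0, p / 2, p, (1 + p) / 2, 1]      # desired-position increments

    def update(self, x: float) -> None:
        if len(self.q) < 5:                          # bootstrap with the first 5 observations
            self.q.append(x)
            if len(self.q) == 5:
                self.q.sort()
            return
        # find the cell containing x, extending the extreme markers if needed
        if x < self.q[0]:
            self.q[0] = x
            k = 0
        elif x >= self.q[4]:
            self.q[4] = x
            k = 3
        else:
            k = max(i for i in range(4) if self.q[i] <= x)
        for i in range(k + 1, 5):
            self.n[i] += 1
        for i in range(5):
            self.np_[i] += self.dn[i]
        # move the inner markers towards their desired positions
        for i in (1, 2, 3):
            d = self.np_[i] - self.n[i]
            if (d >= 1 and self.n[i + 1] - self.n[i] > 1) or (d <= -1 and self.n[i - 1] - self.n[i] < -1):
                d = int(np.sign(d))
                qp = self._parabolic(i, d)
                if self.q[i - 1] < qp < self.q[i + 1]:
                    self.q[i] = qp
                else:                                # fall back to linear interpolation
                    self.q[i] += d * (self.q[i + d] - self.q[i]) / (self.n[i + d] - self.n[i])
                self.n[i] += d

    def _parabolic(self, i: int, d: int) -> float:
        n, q = self.n, self.q
        return q[i] + d / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
            + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
        )

    @property
    def value(self) -> float:
        return self.q[2] if len(self.q) == 5 else float(np.median(self.q))
```

With `p = 0.001` this would track the 0.1% quantile of the negative distances while storing only five markers, at the cost of the element-by-element update loop noted above.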

Another solution is to approximate the quantile (and hence fnmr@fmr) by means of a histogram that can be computed incrementally. I can try to estimate the accuracy of this approach on some synthetic data.
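A minimal sketch of this histogram idea, assuming the distance range `[lo, hi)` is known up front (e.g. `[0, 2]` for cosine distances); the class and its bounds are illustrative, not part of OML:

```python
import torch

class StreamingQuantileHist:
    """Approximate quantiles from a fixed-bin histogram filled batch by batch.

    Accuracy is bounded by the bin width, but memory is O(n_bins)
    regardless of how many distances are seen.
    """

    def __init__(self, lo: float, hi: float, n_bins: int = 10_000):
        self.lo, self.hi, self.n_bins = lo, hi, n_bins
        self.counts = torch.zeros(n_bins, dtype=torch.long)

    def add_batch(self, dists: torch.Tensor) -> None:
        # map each distance to its bin and accumulate counts
        idx = ((dists - self.lo) / (self.hi - self.lo) * self.n_bins).long()
        idx = idx.clamp_(0, self.n_bins - 1)
        self.counts += torch.bincount(idx, minlength=self.n_bins)

    def quantile(self, q: float) -> float:
        # invert the empirical CDF and return the bin center
        cdf = torch.cumsum(self.counts, dim=0).float() / self.counts.sum()
        bin_idx = int(torch.searchsorted(cdf, torch.tensor(q)))
        return self.lo + (bin_idx + 0.5) * (self.hi - self.lo) / self.n_bins
```

Unlike the element-by-element P^2 update, `add_batch` is fully vectorized, which is what makes the incremental histogram attractive here.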

At the moment I don't have other ideas, but I will think about it, and maybe something new will come up.

AlekseySh commented 1 year ago

from me:

@dapladoc thank you for your thoughts and links! Please don't start working on a Cython implementation (unless you really want to). I am afraid that the possible profit is way less than the potential effort needed to implement such an algorithm, so let's just discuss the possibilities for now.

I like the idea of an approximation. I also thought about an even simpler solution: if the matrix is too big, we can consider only a part of the pairwise distances. For example, randomly pick 10% of them or something like this.

It's still unclear what the problem actually is: the footprint of copying the distances, or the memory overhead inside the quantile algorithm itself.

dapladoc commented 1 year ago

> I like the idea of an approximation. I also thought about an even simpler solution: if the matrix is too big, we can consider only a part of the pairwise distances. For example, randomly pick 10% of them or something like this.

I don't like this idea: it is a very uncontrollable approach, and I think it could have very high variance.

Here is a small simulation. I drew pos_dist and neg_dist from gamma distributions. The size of pos_dist is 1,000, and the size of neg_dist is 1,000,000. Then, in a loop, I randomly take 10% of the negative distances, evaluate fnmr@fmr, and store it. Afterwards, I plot histograms of the fnmr@fmr approximations and the true fnmr@fmr for different fmr_vals.

The simulation of fnmr@fmr approximation on random subsets:

```python
import numpy as np
import torch
from matplotlib import pyplot as plt
from tqdm import tqdm

from oml.functional.metrics import calc_fnmr_at_fmr  # assumed import path

pos_dist = torch.from_numpy(np.random.gamma(shape=1, scale=1, size=1000))
neg_dist = torch.from_numpy(np.random.gamma(shape=6, scale=1, size=1_000_000))
fmr_vals = (0.1, 0.5, 1, 5)

fnmr_true = calc_fnmr_at_fmr(pos_dist, neg_dist, fmr_vals).numpy()

fnmr_approx = []
for i in tqdm(range(10000)):
    # keep a random ~10% subset of the negative distances
    _neg_dist = neg_dist[torch.rand(neg_dist.shape) > 0.9]
    fnmr_approx.append(calc_fnmr_at_fmr(pos_dist, _neg_dist, fmr_vals).tolist())
fnmr_approx = np.array(fnmr_approx)

fig, ax = plt.subplots(len(fnmr_true))
for i in range(len(fnmr_true)):
    ax[i].hist(fnmr_approx[:, i], bins=100)
    ax[i].axvline(fnmr_true[i], color='r')
    ax[i].set_title(f'fmr_val = {fmr_vals[i]}')
    print(f'Std(fnmr@fmr({fmr_vals[i]})) = {fnmr_approx[:, i].std():.4e}')
    print(f'Bias(fnmr@fmr({fmr_vals[i]})) = {np.mean(np.abs(fnmr_approx[:, i] - fnmr_true[i])):.4e}')
fig.set_size_inches(6.4, 4.8 * 3)
```

And here is its output:

```
Std(fnmr@fmr(0.1)) = 5.5389e-01
Bias(fnmr@fmr(0.1)) = 4.0927e-01
Std(fnmr@fmr(0.5)) = 3.2004e-01
Bias(fnmr@fmr(0.5)) = 2.2448e-01
Std(fnmr@fmr(1)) = 1.5432e-01
Bias(fnmr@fmr(1)) = 6.6530e-02
Std(fnmr@fmr(5)) = 9.9451e-02
Bias(fnmr@fmr(5)) = 7.1290e-02
```

Here are histograms of fnmr@fmr approximations. The red lines are "ground truth".

[figure: fnmr_approximations_errors]

Usually we are interested in fnmr@fmr for small values of fmr, and that is exactly where the bias is the biggest.

AlekseySh commented 1 year ago

The graphs above seem okay to me, since I consider fnmr@fmr only as an additional (auxiliary) metric. So, it's not super important whether 30% or 32% of the positive distances are smaller than 10% of the negative ones. Of course, if I were submitting these results somewhere, that wouldn't be accurate enough.

AlekseySh commented 1 year ago

If we want to work on this problem systematically, we first have to measure the following (for example, on the SOP dataset):

  1. Memory needed to compute the 3 basic metrics (CMC, precision, AP)
  2. Memory needed for item 1, plus copying the positive and negative distances
  3. Memory needed for item 2, plus computing the fnmr@fmr metric

Then it would be clear what should be optimized and how; a measurement skeleton is sketched below.
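A possible skeleton for such measurements, reading the process RSS via `psutil` (an assumed third-party dependency); the commented-out calls are hypothetical placeholders for the real pipeline steps:

```python
import os

import psutil  # assumed dependency: pip install psutil

_proc = psutil.Process(os.getpid())

def rss_mb() -> float:
    # resident set size of the current process, in MiB
    return _proc.memory_info().rss / 2**20

base = rss_mb()

# 1. compute the 3 basic metrics (CMC, precision, AP)
# compute_basic_metrics(...)  # placeholder for the real call
print(f"after basic metrics:     +{rss_mb() - base:.1f} MiB")

# 2. additionally copy out the positive and negative distances
# pos_dist, neg_dist = extract_pos_neg_dists(...)  # placeholder
print(f"after copying distances: +{rss_mb() - base:.1f} MiB")

# 3. additionally compute fnmr@fmr
# calc_fnmr_at_fmr(pos_dist, neg_dist, fmr_vals)
print(f"after fnmr@fmr:          +{rss_mb() - base:.1f} MiB")
```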

PS. IMHO, we can postpone the problem for now (https://github.com/OML-Team/open-metric-learning/pull/250 is enough) and focus on tasks with higher priority. In the end, it's probably not optimal to optimize the fmr metric without (at least) having examples from the verification domain.

dapladoc commented 1 year ago

> The graphs above seem okay to me, since I consider fnmr@fmr only as an additional (auxiliary) metric. So, it's not super important whether 30% or 32% of the positive distances are smaller than 10% of the negative ones. Of course, if I were submitting these results somewhere, that wouldn't be accurate enough.

It really depends on the domain. For security-related tasks (e.g. biometrics), 99.9% and 99.99% are noticeably different results.

> PS. IMHO, we can postpone the problem for now (#250 is enough) and focus on tasks with higher priority. In the end, it's probably not optimal to optimize the fmr metric without (at least) having examples from the verification domain.

I agree with you. More thorough memory measurements should be done in order to find possible ways to solve this issue.

AlekseySh commented 2 months ago

Optimized it a bit along with the OML 3.0 release (removed unneeded conversions): https://github.com/OML-Team/open-metric-learning/blob/main/oml/functional/metrics.py#L375
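For context, the core of that computation can be sketched entirely in torch, without numpy round-trips (an illustrative sketch, not the exact code at the link):

```python
import torch

def fnmr_at_fmr_sketch(pos_dist: torch.Tensor, neg_dist: torch.Tensor, fmr_vals) -> torch.Tensor:
    # fmr_vals are percentages, e.g. (0.1, 0.5, 1, 5)
    q = torch.tensor(fmr_vals, dtype=pos_dist.dtype, device=pos_dist.device) / 100
    # distance thresholds below which the given fraction of negative pairs falls
    thresholds = torch.quantile(neg_dist, q)
    # fraction of positive pairs rejected at each threshold, as a percentage
    return (pos_dist[None, :] >= thresholds[:, None]).float().mean(dim=1) * 100
```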