Dref360 / spectral-metric

Code for the CVPR 2019 paper : Spectral Metric for Dataset Complexity Assessment
https://dref360.github.io/spectral-metric
MIT License

Additional/new approach to estimating class overlap #8

Open lindsaydbrin opened 2 years ago

lindsaydbrin commented 2 years ago

I'd like to add a new approach to estimating class overlap, but have questions about how best to implement this.

Currently, per-sample class overlap/probabilities are calculated (via compute_expectation_with_monte_carlo()) by:

  1. Tallying class count for the kNNs (into an array)
  2. Dividing the entire array by the Parzen-window volume

This is then added to expectation (to build the S-matrix), which is eventually normalized by total mass of the row.

This approach considers the total area covered by the kNNs (via the Parzen-window normalization), but not the distance/similarity to each individual neighbor. One consequence of this approach (intended or not) is that the farthest neighbor can have a disproportionate impact on the measured overlap with another class. E.g., compare two cases where a sample in class A has all of its neighbors in class B: in (1) all neighbors are at a moderate distance, whereas in (2) all neighbors but one are very close and one is very far. If I'm understanding correctly (and correct me if I'm wrong), (1) would appear to have greater overlap because of its smaller Parzen window, whereas one might prefer (2) to be measured as having greater overlap, since most of its neighbors are much closer and it should in theory be harder to determine a class boundary.
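To make the sensitivity concrete, here is a toy 1-D sketch, under the assumption that the Parzen-window volume is governed by the distance to the farthest neighbor (`get_volume`'s exact behavior may differ; the helper below is illustrative only):

```python
def toy_parzen_density(distances):
    """Toy 1-D density: neighbor count divided by a window whose size
    is set by the farthest neighbor. This mimics (by assumption) how
    a Parzen window can be dominated by the farthest point; the real
    get_volume implementation may differ."""
    return len(distances) / max(distances)

# Case (1): all five class-B neighbors at a moderate distance.
density_1 = toy_parzen_density([0.5, 0.5, 0.5, 0.5, 0.5])      # 10.0
# Case (2): four neighbors very close, one very far.
density_2 = toy_parzen_density([0.05, 0.05, 0.05, 0.05, 2.0])  # 2.5

# The single far neighbor inflates case (2)'s window, so it is scored
# as *less* overlapping even though most neighbors are much closer.
assert density_1 > density_2
```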

For comparison, I'd like to try to calculate per-sample class overlap/probabilities by the following:

  3. Summing the kNN's cosine similarities per class (into an array) (i.e., summing similarities rather than tallying counts)

This would then be added to something similar to expectation, but divided by the per-class sample count rather than normalized by row mass, so that the relative similarity information is maintained and rows remain comparable.

I.e., the original code does this:

            probability = np.array(
                # kNN class proportions
                [(target_k == nghbr_class).sum() / k_nearest for nghbr_class in range(n_class)]
            )
            # Get the Parzen-window probability
            probability_norm = probability / get_volume(data[indices_k])  # Parzen-window normalized

where probability is 1. above, and probability_norm is 2. above. This is then added to expectation, which is later normalized as expectation[class_ix] /= expectation[class_ix].sum().

My suggestion is to do this:

            sim_scores = np.array([similarities_k[target_k == nghbr_class].sum() for nghbr_class
                                  in range(n_class)])

where sim_scores is 3. above. This would then be added to expectation, which would be normalized as expectation[class_ix] /= len(class_indices[class_ix]).
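Filling in the surrounding pieces, the proposed computation might look like the following self-contained sketch (the inputs are made up; the names `target_k` and `similarities_k` mirror the snippet above):

```python
import numpy as np

# Made-up per-sample inputs:
# target_k: classes of the k nearest neighbors of one sample.
# similarities_k: cosine similarities of those neighbors to the sample.
target_k = np.array([1, 1, 0, 1, 2])
similarities_k = np.array([0.9, 0.8, 0.3, 0.7, 0.1])
n_class = 3

# Sum similarities per neighbor class instead of tallying counts.
sim_scores = np.array([similarities_k[target_k == nghbr_class].sum()
                       for nghbr_class in range(n_class)])

# Class 1 dominates: its three neighbors are the closest ones.
print(sim_scores)
```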

Apologies for any confusion from variable names from my current PR rather than the original code! I have this working (seemingly correctly) within the function, although it breaks code downstream as described below.

I have a couple of implementation questions, basically:

Some thoughts:

Dref360 commented 2 years ago

Hello,

Yeah I think this is a great idea.

We always work with distances, so this should work with both cosine and Euclidean; we would need to invert the distances into similarities in both cases.
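One possible way to do that inversion, as a sketch (these transforms are illustrative choices, not code from the repo):

```python
import numpy as np

def distances_to_similarities(dists, metric):
    """Invert distances into similarities; the exact transform is a
    design choice, not taken from the repo."""
    dists = np.asarray(dists, dtype=float)
    if metric == "cosine":
        # Cosine distance is defined as 1 - cosine similarity.
        return 1.0 - dists
    # Generic monotone inversion for unbounded metrics (e.g. Euclidean).
    return 1.0 / (1.0 + dists)

cos_sim = distances_to_similarities([0.0, 0.5], "cosine")      # [1.0, 0.5]
euc_sim = distances_to_similarities([0.0, 1.0], "euclidean")   # [1.0, 0.5]
```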

I think we could add a new parameter to CumulativeGradientEstimator to select the algorithm. By default we would use the Parzen-Window.
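A possible shape for that parameter, as a hypothetical sketch (the name `overlap_method`, the defaults, and the helper method are all assumptions, not the actual `CumulativeGradientEstimator` API):

```python
import numpy as np

class CumulativeGradientEstimator:
    def __init__(self, k_nearest=5, overlap_method="parzen"):
        # Default to the Parzen-window algorithm; "similarity" selects
        # the proposed similarity-summing alternative.
        if overlap_method not in ("parzen", "similarity"):
            raise ValueError(f"Unknown overlap_method: {overlap_method!r}")
        self.k_nearest = k_nearest
        self.overlap_method = overlap_method

    def per_sample_scores(self, target_k, similarities_k, volume, n_class):
        """Per-sample class scores under the selected algorithm."""
        if self.overlap_method == "parzen":
            # Default: kNN class proportions normalized by window volume.
            counts = np.array([(target_k == c).sum() for c in range(n_class)])
            return counts / self.k_nearest / volume
        # Alternative: sum of cosine similarities per neighbor class.
        return np.array([similarities_k[target_k == c].sum()
                         for c in range(n_class)])
```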

  1. I think in your example, 2 would replace 3 when specified.
  2. The S matrix should still work as it is just a similarity matrix.
  3. We can run some benchmarks once we have the code running and decide.
  4. I would try to accommodate every distance metric supplied.

I hope this helps, and let's chat if there are issues with the implementation.