lindsaydbrin opened this issue 2 years ago
Hello,
Yeah, I think this is a great idea.

We always work with distances, so this should work with both cosine and Euclidean. We would need to invert the distance into a similarity in both cases.
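As a minimal sketch of that inversion (the helper name `distance_to_similarity` is my own, not part of the library): both Euclidean distance and cosine distance (1 − cosine similarity) are non-negative, with smaller meaning more similar, so a single monotone inversion could cover both.

```python
import numpy as np

def distance_to_similarity(distances: np.ndarray) -> np.ndarray:
    """Map non-negative distances to similarities in (0, 1].

    Works for both Euclidean distance and cosine distance
    (1 - cosine similarity): both are >= 0, and smaller means more similar.
    """
    return 1.0 / (1.0 + distances)

d = np.array([0.0, 0.5, 2.0])
print(distance_to_similarity(d))  # closest neighbor -> 1.0, farthest -> 1/3
```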
I think we could add a new parameter to `CumulativeGradientEstimator` to select the algorithm; by default we would use the Parzen window.

I hope this helps, and let's chat if there are issues with the implementation.
I'd like to add a new approach to estimating class overlap, but have questions about how best to implement this.
Currently, per-sample class overlap/probabilities are calculated (via `compute_expectation_with_monte_carlo()`) by:

This is then added to `expectation` (to build the S-matrix), which is eventually normalized by the total mass of the row.

This approach considers the total area of the kNNs (via Parzen-window normalization), but not the distances/similarities to each neighbor. One consequence of this approach (desired or not) is that the farthest neighbor can have a disproportionate impact on the measured overlap with another class. E.g., compare two cases where a sample in class A has all of its neighbors in class B: in (1), all neighbors are at a moderate distance, whereas in (2), all neighbors but one are very close, and that one is very far. If I'm understanding correctly (and correct me if I'm wrong), (1) would appear to have greater overlap because of the smaller Parzen window, whereas one might prefer (2) to be measured as having greater overlap, since most of its neighbors are much closer and in theory it should be harder to determine a class boundary.
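A toy numeric illustration of the two cases (my own sketch, not the library's code; I'm approximating the Parzen-window estimate as neighbor count divided by the window set by the farthest neighbor, and the alternative as mean similarity with `sim = 1 / (1 + distance)`):

```python
import numpy as np

# Distances from a class-A sample to its k=5 neighbors, all in class B.
case1 = np.array([1.0, 1.0, 1.0, 1.0, 1.0])  # (1) all moderately far
case2 = np.array([0.1, 0.1, 0.1, 0.1, 5.0])  # (2) four very close, one very far

def window_density(d):
    # Parzen-window-style estimate: neighbor count over the window
    # set by the farthest neighbor.
    return len(d) / d.max()

def mean_similarity(d):
    # Similarity-weighted view: average of sim = 1 / (1 + distance).
    return float(np.mean(1.0 / (1.0 + d)))

print(window_density(case1), window_density(case2))    # 5.0 vs 1.0
print(mean_similarity(case1), mean_similarity(case2))  # 0.5 vs ~0.76
```

Under the window-based view, case (1) looks like the stronger overlap; under the similarity-weighted view, case (2) ranks higher, matching the intuition above.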
For comparison, I'd like to try to calculate per-sample class overlap/probabilities by the following:
This would then be added to something similar to `expectation`, but divided by sample count rather than normalized by row mass, so that the relative similarity information would be maintained and one could compare across rows.

I.e., the original code does this:

where `probability` is 1. above and `probability_norm` is 2. above. This is then added to `expectation`, which is later normalized as `expectation[class_ix] /= expectation[class_ix].sum()`.
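For concreteness, a toy sketch of what the row-mass normalization does (my own fake numbers; `expectation` here is a made-up 2×2 matrix, not the library's actual data): rows with very different total mass end up identical.

```python
import numpy as np

# Fake S-matrix rows: same class ratio, very different total mass.
expectation = np.array([[8.0, 2.0],
                        [0.8, 0.2]])

for class_ix in range(len(expectation)):
    expectation[class_ix] /= expectation[class_ix].sum()

print(expectation)  # both rows are now [0.8, 0.2]; the magnitude is gone
```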
My suggestion is to do this:

where `sim_scores` is 3. above. This would then be added to `expectation`, which would be normalized as `expectation[class_ix] /= len(class_indices[class_ix])`.

Apologies for any confusion from variable names taken from my current PR rather than from the original code! I have this working (seemingly correctly) within the function, although it breaks code downstream as described below.
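A toy sketch of the proposed count-based normalization (my own fake numbers; the `class_indices` mapping is hypothetical): dividing by sample count preserves magnitude differences between rows, so rows stay comparable.

```python
import numpy as np

# Fake accumulated similarity scores per class.
expectation = np.array([[8.0, 2.0],
                        [0.8, 0.2]])
# Hypothetical bookkeeping: 10 Monte Carlo samples per class.
class_indices = {0: list(range(10)), 1: list(range(10))}

for class_ix in range(len(expectation)):
    expectation[class_ix] /= len(class_indices[class_ix])

print(expectation)  # [[0.8, 0.2], [0.08, 0.02]] -- rows keep their magnitudes
```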
I have a couple of implementation questions, basically:

`compute_expectation_with_monte_carlo()`?

Some thoughts:

`SimilarityArrays` dataclass).

`compute_expectation_with_monte_carlo()` really easily. If you want it to only be done with cosine similarity, I don't know whether you'd want to throw an error if the wrong distance metric were selected vs. silently ignoring it (which seems like a bad idea), or something else?
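On that last point, a minimal fail-fast sketch (the `overlap_method` parameter name and this validation are hypothetical, not existing API):

```python
SUPPORTED_METRICS = {"cosine"}

def check_metric(distance_metric: str, overlap_method: str) -> None:
    # Fail fast instead of silently ignoring an incompatible metric.
    if overlap_method == "similarity" and distance_metric not in SUPPORTED_METRICS:
        raise ValueError(
            f"overlap_method='similarity' requires one of {SUPPORTED_METRICS}, "
            f"got '{distance_metric}'."
        )

check_metric("cosine", "similarity")  # OK, no error raised
```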