epurdom / GloScope


Negative values when calculating JS divergence with KNN density estimation #3

Closed VladimirShitov closed 1 year ago

VladimirShitov commented 1 year ago

Hi, and thanks again for your package! Looking forward to a new Bioconductor release with it.

I ran gloscope on the COMBAT dataset with the parameters dens = "KNN" and dist_mat = "JS". Surprisingly, quite a few values in the resulting matrix turned out to be negative. Here is a histogram of the flattened matrix:

[Figure: histogram of the flattened divergence matrix, showing a visible mass of negative values]

The interpretation of these values is unclear, and I wonder if this is expected behavior.
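For reference, the call was essentially the following (a simplified sketch; the object names are placeholders rather than the actual COMBAT objects):

```r
library(GloScope)

# embedding: matrix of per-cell coordinates (e.g. PCA), one row per cell
# sample_ids: vector assigning each cell to its sample
dist_result <- gloscope(embedding, sample_ids, dens = "KNN", dist_mat = "JS")

# flatten the symmetric divergence matrix and inspect its histogram
hist(dist_result[lower.tri(dist_result)])
```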

wtorous commented 1 year ago

Hello Vladimir,

I apologize for the long delay; I did not have notifications set up correctly. These negative values are expected behavior for kNN density estimation; see Figure 7 in "On Accuracy of PDF Divergence Estimators and Their Applicability to Representative Data Sampling" by Budka, Gabrys, and Musial (2011) for an illustration.

We chose to leave these negative values in the output matrix and allow users to decide how to handle them, for instance by rounding to a small positive value or censoring with NA. We have updated the vignette to be more explicit about this choice. Thank you for your continued interest in our package!
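In code, both options are a one-line replacement on the returned matrix. A minimal sketch, where `dist_result` stands in for the gloscope() output:

```r
# Option 1: round negative divergences up to zero (or a small positive value)
dist_clipped <- pmax(dist_result, 0)

# Option 2: censor negative divergences with NA
dist_censored <- dist_result
dist_censored[dist_censored < 0] <- NA
```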

Sincerely, Will

VladimirShitov commented 1 year ago

Thank you, Will! This is fair.

As far as I understand from Eq. 16 in the paper, and from the code, negative terms appear in the sum when the distances to the nearest neighbors in a different sample are smaller than those to the nearest neighbors in the same sample. I believe that for most applications (e.g., clustering of the samples), setting the distance to 0 would make sense in such cases.
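To make this concrete, here is a minimal sketch of a standard kNN estimator of KL divergence (Wang, Kulkarni & Verdú, 2009) using the FNN package; the log-ratio terms go negative exactly when the across-sample neighbor is closer than the within-sample one. This illustrates the estimator and is not GloScope's exact implementation:

```r
library(FNN)  # knn.dist() and knnx.dist() for nearest-neighbour distances

# kNN estimate of KL(P || Q) from samples X ~ P (n x d) and Y ~ Q (m x d)
knn_kl <- function(X, Y, k = 5) {
  n <- nrow(X); m <- nrow(Y); d <- ncol(X)
  # rho: distance from each x_i to its k-th nearest neighbour within X
  rho <- knn.dist(X, k = k)[, k]
  # nu: distance from each x_i to its k-th nearest neighbour in Y
  nu <- knnx.dist(Y, X, k = k)[, k]
  # each log(nu / rho) term is negative when the other sample's neighbour
  # is closer than the same sample's; enough such terms push the sum below 0
  (d / n) * sum(log(nu / rho)) + log(m / (n - 1))
}
```

When two samples overlap heavily, the nu/rho ratios hover around 1, so the estimate can land on either side of zero.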

wtorous commented 1 year ago

I agree with your interpretation that negative divergences arise when the within-sample nearest-neighbor distances are generally larger than the across-sample distances. Using a zero divergence seems reasonable to me here, with the caveat not to interpret this as the distributions being exactly equal.