Closed VladimirShitov closed 1 year ago
Hello Vladimir,
I apologize for the long delay, I did not have notifications correctly set up. These negative values are an expected behavior for kNN density estimation; see Figure 7 in "On Accuracy of PDF Divergence Estimators and Their Applicability to Representative Data Sampling" by Budka, Gabrys, and Musial (2011) for illustration.
We chose to leave these negative values in the output matrix and allow users to decide how to handle them, for instance by rounding them up to a small positive value or censoring them with NA. We have updated the vignette to be more explicit about this choice. Thank you for your continued interest in our package!
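The two handling options mentioned (rounding negatives up to a small positive value, or censoring them) can be sketched as follows. This is an illustrative Python/NumPy sketch, not gloscope's actual R code, and the divergence matrix here is made up; in R the equivalents would be `pmax()`/`ifelse()` with `NA`.

```python
import numpy as np

# Hypothetical symmetric divergence matrix containing a negative kNN estimate.
div = np.array([[0.00, 0.80, -0.05],
                [0.80, 0.00, 0.30],
                [-0.05, 0.30, 0.00]])

# Option 1: round negative entries up to a small positive value.
eps = 1e-6
rounded = np.where(div < 0, eps, div)

# Option 2: censor negative entries with NaN (Python's analogue of R's NA).
censored = np.where(div < 0, np.nan, div)
```

Option 1 keeps the matrix usable by downstream methods that require non-negative dissimilarities; option 2 makes the unreliable entries explicit but requires downstream tools that tolerate missing values.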
Sincerely, Will
Thank you Will! This is fair.
As far as I understand from eq. 16 in the paper, and from the code, negative values in the sum appear when the distances to the nearest neighbors in the other sample are smaller than the distances to the nearest neighbors in the same sample. I believe that for most applications (e.g., clustering of the samples), setting the distance to 0 would make sense in such cases.
I agree with your interpretation that negative divergences arise when the within-sample nearest-neighbor distances are generally larger than the across-sample distances. Using a zero divergence seems reasonable to me here, with the caveat not to interpret this as the distributions being exactly equal.
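For intuition, here is a minimal Python/NumPy sketch (an assumption-laden illustration, not gloscope's actual implementation) of a 1-NN KL divergence estimator in the spirit of the estimator discussed above: each summand log(nu_i / rho_i) is negative exactly when the nearest neighbor in the other sample (nu_i) is closer than the nearest neighbor in the same sample (rho_i), so the total estimate can dip below zero, and clamping at zero implements the fix suggested here.

```python
import numpy as np

def knn_kl_estimate(x, y):
    """1-NN KL divergence estimate D(P || Q) from samples x ~ P, y ~ Q.

    Sketch of a Wang/Kulkarni/Verdu-style estimator:
        (d / n) * sum_i log(nu_i / rho_i) + log(m / (n - 1)),
    where rho_i is the distance from x_i to its nearest neighbor in x
    (excluding itself) and nu_i is the distance to its nearest neighbor
    in y. Negative log terms arise whenever nu_i < rho_i.
    """
    def nn_dist(a, b, exclude_self=False):
        # Brute-force pairwise squared distances; fine for small samples.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        if exclude_self:
            np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
        return np.sqrt(d2.min(axis=1))

    n, d = x.shape
    m = y.shape[0]
    rho = nn_dist(x, x, exclude_self=True)
    nu = nn_dist(x, y)
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))
```

With two samples drawn from the same distribution, the estimate fluctuates around zero and can land on either side of it; `max(0.0, knn_kl_estimate(x, y))` is the clamped version.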
Hi, and thanks again for your package! Looking forward to a new Bioconductor release with it.
I ran gloscope on the COMBAT dataset with parameters `dens = "KNN"` and `dist_mat = "JS"`. Surprisingly, quite a few values in the resulting matrix turned out to be negative. Here is the histogram of the flattened matrix:

[histogram of the flattened divergence matrix omitted]

The interpretation is unclear, and I wonder if this is expected behavior.
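For reference, a quick way to quantify how many entries are affected, sketched in Python/NumPy with a made-up matrix (the actual gloscope output is an R matrix, where `mean(mat[lower.tri(mat)] < 0)` would do the same):

```python
import numpy as np

# Hypothetical symmetric divergence matrix standing in for gloscope output.
mat = np.array([[0.00, 0.40, -0.02],
                [0.40, 0.00, 0.10],
                [-0.02, 0.10, 0.00]])

# Flatten the off-diagonal entries and compute the fraction that is negative.
off_diag = mat[~np.eye(mat.shape[0], dtype=bool)]
frac_negative = np.mean(off_diag < 0)
```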