Open darked89 opened 2 years ago
Hi @darked89,
Unfortunately, Affinity Propagation does not "natively" support NaN representations in its clustering algorithm due to the similarity calculation which requires that:
s(i, j) > s(i, k) iff xi is more similar to xj than to xk
Comparisons involving NaN have no defined ordering (in fact, this software's type system defines a comparison with NaN to return None instead of an actual value).
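A minimal sketch of that point, assuming the `None` mentioned above refers to Rust's standard `partial_cmp` on floats. The helper name `more_similar` is hypothetical and is not part of affinityprop:

```rust
use std::cmp::Ordering;

/// Hypothetical helper: answers "is s(i, j) > s(i, k)?".
/// Returns None when either similarity is NaN, because the two
/// values then have no defined ordering.
fn more_similar(s_ij: f64, s_ik: f64) -> Option<bool> {
    s_ij.partial_cmp(&s_ik).map(|ord| ord == Ordering::Greater)
}
```

With ordinary values, `more_similar(-1.0, -4.0)` gives `Some(true)`. As soon as one side is NaN the result is `None`, so the requirement on s(i, j) and s(i, k) cannot be checked.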
Preprocessing the data to replace these values (for example, with median expression, with 0, or some other decimal value) and/or removing rows/cols may be your best bet.
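As a rough illustration of that preprocessing, here is a std-only sketch of two of the options: filling each NaN with the median of the observed values in its row, or dropping incomplete rows. The function names are hypothetical, and imputing per row (per gene) is an assumption; imputing per column would work the same way on the transposed matrix.

```rust
/// Median of the non-NaN values in `row`, or None if every value is NaN.
fn nan_median(row: &[f64]) -> Option<f64> {
    let mut observed: Vec<f64> = row.iter().copied().filter(|v| !v.is_nan()).collect();
    if observed.is_empty() {
        return None;
    }
    // Safe to unwrap: all NaN values were filtered out above.
    observed.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mid = observed.len() / 2;
    if observed.len() % 2 == 0 {
        Some((observed[mid - 1] + observed[mid]) / 2.0)
    } else {
        Some(observed[mid])
    }
}

/// Replace each NaN with its row's median; rows that are entirely NaN are dropped.
fn impute_row_median(rows: &[Vec<f64>]) -> Vec<Vec<f64>> {
    rows.iter()
        .filter_map(|row| {
            let m = nan_median(row)?;
            let filled: Vec<f64> = row.iter().map(|&v| if v.is_nan() { m } else { v }).collect();
            Some(filled)
        })
        .collect()
}

/// Keep only the rows that contain no NaN at all.
fn drop_nan_rows(rows: &[Vec<f64>]) -> Vec<Vec<f64>> {
    rows.iter()
        .filter(|row| row.iter().all(|v| !v.is_nan()))
        .cloned()
        .collect()
}
```

Either function returns a NaN-free matrix that can then be written out and clustered as usual.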
That being said, if there is a similarity calculation that can handle NaN input to return decimal values, I would be willing to implement it as part of this software.
Best,
Chris
Hi Chris,
Thank you for the explanation. Dropping the rows with NaNs, or replacing NaNs with 0s or with, say, the median/mean values of a column should be good enough in most cases. This is doable using popular libs in either Python or R, so imho there is no desperate need to have it inside affinityprop.
brainstorming:
There may be a more laborious way to fix NaN values (staying with the initial dataset of genes x samples):
I have no idea if this would give more accurate clustering of samples/genes. Still, it should be better for fixing NaNs in cases where some gene X in cluster sample Y has expression values far from the median for the whole set.
Best,
Darek Kedra
Hi @darked89,
If we consider a much smaller space (2D for example), we can imagine a scenario in which two clusters are present at similar x-values, but stacked on top of each other along the y-axis. In this case, attempting to calculate similarity as you have laid out would create situations in which we either 1) arbitrarily choose a cluster to assign the value to, or 2) split the difference of the y-coordinate, potentially adding a third cluster.
Ultimately, the nan reduces the row's dimensions, and these cases could extrapolate to higher dimensions where we have a plane/hyperplane of potential clusters for each data point, so I don't think this is the way to go.
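The 2D scenario can be put into numbers with a small sketch. The cluster centers, the point, and the function names below are made up for illustration only:

```rust
/// Squared distance using only the x-coordinate, i.e. what is left
/// of a 2D point when its y-coordinate is nan.
fn sq_dist_x_only(px: f64, center: (f64, f64)) -> f64 {
    (px - center.0).powi(2)
}

/// Ordinary squared Euclidean distance between two 2D points.
fn sq_dist(p: (f64, f64), center: (f64, f64)) -> f64 {
    (p.0 - center.0).powi(2) + (p.1 - center.1).powi(2)
}

/// Option 2 from the text: fill the missing y by splitting the
/// difference between the two candidate clusters.
fn split_the_difference(lower: (f64, f64), upper: (f64, f64)) -> f64 {
    (lower.1 + upper.1) / 2.0
}
```

Take hypothetical centers at (1, 0) and (1, 10) and a point with x = 1 and a missing y. On x alone the point is at distance 0 from both centers, so any assignment is arbitrary. Filling y with the midpoint 5 puts the point at squared distance 25 from each center, so it belongs to neither cluster.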
I have implemented a "warn-but-allow" system in this repo's v0.3.0-alpha branch. This version allows users to include nan values in their input (so long as the row is not all nan).
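A "warn-but-allow" check along these lines could look like the sketch below. This is not the code in the v0.3.0-alpha branch; `RowStatus` and `classify_row` are hypothetical names, and what the caller does with each status (log a warning, reject the input) is an assumption.

```rust
/// Hypothetical classification of one input row.
#[derive(Debug, PartialEq)]
enum RowStatus {
    /// No nan values: accept silently.
    Clean,
    /// Some nan values (count attached): accept, but warn the user.
    HasNan(usize),
    /// Every value is nan: nothing to compare, so reject.
    AllNan,
}

fn classify_row(row: &[f64]) -> RowStatus {
    let n_nan = row.iter().filter(|v| v.is_nan()).count();
    if n_nan == 0 {
        // Note: an empty row also lands here; a real parser would
        // likely reject it earlier.
        RowStatus::Clean
    } else if n_nan == row.len() {
        RowStatus::AllNan
    } else {
        RowStatus::HasNan(n_nan)
    }
}
```

A parser could call `classify_row` on each row as it is read, print a warning for `HasNan`, and stop with an error on `AllNan`.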
I have added nan-safe similarity calculations that drop nan values prior to computing the final result:
let mut row_diff = a - b; // nan values may be present in each row
row_diff.map_inplace(|_a| *_a = (*_a).powi(2)); // nan values may be present
row_diff.mapv_inplace(|v| if v.is_nan() { 0.0 } else { v }); // Any nan values are replaced by 0
-1.0 * row_diff.sum() // Total similarity does not include comparisons from nan, since these are all 0
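For readers without the array crate at hand, the same idea can be sketched with the standard library only. This is an equivalent written on plain slices, not the branch's code, and the function name is hypothetical:

```rust
/// Negative squared Euclidean distance that skips every dimension
/// where either input is nan (the difference is nan in that case).
fn nan_safe_similarity(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len(), "rows must have the same length");
    let sum_sq: f64 = a
        .iter()
        .zip(b)
        .map(|(x, y)| (x - y).powi(2)) // nan if x or y is nan
        .filter(|d| !d.is_nan())       // drop those dimensions
        .sum();
    -sum_sq
}
```

For the rows `[1.0, NaN, 3.0]` and `[2.0, 5.0, NaN]`, only the first dimension is compared, giving a similarity of -1.0.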
I would be interested to see how this impacts predictions. In the tests I have added, I note that the predictive ability drops when high percentages of rows have nan in their input.
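There is a small tradeoff: because the similarity here is a negated sum of squares, 0 is the highest value a comparison can contribute, so replacing nan with 0 can inflate the total similarity. An argument could be made to replace the dropped-nan values with the preference rather than with 0.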
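The tradeoff is easy to see when the number of dimensions that were actually compared is reported next to the similarity. The sketch below is an illustration only; `similarity_with_coverage` is a hypothetical name and the rows are made up.

```rust
/// Returns (similarity, number of dimensions compared), skipping
/// any dimension where either input is nan.
fn similarity_with_coverage(a: &[f64], b: &[f64]) -> (f64, usize) {
    let mut sum_sq = 0.0;
    let mut compared = 0;
    for (x, y) in a.iter().zip(b) {
        let d = x - y;
        if !d.is_nan() {
            sum_sq += d * d;
            compared += 1;
        }
    }
    (0.0 - sum_sq, compared)
}
```

Comparing `[1.0, NaN, NaN]` with `[1.0, 50.0, 90.0]` yields a similarity of 0 from a single compared dimension, while the fully observed neighbor `[2.0, 49.0, 91.0]` scores -3 against the same row. The sparse row therefore looks like the better match even though most of it is unknown.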
For now, this version will remain in the v0.3.0-alpha branch. It is installable with the command:
cargo install --git https://github.com/cjneely10/affinityprop --branch v0.3.0-alpha
I'd be curious to see how this impacts real-world data.
Thanks, Chris
Hello,
I have a fairly large gene expression matrix (~20k genes, 2k samples) with luckily just a handful of NA values. At the moment affinityprop just stops and reports a parsing error if the input contains an NA value anywhere. I just dropped a small number (<20) of the 20k rows and kept going, so this is not urgent by any means. At the same time, I expect to get data with many more NAs, so figuring out how to handle these better (if possible) is important.
Apart from just dropping either rows (genes at this point) or columns (patient data), is there some other way of clustering such data? I am reluctant to replace NAs with say median expression value for the gene since this may interfere with clustering.
Would it be possible to get something like an input data parsing log:
Best wishes,
Darek Kedra