cjneely10 / affinityprop

Vectorized and Parallelized Affinity Propagation
GNU General Public License v3.0

handling NA values in the input #42

Open darked89 opened 2 years ago

darked89 commented 2 years ago

Hello,

I have a fairly large gene expression matrix (~20k genes, 2k samples) with, luckily, just a handful of NA values. At the moment, affinityprop simply stops and reports a parsing error if the input contains an NA value anywhere.

I just dropped a small number of rows (<20 out of 20k) and kept going, so this is not urgent by any means. At the same time, I expect to get data with many more NAs, so figuring out how to handle these better (if possible) is important.

Apart from just dropping either rows (genes at this point) or columns (patient data), is there some other way of clustering such data? I am reluctant to replace NAs with, say, the median expression value for the gene, since this may interfere with clustering.

Would it be possible to get something like an input data parsing log:

X rows out of Y with at least one NA value
P columns out of Q with at least one NA value

dropping X rows / bailing out if X/Y > some_threshold (10%? 1/3?)
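A minimal sketch of what such a parsing log could look like, written in plain Rust with no external crates. This is not affinityprop's code: `NanReport`, `nan_report`, `drop_nan_rows`, and the `max_fraction` parameter are hypothetical names, and the matrix is assumed to be row-major with rows of equal length.

```rust
/// Summary of missing values in a row-major matrix (hypothetical helper type).
#[derive(Debug, PartialEq)]
pub struct NanReport {
    pub rows_with_nan: usize,
    pub total_rows: usize,
    pub cols_with_nan: usize,
    pub total_cols: usize,
}

/// Count the rows and columns that contain at least one NaN.
/// Assumes every row has the same length as the first row.
pub fn nan_report(matrix: &[Vec<f64>]) -> NanReport {
    let total_rows = matrix.len();
    let total_cols = matrix.first().map_or(0, |row| row.len());
    let mut col_has_nan = vec![false; total_cols];
    let mut rows_with_nan = 0;
    for row in matrix {
        let mut row_has_nan = false;
        for (j, value) in row.iter().enumerate() {
            if value.is_nan() {
                row_has_nan = true;
                col_has_nan[j] = true;
            }
        }
        if row_has_nan {
            rows_with_nan += 1;
        }
    }
    NanReport {
        rows_with_nan,
        total_rows,
        cols_with_nan: col_has_nan.iter().filter(|&&flag| flag).count(),
        total_cols,
    }
}

/// Log the NaN counts, then drop rows containing NaN, or bail out if the
/// fraction of such rows exceeds `max_fraction` (hypothetical threshold).
pub fn drop_nan_rows(
    matrix: Vec<Vec<f64>>,
    max_fraction: f64,
) -> Result<Vec<Vec<f64>>, String> {
    let report = nan_report(&matrix);
    eprintln!(
        "{} rows out of {} with at least one NA value",
        report.rows_with_nan, report.total_rows
    );
    eprintln!(
        "{} columns out of {} with at least one NA value",
        report.cols_with_nan, report.total_cols
    );
    let fraction = if report.total_rows == 0 {
        0.0
    } else {
        report.rows_with_nan as f64 / report.total_rows as f64
    };
    if fraction > max_fraction {
        return Err(format!(
            "{:.1}% of rows contain NA values, above the allowed {:.1}%",
            fraction * 100.0,
            max_fraction * 100.0
        ));
    }
    Ok(matrix
        .into_iter()
        .filter(|row| !row.iter().any(|v| v.is_nan()))
        .collect())
}
```

Calling `drop_nan_rows(matrix, 0.1)` would correspond to the 10% threshold floated above; the right cutoff is a judgment call for the data at hand.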

Best wishes,

Darek Kedra

cjneely10 commented 2 years ago

Hi @darked89,

Unfortunately, Affinity Propagation does not "natively" support NaN representations in its clustering algorithm due to the similarity calculation which requires that:

s(i, j) > s(i, k) iff xi is more similar to xj than to xk

NaN comparisons have no defined ordering (in fact, this software's type system defines the output of a comparison with NaN to return None instead of an actual value).
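For illustration, this is how the Rust standard library behaves for floating-point comparisons. The wrapper name `compare_similarity` is hypothetical and is not taken from affinityprop; only `f64::partial_cmp` is a real standard-library call.

```rust
use std::cmp::Ordering;

/// Compare two similarity values. `None` means no ordering exists,
/// which happens whenever at least one operand is NaN.
pub fn compare_similarity(a: f64, b: f64) -> Option<Ordering> {
    a.partial_cmp(&b)
}
```

With two ordinary values the result is `Some(Ordering::Less)`, `Some(Ordering::Equal)`, or `Some(Ordering::Greater)`. If either argument is NaN the result is `None`, so the condition s(i, j) > s(i, k) cannot be evaluated.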

Preprocessing the data to replace these values (for example, with median expression, with 0, or some other decimal value) and/or removing rows/cols may be your best bet.

That being said, if there is a similarity calculation that can handle NaN input to return decimal values, I would be willing to implement it as part of this software.

Best,

Chris

darked89 commented 2 years ago

Hi Chris,

Thank you for the explanation. Dropping the rows with NaNs, or replacing NaNs with 0s or, say, median/mean values in a column, should be good enough in most cases. This is doable using popular libraries in Python or R, so imho there is no desperate need to have it inside affinityprop.

brainstorming:

There may be a more laborious way to fix NaN values (staying with the initial dataset of genes x samples):

  1. assume expression matrix (genes as rows, samples as columns)
  2. if any gene has a NaN in some column, drop that column from the matrix but store it in a separate one
  3. cluster all the remaining samples (columns) based on the similarity of the gene expression vectors
  4. go to the dropped samples matrix and order the columns from low->high number of NaNs
  5. take the column with the lowest number of NaNs (query_col from here on) and calculate its distance to the clusters from step 3, ignoring the genes that are NaN in that column
  6. after finding the closest cluster based on the non-NaN values, fill in the NaNs in query_col with the median values from that cluster
  7. repeat for all of the dropped columns
  8. cluster samples
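Steps 4 to 6 above can be sketched roughly as follows, in plain Rust. Several details are assumptions that the list leaves open: each sample is stored as a vector over genes, the distance to a cluster is taken as the squared Euclidean distance to the cluster's per-gene median profile, every cluster is non-empty and NaN-free, and ties go to the first cluster. All function names are hypothetical.

```rust
/// Median of a NaN-free slice; returns NaN for an empty slice.
pub fn median(values: &[f64]) -> f64 {
    if values.is_empty() {
        return f64::NAN;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("NaN-free input assumed"));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    }
}

/// Per-gene median profile of a cluster of complete samples.
pub fn cluster_profile(cluster: &[Vec<f64>]) -> Vec<f64> {
    let n_genes = cluster.first().map_or(0, |sample| sample.len());
    (0..n_genes)
        .map(|g| median(&cluster.iter().map(|s| s[g]).collect::<Vec<_>>()))
        .collect()
}

/// Step 4: order the dropped samples from fewest to most NaNs.
pub fn order_by_nan_count(samples: &mut [Vec<f64>]) {
    samples.sort_by_key(|s| s.iter().filter(|v| v.is_nan()).count());
}

/// Steps 5 and 6: find the closest cluster using only the observed genes of
/// `query`, then fill its NaNs from that cluster's median profile.
/// Returns the index of the chosen cluster, or `None` if there are no clusters.
pub fn impute_from_nearest_cluster(
    query: &mut [f64],
    clusters: &[Vec<Vec<f64>>],
) -> Option<usize> {
    let profiles: Vec<Vec<f64>> = clusters.iter().map(|c| cluster_profile(c)).collect();
    let mut best: Option<(usize, f64)> = None;
    for (idx, profile) in profiles.iter().enumerate() {
        // Squared Euclidean distance over the genes observed in `query`.
        let dist: f64 = query
            .iter()
            .zip(profile)
            .filter(|(q, _)| !q.is_nan())
            .map(|(q, p)| (q - p).powi(2))
            .sum();
        // Strict comparison: on a tie, the earlier cluster is kept.
        if best.map_or(true, |(_, d)| dist < d) {
            best = Some((idx, dist));
        }
    }
    let (idx, _) = best?;
    for (q, p) in query.iter_mut().zip(&profiles[idx]) {
        if q.is_nan() {
            *q = *p;
        }
    }
    Some(idx)
}
```

Note that the tie-breaking rule is arbitrary: when two clusters are equally close on the observed genes, the order of `clusters` decides which medians are used.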

I have no idea if this would give more accurate clustering of samples/genes. Still, it should be better for fixing NaNs in cases where some gene X in sample cluster Y has expression values far from the median of the whole set.

Best,

Darek Kedra

cjneely10 commented 2 years ago

Hi @darked89,

If we consider a much smaller space (2D, for example), we can imagine a scenario in which two clusters are present at similar x-values but stacked on top of each other along the y-axis. In this case, attempting to calculate similarity as you have laid out would create situations in which we either 1) arbitrarily choose a cluster to assign the value to, or 2) split the difference of the y-coordinate, potentially adding a third cluster.

Ultimately, a nan reduces the row's dimensionality, and these cases extend to higher dimensions, where we have a plane/hyperplane of potential clusters for each data point, so I don't think this is the way to go.
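The 2D scenario can be made concrete with a small sketch. The coordinates are made up for illustration, and `observed_sq_dist` is a hypothetical helper that measures distance using only the coordinates present in the query point.

```rust
/// Squared Euclidean distance using only the coordinates observed in `point`.
pub fn observed_sq_dist(point: &[f64], center: &[f64]) -> f64 {
    point
        .iter()
        .zip(center)
        .filter(|(p, _)| !p.is_nan())
        .map(|(p, c)| (p - c).powi(2))
        .sum()
}
```

Take cluster centers at `[1.0, 0.0]` and `[1.0, 10.0]` and a query point `[1.5, f64::NAN]`. On the observed x-coordinate the query is at squared distance 0.25 from both centers, so any choice between them is arbitrary. Filling the missing y with the midpoint 5.0 leaves the point equally far (25.25) from both, belonging to neither.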


I have implemented a "warn-but-allow" system in this repo's v0.3.0-alpha branch. This version allows users to include nan values in their input (so long as the row is not all nan).

I have added nan-safe similarity calculations that zero out nan terms before computing the final sum:

let mut row_diff = a - b;  // nan values may be present in each row
row_diff.map_inplace(|_a| *_a = (*_a).powi(2));  // nan values may be present
row_diff.mapv_inplace(|v| if v.is_nan() { 0.0 } else { v });  // Any nan values are replaced by 0
-1.0 * row_diff.sum()  // Total similarity does not include comparisons from nan, since these are all 0
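For readers who want to try the idea without the array library used above, here is a standalone sketch of the same calculation on plain slices. The function name `nan_safe_similarity` is hypothetical; the logic mirrors the snippet (negative sum of squared differences, with nan terms contributing nothing) but is not the code from the branch.

```rust
/// Negative squared Euclidean distance between two rows, skipping every
/// coordinate pair in which either value is NaN.
pub fn nan_safe_similarity(a: &[f64], b: &[f64]) -> f64 {
    let sum: f64 = a
        .iter()
        .zip(b)
        .map(|(x, y)| (x - y).powi(2)) // NaN in either input yields a NaN term
        .filter(|d| !d.is_nan())       // skipping a term equals adding 0 for it
        .sum();
    -1.0 * sum
}
```

For rows `[1.0, NaN, 3.0]` and `[1.0, 4.0, 6.0]`, only the first and third coordinates count, giving -(0 + 9) = -9. A row that shares no observed coordinates with its partner gets similarity 0.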

I would be interested to see how this impacts predictions. In the tests I have added, I note that the predictive ability drops when high percentages of rows have nan in their input.

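There is a small tradeoff: because similarities here are negative sums of squared differences, 0 is the largest possible value, so zeroing out nan terms can inflate the total similarity. An argument could be made for replacing the dropped nan values with the preference rather than with 0.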
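The tradeoff can be seen in a short sketch. This is only one possible reading of the alternative: each nan term contributes a fixed non-negative amount, `missing_penalty`, to the sum of squared differences instead of 0. Both the function name and the parameter are hypothetical, and how the penalty would be derived from the preference is not specified in this thread.

```rust
/// Negative sum of squared differences, where every NaN term is replaced
/// by `missing_penalty` (hypothetical parameter; 0.0 reproduces zero-filling).
pub fn penalized_similarity(a: &[f64], b: &[f64], missing_penalty: f64) -> f64 {
    let sum: f64 = a
        .iter()
        .zip(b)
        .map(|(x, y)| {
            let d = (x - y).powi(2);
            if d.is_nan() { missing_penalty } else { d }
        })
        .sum();
    -1.0 * sum
}
```

Comparing `[1.0, NaN, NaN]` with `[1.0, 5.0, 7.0]` at a penalty of 0.0 gives similarity 0, the best possible score, while the complete row `[1.0, 2.0, 3.0]` scores -25 against the same partner. With a penalty of 4.0 per missing term, the sparse row scores -8 instead, so it no longer looks like a perfect match.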

For now, this version will remain in the v0.3.0-alpha branch. It is installable with the command:

cargo install --git https://github.com/cjneely10/affinityprop --branch v0.3.0-alpha

I'd be curious to see how this impacts real-world data.

Thanks, Chris