BaselAbujamous / clust

Automatic and optimised consensus clustering of one or more heterogeneous datasets

How to deal with missing data values #47

Closed andhartm closed 5 years ago

andhartm commented 5 years ago

First of all, thanks for this great tool! My data files have missing values for quite a number of genes (set to N/A). When I run clust on these files I get an error message, but clust still finishes the run. Removing all rows that contain N/A values works, but I don't want to lose all that data.

Error message:
c:[ ].py:19: RuntimeWarning: invalid value encountered in greater
  I = np.bitwise_and(~isnan(X), X>0)
c:[ ].py:465: RuntimeWarning: invalid value encountered in power
  Xnew[l][ogi] = np.log2(np.sum(np.power(2.0, Xloc[l][np.in1d(OGsDatasets[l], og)]), axis=0))

Data:
GeneID | Treatment 1 | Treatment 2 | Treatment 3 | Treatment 4 | Treatment 5 | Treatment 6
1 | 4.273093893 | 0 | 1.946402008 | 1.374515554 | 2.655817399 | 5.267132206
2 | 5.956198005 | N/A | N/A | N/A | N/A | 5.266617765
3 | N/A | 0 | N/A | 0 | N/A | 5.264203631
4 | 0 | 0 | N/A | 0 | N/A | 5.261192058
6 | 3.96170082 | 1.7741793 | 0 | 1.612520247 | 3.915867084 | 5.259103225
7 | 5.118588008 | 0 | 3.888582101 | 0 | 0 | 5.257160244
8 | 4.393112039 | 0 | N/A | N/A | N/A | 5.252373101
…

How should I deal with this issue? (Writing N/A and leaving the cells blank both give the same error message.) How does clust handle this kind of data? Does it automatically remove these rows?

Thanks for your help!

BaselAbujamous commented 5 years ago

Hi Andreas

Thanks for your question. This is a "warning" that does not affect clust's run or results. I will try to suppress it in future releases.
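To illustrate why the message is only a warning, here is a minimal sketch (not clust's actual code) that applies the same kind of NaN-aware mask as the first line in the traceback. The array `X` is made-up example data, and wrapping the comparison in `np.errstate` is just one possible way to silence the warning, not necessarily how a future release would do it. Some NumPy versions do not emit the warning at all for this comparison.

```python
import numpy as np

# Made-up example data with missing values stored as NaN.
X = np.array([[4.27, 0.0, np.nan],
              [np.nan, 0.0, 5.26]])

# Comparing NaN with 0 always yields False, so the mask is still correct;
# errstate only hides the "invalid value encountered" RuntimeWarning.
with np.errstate(invalid="ignore"):
    positive = np.bitwise_and(~np.isnan(X), X > 0)

print(positive)
```

The mask marks only the entries that are both present and greater than zero, so the NaN entries drop out whether or not the warning is shown.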

Clust interpolates those N/A values with spline interpolation. If you feel that some rows have too many N/A values, it might make sense to filter them out.
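The filtering step could be done before handing the file to clust. The sketch below is one possible approach, not part of clust: the function name `filter_rows`, the 50% threshold, the set of missing-value markers, and the tab-delimited layout are all assumptions chosen for illustration.

```python
import csv
import io

# Assumption: these strings mark a missing value in the input file.
MISSING = {"", "N/A", "NA", "NaN"}

def filter_rows(text, max_missing_fraction=0.5, delimiter="\t"):
    """Keep the header plus every row whose share of missing values
    does not exceed max_missing_fraction (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    kept = [header]
    for row in reader:
        values = row[1:]  # first column is the gene identifier
        n_missing = sum(v.strip() in MISSING for v in values)
        if values and n_missing / len(values) <= max_missing_fraction:
            kept.append(row)
    return kept

# Small made-up example in the same shape as the data in this thread.
example = "\n".join([
    "GeneID\tT1\tT2\tT3\tT4",
    "1\t4.27\t0\t1.95\t1.37",
    "2\t5.96\tN/A\tN/A\tN/A",
    "3\tN/A\t0\tN/A\t0",
])
kept = filter_rows(example, max_missing_fraction=0.5)
kept_ids = [row[0] for row in kept[1:]]
print(kept_ids)
```

Here gene 2 is dropped because three of its four values are missing, while gene 3 sits exactly at the threshold and is kept. The remaining N/A values would then be left for clust to interpolate.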

I hope this answers your questions.

Please let me know if you have other questions. Best wishes, Basel

andhartm commented 5 years ago

Hi Basel,

Thanks for the quick response. This answers my questions! Will try to filter my data a bit more.

Cheers, Andreas