BaselAbujamous / clust

Automatic and optimised consensus clustering of one or more heterogeneous datasets

Warnings during data pre-processing #19

Closed apcamargo closed 4 years ago

apcamargo commented 5 years ago

Hi Basel,

I'm trying to run Clust on a count matrix with 28582 rows and 84 columns (excluding row names and column names), and I'm getting some warnings during the pre-processing step. The results seem normal.

This is the first time I'm getting these warnings. They didn't show up in any of my previous analyses.

I'm using NumPy 1.15.4.

count_matrix.tsv.zip

(clust) apcamargo@elementaryos:~/Documents/Clust$ python clust.py ../Data/count_matrix.tsv 

/===========================================================================\
|                                   Clust                                   |
|    (Optimised consensus clustering of multiple heterogenous datasets)     |
|           Python package version 1.8.9 (2018) Basel Abu-Jamous            |
+---------------------------------------------------------------------------+
| Analysis started at: Sunday 25 November 2018 (16:33:08)                   |
| 1. Reading dataset(s)                                                     |
| 2. Data pre-processing                                                    |
/home/apcamargo/Documents/Clust/clust/scripts/preprocess_data.py:458: RuntimeWarning: overflow encountered in power
  Xnew[l][ogi] = np.log2(np.sum(np.power(2.0, Xloc[l][np.in1d(OGsDatasets[l], og)]), axis=0))
|  - Automatic normalisation mode (default in v1.7.0+).                     |
|    Clust automatically normalises your dataset(s).                        |
|    To switch it off, use the `-n 0` option (not recommended).             |
|    Check https://github.com/BaselAbujamous/clust for details.             |
/home/apcamargo/anaconda3/envs/clust/lib/python2.7/site-packages/numpy/core/_methods.py:117: RuntimeWarning: invalid value encountered in subtract
  x = asanyarray(arr - arrmean)
/home/apcamargo/anaconda3/envs/clust/lib/python2.7/site-packages/numpy/core/function_base.py:133: RuntimeWarning: invalid value encountered in multiply
  y *= step
/home/apcamargo/Documents/Clust/clust/scripts/preprocess_data.py:85: RuntimeWarning: invalid value encountered in less
  return np.sum(X < v) * 1.0 / ds.numel(X)
/home/apcamargo/Documents/Clust/clust/scripts/numeric.py:102: RuntimeWarning: invalid value encountered in subtract
  return np.subtract(Xloc.transpose(), V).transpose()
|  - Flat expression profiles filtered out (default in v1.7.0+).            |
|    To switch it off, use the --no-fil-flat option (not recommended).      |
|    Check https://github.com/BaselAbujamous/clust for details.             |
| 3. Seed clusters production (the Bi-CoPaM method)                         |
| 10%                                                                       |
| 20%                                                                       |
| 30%                                                                       |
| 40%                                                                       |
| 50%                                                                       |
| 60%                                                                       |
| 70%                                                                       |
| 80%                                                                       |
| 90%                                                                       |
| 100%                                                                      |
| 4. Cluster evaluation and selection (the M-N scatter plots technique)     |
| 10%                                                                       |
| 20%                                                                       |
| 30%                                                                       |
| 40%                                                                       |
| 50%                                                                       |
| 60%                                                                       |
| 70%                                                                       |
| 80%                                                                       |
| 90%                                                                       |
| 100%                                                                      |
| 5. Cluster optimisation and completion                                    |
| 6. Saving results in                                                      |
| /home/apcamargo/Documents/Clust/Results_25_Nov_18        |
+---------------------------------------------------------------------------+
| Analysis finished at: Sunday 25 November 2018 (16:57:40)                  |
| Total time consumed: 0 hours, 24 minutes, and 32 seconds                  |
|                                                                           |
\===========================================================================/

/===========================================================================\
|                              RESULTS SUMMARY                              |
+---------------------------------------------------------------------------+
| Clust received 1 dataset with 28582 unique genes. After filtering, 28127  |
| genes made it to the clustering step. Clust generated 1 clusters of       |
| genes, which in total include 44 genes. The smallest cluster includes 44  |
| genes, the largest cluster includes 44 genes, and the average cluster     |
| size is 44.0 genes.                                                       |
+---------------------------------------------------------------------------+
|                                 Citation                                  |
|                                 ~~~~~~~~                                  |
| When publishing work that uses Clust, please include this citation:       |
| Basel Abu-Jamous and Steven Kelly (2018) Clust: automatic extraction of   |
| optimal co-expressed gene clusters from gene expression data. Genome      |
| Biology 19:172; doi: https://doi.org/10.1186/s13059-018-1536-8.           |
+---------------------------------------------------------------------------+
| For enquiries contact:                                                    |
| Basel Abu-Jamous                                                          |
| Department of Plant Sciences, University of Oxford                        |
| basel.abujamous@plants.ox.ac.uk                                           |
| baselabujamous@gmail.com                                                  |
\===========================================================================/
BaselAbujamous commented 5 years ago

Thanks for being helpful as always and reporting this.

This happened because most of the data values are zeros. Clust therefore found that more than 98% of the values are below 30.0 and concluded that the data are in log-scale. In this case, Clust treats the <2% of values larger than 30.0 as outliers. Because it assumes log-scale data, Clust sometimes calculates 2.0 to the power of the values, which throws this warning for the large values in the dataset.
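
To make the mechanism concrete, here is a minimal sketch of the behaviour described in this reply. It is not Clust's actual code: the helper names `looks_log_scaled` and `exp2_like_numpy` and the toy data are assumptions for illustration, and only the 30.0 threshold and the 98% fraction come from the explanation above. It uses plain Python, where `2.0 ** x` raises `OverflowError` for large `x`, whereas NumPy returns `inf` and emits the `RuntimeWarning` seen in the log.

```python
import math

LOG_SCALE_THRESHOLD = 30.0   # value quoted in the reply above
LOG_SCALE_FRACTION = 0.98    # "more than 98%" quoted in the reply above


def looks_log_scaled(values):
    """Hypothetical helper: guess log-scale if >98% of values are below 30.0."""
    below = sum(1 for v in values if v < LOG_SCALE_THRESHOLD)
    return below / float(len(values)) > LOG_SCALE_FRACTION


def exp2_like_numpy(x):
    """Return 2.0**x, mapping overflow to inf.

    Plain Python raises OverflowError here; NumPy instead returns inf
    and emits "overflow encountered in power".
    """
    try:
        return 2.0 ** x
    except OverflowError:
        return float("inf")


# Toy zero-inflated raw counts: 990 zeros and 10 large counts.
counts = [0.0] * 990 + [5000.0] * 10

# 99% of the values are below 30.0, so the heuristic guesses log-scale
# even though these are raw counts.
guessed_log = looks_log_scaled(counts)

# Undoing the assumed log2 overflows for the large counts.
unlogged = [exp2_like_numpy(v) for v in counts]
overflowed = [v for v in unlogged if math.isinf(v)]

# inf - inf is NaN, the kind of result that NumPy reports as
# "invalid value encountered in subtract".
nan_result = overflowed[0] - overflowed[0]

print(guessed_log, len(overflowed), math.isnan(nan_result))
```

Once an `inf` enters the matrix, later arithmetic on it can produce NaN, which would be consistent with the "invalid value" warnings that follow the overflow warning in the log, although the thread itself does not confirm that chain.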

I implemented a quick fix to overcome this problem and have now released a new version, v1.8.10, which includes this fix along with your previous contributions (plot transparency and README edits) and some other minor fixes of mine.

Thanks again for your feedback!

apcamargo commented 5 years ago

Thanks!

In any case, I think it's best if I disable the automatic normalization, since the data is zero-inflated rather than in log-scale.
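
For anyone deciding whether to do the same, a quick look at the matrix can help before passing the `-n 0` option mentioned in the banner above. The following is only a sketch under stated assumptions: `summarise_matrix` is a hypothetical helper, not part of Clust, and it assumes a tab-separated file with a header row and row names in the first column. The example values are made up.

```python
import csv
import io


def summarise_matrix(tsv_text):
    """Hypothetical helper: summarise the numeric cells of a TSV that has
    a column-name row and row names in the first column."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader)  # skip the column-name row
    values = [float(cell) for row in reader for cell in row[1:]]
    n = float(len(values))
    return {
        # Share of exact zeros: high values suggest zero inflation.
        "fraction_zero": sum(1 for v in values if v == 0.0) / n,
        # Share below the 30.0 threshold mentioned in the reply above.
        "fraction_below_30": sum(1 for v in values if v < 30.0) / n,
        "max": max(values),
        # Raw counts are non-negative integers; log-scale data rarely are.
        "all_nonneg_integers": all(v >= 0 and v == int(v) for v in values),
    }


# Made-up example: mostly zeros plus one large count.
example = "gene\ts1\ts2\ng1\t0\t0\ng2\t0\t1250\ng3\t3\t0\n"
summary = summarise_matrix(example)
print(summary)
```

A large share of zeros together with a maximum far above 30.0 and integer-only values points to raw counts rather than log-scale data, which is the situation described in this thread.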