BaselAbujamous / clust

Automatic and optimised consensus clustering of one or more heterogeneous datasets

Problems when clustering log-transformed two-colour micro-array data #30

Closed emvcaest closed 5 years ago

emvcaest commented 5 years ago

Hi Basel,

First of all, thanks for building this tool. I have already used it to process several RNA-Seq datasets, which went effortlessly.

However, I would now like to re-process public micro-array data available on GEO (such as https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE17237). I used the SOFT-formatted files to map the probes to the current gene annotation and reformatted the result into an expression matrix (available here: GSE17237.txt).

For this example, according to the SOFT file, the expression values represent the following: "Data were analyzed using the limma package and the R statistical data analysis program (R 2.7.1). Due to some spread in M-values data was scale normalized between arrays at each timepoint. Values in matrix table are given as log2 ratios (test/reference)"

When I run clust using the normalisation option -n 6, or -n 0, I get the following error:

/==================================================================\
| Clust |
| (Optimised consensus clustering of multiple heterogenous datasets) |
| Python package version 1.8.12 (2018) Basel Abu-Jamous |
+---------------------------------------------------------------------------+
| Analysis started at: Thursday 07 February 2019 (17:38:45) |
| 1. Reading dataset(s) |
| 2. Data pre-processing |
Traceback (most recent call last):
  File "/software/shared/apps/x86_64/clust/1.8.12/bin/clust", line 10, in <module>
    sys.exit(main())
  File "/shared/clssoft/apps/x86_64/clust/1.8.12/lib/python2.7/site-packages/clust/main.py", line 98, in main
    args.cs, args.np, args.optimisation, args.q3s, args.basemethods, args.deterministic)
  File "/shared/clssoft/apps/x86_64/clust/1.8.12/lib/python2.7/site-packages/clust/clustpipeline.py", line 97, in clustpipeline
    OGsIncludedIfAtLeastInDatasets=OGsIncludedIfAtLeastInDatasets)
  File "/shared/clssoft/apps/x86_64/clust/1.8.12/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 465, in calculateGDMandUpdateDatasets
    Xnew[l][ogi] = np.log2(np.sum(np.power(2.0, Xloc[l][np.in1d(OGsDatasets[l], og)]), axis=0))
AttributeError: 'float' object has no attribute 'log2'

Could it be that there is a problem when the input data is already log-transformed?

Thanks in advance, Emmelien

BaselAbujamous commented 5 years ago

Hi Emmelien,

Thanks for using Clust and for your question. Problem identified :)

The header line in the data file GSE17237.txt starts as:

ID TAB TAB GSM431528 TAB GSM431529 TAB GSM431530 ... etc.

There are two TAB characters after "ID" and before the title of the first column, so the method thinks your file has one more column than it really does. Just remove the extra TAB after "ID" and see how things go :)
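For anyone hitting the same symptom, here is a minimal sketch of how one might check a tab-delimited expression matrix for this kind of header/row mismatch before running Clust. It is not part of Clust: the helper names `find_column_mismatches` and `collapse_empty_header_fields` are hypothetical, and the gene names and values in the sample are made up for illustration (only the GSM sample IDs come from the header quoted above).

```python
import csv
import io


def find_column_mismatches(text, delimiter="\t"):
    """Return the header width and a list of (line_number, n_fields)
    for every non-empty data row whose width differs from the header."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header_width = len(rows[0])
    mismatches = [
        (line_no, len(row))
        for line_no, row in enumerate(rows[1:], start=2)
        if row and len(row) != header_width
    ]
    return header_width, mismatches


def collapse_empty_header_fields(text, delimiter="\t"):
    """Drop empty fields from the header line only; data rows are untouched.
    Assumption: an empty header field is always an accidental extra delimiter."""
    header, newline, body = text.partition("\n")
    fields = [field for field in header.split(delimiter) if field.strip()]
    return delimiter.join(fields) + newline + body


# Illustrative data: the doubled TAB after "ID" mirrors the reported header;
# the gene names and numbers are invented for this example.
sample = (
    "ID\t\tGSM431528\tGSM431529\n"
    "geneA\t0.12\t-0.40\n"
    "geneB\t1.05\t0.33\n"
)

width_before, bad_rows_before = find_column_mismatches(sample)
fixed = collapse_empty_header_fields(sample)
width_after, bad_rows_after = find_column_mismatches(fixed)
```

With the doubled TAB, the header is read as four fields while each data row has three, so every data row is reported as a mismatch; after collapsing the empty header field the widths agree. To run this on a real file, read its contents with `open(path).read()` and pass the text to the same functions.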

Best wishes! Basel

emvcaest commented 5 years ago

Hi Basel,

That was indeed the problem, thanks for the help and sorry for bothering you with such a stupid mistake on my side.

Emmelien

BaselAbujamous commented 5 years ago

No worries, Emmelien. So I will close this issue, and please feel free to come back with any other questions or issues.

All the best, Basel