BaselAbujamous / clust

Automatic and optimised consensus clustering of one or more heterogeneous datasets

Data pre-processing, AttributeError: 'float' object has no attribute 'replace' #29

Closed johnsolk closed 5 years ago

johnsolk commented 5 years ago

Hello @BaselAbujamous, thank you for providing this capacity for looking at gene expression data from multiple species! This is perfect for my project, and I'm very excited about clust.

I ran into a problem with the following command using Orthofinder output:

clust species-expression -d 17 -r species_replicates -m Orthogroups.csv 

The error is below. Can you please tell me if there is a problem with my formatting?

Here are the files used for this run:

curl -L https://osf.io/6f4yn/download -o Orthogroups.csv
curl -L https://osf.io/cbfst/download -o species_replicates
curl -L https://osf.io/sx546/download -o species-expression.tar.gz
tar -xvzf species-expression.tar.gz

Output error:

/===========================================================================\
|                                   Clust                                   |
|    (Optimised consensus clustering of multiple heterogenous datasets)     |
|           Python package version 1.8.10 (2018) Basel Abu-Jamous           |
+---------------------------------------------------------------------------+
| Analysis started at: Friday 25 January 2019 (00:16:46)                    |
| 1. Reading dataset(s)                                                     |
| 2. Data pre-processing                                                    |
Traceback (most recent call last):
  File "/opt/miniconda3/envs/run_clust/bin/clust", line 11, in <module>
    sys.exit(main())
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/__main__.py", line 98, in main
    args.cs, args.np, args.optimisation, args.q3s, args.basemethods, args.deterministic)
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/clustpipeline.py", line 97, in clustpipeline
    OGsIncludedIfAtLeastInDatasets=OGsIncludedIfAtLeastInDatasets)
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 436, in calculateGDMandUpdateDatasets
    OGsFirstColMap, delimGenesInMap)
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 397, in mapGenesToCommonIDs
    Maploc[i, j] = re.split(delimGenesInMap, Maploc[i, j].replace('.', 'thisisadot').replace('-', 'thisisadash').replace('/', 'thisisaslash'))
AttributeError: 'float' object has no attribute 'replace'

On an Ubuntu 18.04 instance, Conda py2.7 environment

# packages in environment at /opt/miniconda3/envs/run_clust:
#
# Name                    Version                   Build  Channel
_r-mutex                  1.0.0               anacondar_1    r
backports.functools-lru-cache 1.5                       <pip>
binutils_impl_linux-64    2.31.1               h6176602_1    conda-forge
binutils_linux-64         2.31.1               h6176602_3    conda-forge
blas                      1.0                         mkl  
blast                     2.5.0                hc0b0e79_3    bioconda
boost                     1.69.0          py27h8619c78_1000    conda-forge
boost-cpp                 1.69.0            h11c811c_1000    conda-forge
bwidget                   1.9.11                        1  
bzip2                     1.0.6             h14c3975_1002    conda-forge
ca-certificates           2018.11.29           ha4d7672_0    conda-forge
cairo                     1.14.12              h8948797_3  
certifi                   2018.11.29            py27_1000    conda-forge
clust                     1.8.10                    <pip>
curl                      7.63.0            h646f8bb_1000    conda-forge
cycler                    0.10.0                    <pip>
diamond                   0.9.21                        1    bioconda
dlcpar                    1.0              py27h24bf2e0_1    bioconda
fastme                    2.1.5                         0    bioconda
fasttree                  2.1.10               h470a237_2    bioconda
fontconfig                2.13.0               h9420a91_0  
freetype                  2.9.1             h94bbf69_1005    conda-forge
fribidi                   1.0.5             h14c3975_1000    conda-forge
gawk                      4.2.1             h14c3975_1000    conda-forge
gcc_impl_linux-64         7.3.0                habb00fd_1    conda-forge
gcc_linux-64              7.3.0                h553295d_3    conda-forge
gettext                   0.19.8.1          h9745a5d_1001    conda-forge
gfortran_impl_linux-64    7.3.0                hdf63c60_1  
gfortran_linux-64         7.3.0                h553295d_3  
glib                      2.56.2            had28632_1001    conda-forge
graphite2                 1.3.13            hf484d3e_1000    conda-forge
gsl                       2.4                  h14c3975_4  
gxx_impl_linux-64         7.3.0                hdf63c60_1    conda-forge
gxx_linux-64              7.3.0                h553295d_3    conda-forge
harfbuzz                  1.9.0             he243708_1001    conda-forge
icu                       58.2              hf484d3e_1000    conda-forge
intel-openmp              2019.1                      144  
iqtree                    1.6.9                he941832_0    bioconda
joblib                    0.13.1                    <pip>
jpeg                      9c                h14c3975_1001    conda-forge
kiwisolver                1.0.1                     <pip>
krb5                      1.16.3            hc83ff2d_1000    conda-forge
libcurl                   7.63.0            h01ee5af_1000    conda-forge
libedit                   3.1.20170329      hf8c457e_1001    conda-forge
libffi                    3.2.1             hf484d3e_1005    conda-forge
libgcc                    7.2.0                h69d50b8_2    conda-forge
libgcc-ng                 7.3.0                hdf63c60_0    conda-forge
libgfortran-ng            7.3.0                hdf63c60_0  
libiconv                  1.15              h14c3975_1004    conda-forge
libpng                    1.6.36            h84994c4_1000    conda-forge
libssh2                   1.8.0             h1ad7b7a_1003    conda-forge
libstdcxx-ng              7.3.0                hdf63c60_0    conda-forge
libtiff                   4.0.10            h648cc4a_1001    conda-forge
libuuid                   1.0.3                         1    conda-forge
libxcb                    1.13              h14c3975_1002    conda-forge
libxml2                   2.9.8             h143f9aa_1005    conda-forge
llvm-meta                 7.0.0                         0    conda-forge
mafft                     7.407                         0    bioconda
make                      4.2.1             h14c3975_2004    conda-forge
matplotlib                2.2.3                     <pip>
mcl                       14.137          pl526h470a237_4    bioconda
mkl                       2019.1                      144  
mkl_fft                   1.0.10           py27h470a237_1    conda-forge
mkl_random                1.0.2                    py27_0    conda-forge
mmseqs2                   7.4e23d              h21aa3a5_1    bioconda
muscle                    3.8.1551             h2d50403_3    bioconda
ncurses                   6.1               hf484d3e_1002    conda-forge
numpy                     1.15.4           py27h7e9f1db_0  
numpy                     1.16.0                    <pip>
numpy-base                1.15.4           py27hde5b4d6_0  
openmp                    7.0.0                h2d50403_0    conda-forge
openssl                   1.0.2p            h14c3975_1002    conda-forge
orthofinder               2.2.7                         0    bioconda
pandas                    0.23.4                    <pip>
pango                     1.42.4               h049681c_0  
pcre                      8.42                 h439df22_0  
perl                      5.26.2            h14c3975_1000    conda-forge
pip                       18.1                  py27_1000    conda-forge
pixman                    0.34.0            h14c3975_1003    conda-forge
portalocker               1.3.0                     <pip>
pthread-stubs             0.4               h14c3975_1001    conda-forge
pyparsing                 2.3.1                     <pip>
python                    2.7.15            h938d71a_1006    conda-forge
python-dateutil           2.7.5                     <pip>
pytz                      2018.9                    <pip>
r-base                    3.5.1                h1e0a451_2    r
raxml                     8.2.12               h470a237_0    bioconda
readline                  7.0               hf8c457e_1001    conda-forge
scikit-learn              0.20.2                    <pip>
scipy                     1.2.0                     <pip>
scipy                     1.1.0            py27h7c811a0_2  
setuptools                40.6.3                   py27_0    conda-forge
six                       1.12.0                    <pip>
sklearn                   0.0                       <pip>
sompy                     0.1.1                     <pip>
sqlite                    3.26.0            h67949de_1000    conda-forge
subprocess32              3.5.3                     <pip>
tk                        8.6.9             h84994c4_1000    conda-forge
tktable                   2.10                 h14c3975_0  
wheel                     0.32.3                   py27_0    conda-forge
xorg-libxau               1.0.8             h14c3975_1006    conda-forge
xorg-libxdmcp             1.1.2             h14c3975_1007    conda-forge
xz                        5.2.4             h14c3975_1001    conda-forge
zlib                      1.2.11            h14c3975_1004    conda-forge

johnsolk commented 5 years ago

I tried to work around this problem by assigning each transcript ID its orthogroup (OG) from the OrthoFinder output by hand. Some genes are still missing, and the error below occurs. I think it is similar to the bug identified in #28.

Command:

clust species_expression -d 16 -r species_replicates

Output

/===========================================================================\
|                                   Clust                                   |
|    (Optimised consensus clustering of multiple heterogenous datasets)     |
|           Python package version 1.8.10 (2018) Basel Abu-Jamous           |
+---------------------------------------------------------------------------+
| Analysis started at: Friday 25 January 2019 (23:19:29)                    |
| 1. Reading dataset(s)                                                     |
| 2. Data pre-processing                                                    |
|  - Automatic normalisation mode (default in v1.7.0+).                     |
|    Clust automatically normalises your dataset(s).                        |
|    To switch it off, use the `-n 0` option (not recommended).             |
|    Check https://github.com/BaselAbujamous/clust for details.             |
Traceback (most recent call last):
  File "/opt/miniconda3/envs/run_clust/bin/clust", line 11, in <module>
    sys.exit(main())
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/__main__.py", line 98, in main
    args.cs, args.np, args.optimisation, args.q3s, args.basemethods, args.deterministic)
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/clustpipeline.py", line 102, in clustpipeline
    filteringtype=filteringtype, filterflat=filflat, params=None, datafiles=datafiles)
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 675, in preprocess
    (Xproc[l], codes) = normaliseSampleFeatureMat(Xproc[l], normaliseloc[l])
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 273, in normaliseSampleFeatureMat
    Xout, codesi = normaliseSampleFeatureMat(Xout, type[i])
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 363, in normaliseSampleFeatureMat
    codes = autoNormalise(Xout)
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 213, in autoNormalise
    Xl = normaliseSampleFeatureMat(Xloc, [3])[0]  # index 1  (Xloc, i.e. original X is index 0)
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 273, in normaliseSampleFeatureMat
    Xout, codesi = normaliseSampleFeatureMat(Xout, type[i])
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 295, in normaliseSampleFeatureMat
    Xout[ind1] = fixnans(Xout[ind1])
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 70, in fixnans
    sumnans = sum(isnan(Xinloc[i]))
TypeError: 'bool' object is not iterable

BaselAbujamous commented 5 years ago

Thanks again, Lisa, for reporting this bug. I found the issue and fixed it. Try installing clust version 1.8.11 (the latest version) and it should work :)

Looking forward to your next blog post :)

If any further problems appear please let me know.

All the best Basel

BaselAbujamous commented 5 years ago

Hi again, I have been testing clust on your data. The run is taking a long time, but that is okay.

Clust exited with another error, caused by the fact that one of the 17 datasets, "F_notti.tsv", has only one condition. The replicates file shows that this dataset has two samples that are replicates of a single condition, so once the two replicates are summarised, the dataset has a single column of data. Clustering doesn't really make sense over a single condition (a single dimension). This error is explained in issue #14.
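As a rough illustration of why a single-column dataset can end in `TypeError: 'bool' object is not iterable`, here is a minimal pure-Python sketch. It assumes that the summarised dataset collapses to one dimension, so that indexing a "row" returns a bare number instead of a sequence; the thread does not confirm the exact mechanism inside `fixnans`. The helper `count_nans_per_row` is hypothetical and is not part of clust.

```python
import math

def count_nans_per_row(matrix):
    """Count NaNs in each row; a bare number is treated as a one-value row."""
    counts = []
    for row in matrix:
        # Guard: wrap a scalar so that every row is iterable.
        values = row if isinstance(row, (list, tuple)) else [row]
        counts.append(sum(math.isnan(v) for v in values))
    return counts

two_conditions = [[1.0, float("nan")], [2.0, 3.0]]  # genes x conditions
one_condition = [1.0, float("nan")]                 # assumed 1-D collapse

try:
    # Same shape as the failing call: sum() over a single boolean.
    sum(math.isnan(one_condition[0]))
except TypeError as err:
    print(err)  # 'bool' object is not iterable

print(count_nans_per_row(two_conditions))  # [1, 0]
print(count_nans_per_row(one_condition))   # [0, 1]
```

The guard only shows how the crash could be avoided. It does not make clustering over a single condition meaningful.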

Possible solutions:

  1. Remove the row related to this dataset from the replicates file, so that clust automatically treats the two samples in this dataset as two independent samples (I am testing this now).
  2. Exclude this particular dataset from the analysis, as it does not have sufficient complexity for cluster analysis.
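Before a long run, the replicates file can be checked for datasets of this kind. The sketch below assumes a whitespace-delimited replicates file in which each row reads `<dataset file> <condition name> <sample> <sample> ...`; that layout, the helper names, and the dataset and sample names other than "F_notti.tsv" are illustrative assumptions, so check them against the clust documentation before relying on this.

```python
from collections import defaultdict

def conditions_per_dataset(replicates_text):
    """Map each dataset name to the set of condition names listed for it."""
    conditions = defaultdict(set)
    for line in replicates_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0].startswith("#"):
            continue  # skip blank lines, comments, and incomplete rows
        dataset, condition = fields[0], fields[1]
        conditions[dataset].add(condition)
    return dict(conditions)

def single_condition_datasets(replicates_text):
    """Return datasets that would be left with one column after summarising."""
    found = conditions_per_dataset(replicates_text)
    return sorted(name for name, conds in found.items() if len(conds) < 2)

def drop_dataset(replicates_text, dataset):
    """Option 1: remove every row that belongs to the given dataset."""
    kept = [ln for ln in replicates_text.splitlines()
            if ln.split()[:1] != [dataset]]
    return "\n".join(kept) + "\n"

# Hypothetical replicates file contents.
example = (
    "A_xenica.tsv cond1 s1 s2\n"
    "A_xenica.tsv cond2 s3 s4\n"
    "F_notti.tsv cond1 s1 s2\n"
)

print(single_condition_datasets(example))  # ['F_notti.tsv']
print(drop_dataset(example, "F_notti.tsv"), end="")
```

`drop_dataset` corresponds to option 1. Option 2 would instead mean removing the data file itself from the data directory.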

Best! Basel

johnsolk commented 5 years ago

Thanks, @BaselAbujamous! I did update to version 1.8.11. However, I now get the error below.

Here are re-formatted files with OG assigned by hand, if you would like to take a look:

curl -L https://osf.io/cbfst/download -o species_replicates
curl -L https://osf.io/muxaf/download -o species-expression-OG.tar.gz
tar -xvzf species-expression-OG.tar.gz

Command:

 clust species_expression -d 16 -r species_replicates

Output:

/===========================================================================\
|                                   Clust                                   |
|    (Optimised consensus clustering of multiple heterogenous datasets)     |
|           Python package version 1.8.11 (2018) Basel Abu-Jamous           |
+---------------------------------------------------------------------------+
| Analysis started at: Saturday 26 January 2019 (19:21:30)                  |
| 1. Reading dataset(s)                                                     |
| 2. Data pre-processing                                                    |
|  - Automatic normalisation mode (default in v1.7.0+).                     |
|    Clust automatically normalises your dataset(s).                        |
|    To switch it off, use the `-n 0` option (not recommended).             |
|    Check https://github.com/BaselAbujamous/clust for details.             |
Traceback (most recent call last):
  File "/opt/miniconda3/envs/run_clust/bin/clust", line 11, in <module>
    sys.exit(main())
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/__main__.py", line 98, in main
    args.cs, args.np, args.optimisation, args.q3s, args.basemethods, args.deterministic)
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/clustpipeline.py", line 102, in clustpipeline
    filteringtype=filteringtype, filterflat=filflat, params=None, datafiles=datafiles)
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 675, in preprocess
    (Xproc[l], codes) = normaliseSampleFeatureMat(Xproc[l], normaliseloc[l])
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 273, in normaliseSampleFeatureMat
    Xout, codesi = normaliseSampleFeatureMat(Xout, type[i])
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 363, in normaliseSampleFeatureMat
    codes = autoNormalise(Xout)
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 213, in autoNormalise
    Xl = normaliseSampleFeatureMat(Xloc, [3])[0]  # index 1  (Xloc, i.e. original X is index 0)
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 273, in normaliseSampleFeatureMat
    Xout, codesi = normaliseSampleFeatureMat(Xout, type[i])
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 295, in normaliseSampleFeatureMat
    Xout[ind1] = fixnans(Xout[ind1])
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 70, in fixnans
    sumnans = sum(isnan(Xinloc[i]))
TypeError: 'bool' object is not iterable

BaselAbujamous commented 5 years ago

Hi,

The problem fixed in the new version was the one related to reading missing orthologues from the orthogroups file. The error you have just reported is a different one: it is the error I described in my last comment above, related to the dataset "F_notti.tsv".
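The missing-orthologue failure can be reproduced in isolation. The sketch below assumes that an empty cell in the orthogroups file is loaded as a float NaN rather than as an empty string, which would explain the original `AttributeError`. The helper `split_genes` and the delimiter pattern are hypothetical and are not clust's actual fix. It is written for Python 3, whereas the environment in the thread is Python 2.7.

```python
import re

# Hypothetical delimiter between gene IDs in one cell; clust's may differ.
DELIM = r"[,\s]+"

def split_genes(cell, delim=DELIM):
    """Split one orthogroup-map cell into gene IDs; missing cells give []."""
    # An empty CSV cell is commonly loaded as float NaN, not as "".
    if not isinstance(cell, str) or not cell.strip():
        return []
    return [gene for gene in re.split(delim, cell.strip()) if gene]

# One orthogroup row: the second species has no orthologue.
row = ["OG0000001", "geneA1, geneA2", float("nan")]

try:
    row[2].replace(".", "thisisadot")  # the call that failed in the traceback
except AttributeError as err:
    print(err)  # 'float' object has no attribute 'replace'

print([split_genes(cell) for cell in row[1:]])  # [['geneA1', 'geneA2'], []]
```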

Your data seems to properly test clust for multiple species! I like that! These iterations will make it robust.

Thanks and all the best! Basel

BaselAbujamous commented 5 years ago

Hi again :)

I have found another bug related to analysing your data. I believe I have fixed it. It is being tested on your data now before releasing version 1.8.12.

BaselAbujamous commented 5 years ago

Hi one more time.

Version 1.8.12 fixes another bug. It was triggered when I removed the line for the "F_notti.tsv" dataset from the replicates file (for the reasons explained a few comments above).

It should work now. I am happy to follow up on any further questions or discussion.

All the best Basel

BaselAbujamous commented 5 years ago

Hi. I believe this issue has been resolved so I am closing it. Please feel free to reopen it or to submit any other issue.

All the best Basel