BaselAbujamous / clust

Automatic and optimised consensus clustering of one or more heterogeneous datasets

Type conversion errors when reading replicates file #76

Open ijhoskins opened 2 years ago

ijhoskins commented 2 years ago

Hello,

I have been unable to include alphanumeric text in fields 2 or 3 of the replicates file without encountering an error:

| Analysis started at: Thursday 12 May 2022 (15:44:38)                      |
| 1. Reading dataset(s)                                                     |
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1113, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/anaconda3/envs/clust/bin/clust", line 10, in <module>
    sys.exit(main())
  File "/opt/anaconda3/envs/clust/lib/python3.10/site-packages/clust/__main__.py", line 102, in main
    clustpipeline.clustpipeline(args.datapath, args.m, args.r, args.n, args.o, args.K, args.t,
  File "/opt/anaconda3/envs/clust/lib/python3.10/site-packages/clust/clustpipeline.py", line 86, in clustpipeline
    (X, replicates, Genes, datafiles) = io.readDatasetsFromDirectory(datapath, delimiter='\t| |, |; |,|;', skiprows=1, skipcolumns=1,
  File "/opt/anaconda3/envs/clust/lib/python3.10/site-packages/clust/scripts/io.py", line 46, in readDatasetsFromDirectory
    datafilesread = readDataFromFiles(datafileswithpath, delimiter, float, skiprows, skipcolumns, returnSkipped)
  File "/opt/anaconda3/envs/clust/lib/python3.10/site-packages/clust/scripts/io.py", line 204, in readDataFromFiles
    X[l] = pdreadcsv_regexdelim(datafiles[l], delimiter=delimiter, dtype=dtype, skiprows=skiprows,
  File "/opt/anaconda3/envs/clust/lib/python3.10/site-packages/clust/scripts/io.py", line 239, in pdreadcsv_regexdelim
    result = pd.read_csv(StringIO('\n'.join(re.sub(delimiter, '\t', str(x)) for x in f)),
  File "/opt/anaconda3/envs/clust/lib/python3.10/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/opt/anaconda3/envs/clust/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/anaconda3/envs/clust/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 581, in _read
    return parser.read(nrows)
  File "/opt/anaconda3/envs/clust/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1254, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/opt/anaconda3/envs/clust/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 225, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 883, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1026, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1119, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: could not convert string to float: 'B'

Here is the example problematic replicates file:

clust_bio_comb.txt      A       1.x,1.y
clust_bio_comb.txt      B       2.x,2.y
clust_bio_comb.txt      C       3.x,3.y
clust_bio_comb.txt      D       4.x,4.y
clust_bio_comb.txt      E       5.x,5.y

If I then convert all names in fields 2 and 3 to integers, I run into another error:

| Analysis started at: Thursday 12 May 2022 (15:50:20)                      |
| 1. Reading dataset(s)                                                     |
Traceback (most recent call last):
  File "/opt/anaconda3/envs/clust/bin/clust", line 10, in <module>
    sys.exit(main())
  File "/opt/anaconda3/envs/clust/lib/python3.10/site-packages/clust/__main__.py", line 102, in main
    clustpipeline.clustpipeline(args.datapath, args.m, args.r, args.n, args.o, args.K, args.t,
  File "/opt/anaconda3/envs/clust/lib/python3.10/site-packages/clust/clustpipeline.py", line 92, in clustpipeline
    (replicatesIDs, conditions) = io.readReplicates(replicatesfile, datapath, datafiles, replicates)
  File "/opt/anaconda3/envs/clust/lib/python3.10/site-packages/clust/scripts/io.py", line 125, in readReplicates
    conditions[c] = line[1:]
TypeError: 'filter' object is not subscriptable

https://github.com/BaselAbujamous/clust/issues/62
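
For readers hitting the second traceback: in Python 3, `filter()` returns a lazy iterator rather than a list, so slicing it with `line[1:]` raises exactly this `TypeError`. The sketch below only illustrates that behaviour. How clust actually builds `line` is not shown in this thread, so the `fields` variable and the `filter(None, ...)` call are assumptions made for illustration.

```python
# One row of the replicates file from this thread, split on tabs.
# `fields` is a hypothetical stand-in for however clust tokenises the row.
fields = "clust_bio_comb.txt\tA\t1.x,1.y".split("\t")

# In Python 3, filter() returns an iterator, which cannot be sliced.
line = filter(None, fields)
try:
    line[1:]
    error_message = None
except TypeError as err:
    error_message = str(err)

# Materialising the iterator as a list makes slicing work again.
line = list(filter(None, fields))
rest = line[1:]

print(error_message)
print(rest)
```

Wrapping the `filter(...)` call in `list(...)` is the usual way to port such code from Python 2, where `filter` returned a list.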

@BaselAbujamous do you have any formatting recommendations to bypass these errors?

yichangyu commented 1 year ago

Same here, even using the example data.

BaselAbujamous commented 1 year ago

Hello both. Thanks for reporting this. Have you tried the most recent release?

yichangyu commented 1 year ago

Hi Basel,

Yes, I used the latest version (1.17.0). I was able to run it a year ago on my old laptop, but today I tried again on that laptop and got the same error.

Thanks

BaselAbujamous commented 1 year ago

Hello Changyu,

The latest version is 1.18.1. The errors occurred because some dependency packages (e.g. scipy) changed their behaviour in recent versions, which broke clust. Release 1.18.1 was patched to work around these changes.

Please let me know if this solves it or not.

Best, Basel

yichangyu commented 1 year ago

Hi Basel,

I tried using conda to install clust, but the latest version it can install is 1.17.0. I then tried sudo pip install clust, which only installs 1.18.0. Finally I tried sudo pip install clust==1.18.1, which returns the error: No matching distribution found for clust==1.18.1. Could you please help fix this?

Thanks, Changyu
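
When conda, pip, and a downloaded tarball each provide a different version, it can help to check which version the running interpreter actually sees. The following is a minimal sketch using the standard library's `importlib.metadata` (Python 3.8+). The helper name `installed_version` is made up for this example and is not part of clust.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "clust" is the distribution name used with pip in this thread.
print("clust version seen by this interpreter:", installed_version("clust"))
```

This reports only what is installed in the current environment. A copy of clust run directly from an unpacked tarball is not covered by this check.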

yichangyu commented 1 year ago

Hi Basel,

I tried the fourth install method, as below:

wget  https://github.com/BaselAbujamous/clust/releases/download/v1.18.1/clust-1.18.1.tar.gz
sudo tar -xvzf clust-1.18.1.tar.gz
sudo python3 clust-1.18.1/clust.py . -r Replicates.txt

I still get the same error, shown below. Please note that the output still reports the version as 1.18.0:

/===========================================================================\
|                                   Clust                                   |
|    (Optimised consensus clustering of multiple heterogenous datasets)     |
|           Python package version 1.18.0 (2022) Basel Abu-Jamous           |
+---------------------------------------------------------------------------+
| Analysis started at: Tuesday 13 December 2022 (14:27:34)                  |
| 1. Reading dataset(s)                                                     |
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1124, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "clust-1.18.1/clust.py", line 6, in <module>
    main(args)
  File "/mnt/c/Users/cyi/clust/clust-1.18.1/clust/__main__.py", line 102, in main
    clustpipeline.clustpipeline(args.datapath, args.m, args.r, args.n, args.o, args.K, args.t,
  File "/mnt/c/Users/cyi/clust/clust-1.18.1/clust/clustpipeline.py", line 86, in clustpipeline
    (X, replicates, Genes, datafiles) = io.readDatasetsFromDirectory(datapath, delimiter='\t| |, |; |,|;', skiprows=1, skipcolumns=1,
  File "/mnt/c/Users/cyi/clust/clust-1.18.1/clust/scripts/io.py", line 46, in readDatasetsFromDirectory
    datafilesread = readDataFromFiles(datafileswithpath, delimiter, float, skiprows, skipcolumns, returnSkipped)
  File "/mnt/c/Users/cyi/clust/clust-1.18.1/clust/scripts/io.py", line 204, in readDataFromFiles
    X[l] = pdreadcsv_regexdelim(datafiles[l], delimiter=delimiter, dtype=dtype, skiprows=skiprows,
  File "/mnt/c/Users/cyi/clust/clust-1.18.1/clust/scripts/io.py", line 239, in pdreadcsv_regexdelim
    result = pd.read_csv(StringIO('\n'.join(re.sub(delimiter, '\t', str(x)) for x in f)),
  File "/usr/local/lib/python3.8/dist-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 611, in _read
    return parser.read(nrows)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 1778, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/c_parser_wrapper.py", line 230, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1037, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1130, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: could not convert string to float: 'B'

The installed packages are listed below:

attrs==19.3.0
Automat==0.8.0
blinker==1.4
certifi==2019.11.28
chardet==3.0.4
Click==7.0
cloud-init==22.4.2
colorama==0.4.3
command-not-found==0.3
configobj==5.0.6
constantly==15.1.0
contourpy==1.0.6
cryptography==2.8
cycler==0.11.0
dbus-python==1.2.16
distro==1.4.0
distro-info===0.23ubuntu1
entrypoints==0.3
fonttools==4.38.0
httplib2==0.14.0
hyperlink==19.0.0
idna==2.8
importlib-metadata==1.5.0
incremental==16.10.1
Jinja2==2.10.1
joblib==1.2.0
jsonpatch==1.22
jsonpointer==2.0
jsonschema==3.2.0
keyring==18.0.1
kiwisolver==1.4.4
language-selector==0.1
launchpadlib==1.10.13
lazr.restfulclient==0.14.2
lazr.uri==1.0.3
MarkupSafe==1.1.0
matplotlib==3.6.2
more-itertools==4.2.0
netifaces==0.10.4
numpy==1.23.5
oauthlib==3.1.0
packaging==22.0
pandas==1.5.2
pexpect==4.6.0
Pillow==9.3.0
portalocker==2.6.0
pyasn1==0.4.2
pyasn1-modules==0.2.1
PyGObject==3.36.0
PyHamcrest==1.9.0
PyJWT==1.7.1
pymacaroons==0.13.0
PyNaCl==1.3.0
pyOpenSSL==19.0.0
pyparsing==3.0.9
pyrsistent==0.15.5
pyserial==3.4
python-apt==2.0.0+ubuntu0.20.4.8
python-dateutil==2.8.2
python-debian===0.1.36ubuntu1
pytz==2022.6
PyYAML==5.3.1
requests==2.22.0
requests-unixsocket==0.2.0
scikit-learn==1.2.0
scipy==1.9.3
SecretStorage==2.3.1
service-identity==18.1.0
simplejson==3.16.0
six==1.14.0
sos==4.4
ssh-import-id==5.10
systemd-python==234
threadpoolctl==3.1.0
Twisted==18.9.0
ubuntu-advantage-tools==27.12
ufw==0.36
unattended-upgrades==0.1
urllib3==1.25.8
wadllib==1.3.3
zipp==1.0.0
zope.interface==4.7.1

PaulAirs commented 5 months ago

Solved it! The data directory and the replicates file need to be specified separately. It helps to put the data files in their own subfolder, away from the replicates file. (Make sure to give both paths, or cd to the parent folder first.)

See the example below: in the failed run all the files were in one folder inside my Downloads folder, whereas in the working run the data files were in a subdirectory.

FAILED VERSION - ALL IN SAME FOLDER

pma37@dhcp-10-248-206-95 Clust_Example % clust /Users/pma37/Downloads/Clust_Example/ -r Replicates.txt

/===========================================================================\
|                                   Clust                                   |
|    (Optimised consensus clustering of multiple heterogenous datasets)     |
|           Python package version 1.18.0 (2022) Basel Abu-Jamous           |
+---------------------------------------------------------------------------+
| Analysis started at: Thursday 07 March 2024 (10:52:06)                    |
| 1. Reading dataset(s)                                                     |
Traceback (most recent call last):
  File "parsers.pyx", line 1161, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.11/bin/clust", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/clust/__main__.py", line 102, in main
    clustpipeline.clustpipeline(args.datapath, args.m, args.r, args.n, args.o, args.K, args.t,
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/clust/clustpipeline.py", line 86, in clustpipeline
    (X, replicates, Genes, datafiles) = io.readDatasetsFromDirectory(datapath, delimiter='\t| |, |; |,|;', skiprows=1, skipcolumns=1,
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/clust/scripts/io.py", line 46, in readDatasetsFromDirectory
    datafilesread = readDataFromFiles(datafileswithpath, delimiter, float, skiprows, skipcolumns, returnSkipped)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/clust/scripts/io.py", line 204, in readDataFromFiles
    X[l] = pdreadcsv_regexdelim(datafiles[l], delimiter=delimiter, dtype=dtype, skiprows=skiprows,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/clust/scripts/io.py", line 239, in pdreadcsv_regexdelim
    result = pd.read_csv(StringIO('\n'.join(re.sub(delimiter, '\t', str(x)) for x in f)),
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 626, in _read
    return parser.read(nrows)
           ^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "parsers.pyx", line 838, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 921, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 1066, in pandas._libs.parsers.TextReader._convert_column_data
  File "parsers.pyx", line 1167, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: could not convert string to float: 'B'
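
The failure above is consistent with the replicates file itself being parsed as a numeric dataset because it sits in the data directory. That is an inference from the traceback, not something confirmed by the clust source. The sketch below reproduces the same kind of pandas error using two rows of the replicates file posted earlier in this thread. The `skiprows=1` and `usecols=[1]` arguments only mimic the `skiprows=1, skipcolumns=1` values visible in the traceback, and the helper `try_read_as_floats` is hypothetical, not clust's own reader.

```python
import io

import pandas as pd

# Two rows copied from the replicates file in this thread (tab-separated).
replicates_text = (
    "clust_bio_comb.txt\tA\t1.x,1.y\n"
    "clust_bio_comb.txt\tB\t2.x,2.y\n"
)


def try_read_as_floats(text):
    """Return the error message if the text cannot be parsed as floats, else None."""
    try:
        pd.read_csv(
            io.StringIO(text),
            sep="\t",
            header=None,
            skiprows=1,    # skip the first row, as in the traceback
            usecols=[1],   # look only at the second column
            dtype=float,   # force numeric parsing, as clust does for data files
        )
    except ValueError as err:
        return str(err)
    return None


error_message = try_read_as_floats(replicates_text)
numeric_ok = try_read_as_floats("gene\t1.0\t2.0\ng1\t3.5\t4.5\n")

print(error_message)
print(numeric_ok)
```

With the first row skipped, the first non-numeric value the parser meets is the condition name on the second row, which matches the 'B' reported in the traceback.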

WORKING VERSION:

pma37@dhcp-10-248-206-95 Clust_Example % clust /Users/pma37/Downloads/Clust_Example/Data -r Replicates.txt

/===========================================================================\
|                                   Clust                                   |
|    (Optimised consensus clustering of multiple heterogenous datasets)     |
|           Python package version 1.18.0 (2022) Basel Abu-Jamous           |
+---------------------------------------------------------------------------+
| Analysis started at: Thursday 07 March 2024 (10:52:46)                    |
| 1. Reading dataset(s)                                                     |
| 2. Data pre-processing                                                    |
|  - Automatic normalisation mode (default in v1.7.0+).                     |
|    Clust automatically normalises your dataset(s).                        |
|    To switch it off, use the `-n 0` option (not recommended).             |
|    Check https://github.com/BaselAbujamous/clust for details.             |
|  - Flat expression profiles filtered out (default in v1.7.0+).            |
|    To switch it off, use the --no-fil-flat option (not recommended).      |
|    Check https://github.com/BaselAbujamous/clust for details.             |
| 3. Seed clusters production (the Bi-CoPaM method)                         |
| 10%                                                                       |
| 20%                                                                       |
| 30%                                                                       |
| 40%                                                                       |
| 50%                                                                       |
| 60%                                                                       |
| 70%                                                                       |
| 80%                                                                       |
| 90%                                                                       |
| 100%                                                                      |
| 4. Cluster evaluation and selection (the M-N scatter plots technique)     |
| 10%                                                                       |
| 20%                                                                       |
| 30%                                                                       |
| 40%                                                                       |
| 50%                                                                       |
| 60%                                                                       |
| 70%                                                                       |
| 80%                                                                       |
| 90%                                                                       |
| 100%                                                                      |
| 5. Cluster optimisation and completion                                    |
| 6. Saving results in                                                      |
| /Users/pma37/Downloads/Clust_Example/Results_07_Mar_24_2                  |
| Eigengene computation is currently not supported for multiple datasets.   |
+---------------------------------------------------------------------------+
| Analysis finished at: Thursday 07 March 2024 (10:53:00)                   |
| Total time consumed: 0 hours, 0 minutes, and 13 seconds                   |
|                                                                           |
\===========================================================================/

/===========================================================================\
|                              RESULTS SUMMARY                              |
+---------------------------------------------------------------------------+
| Clust received 3 datasets with 9332 unique genes. After filtering, 9329   |
| genes made it to the clustering step. Clust generated 2 clusters of       |
| genes, which in total include 1601 genes. The smallest cluster includes   |
| 680 genes, the largest cluster includes 921 genes, and the average        |
| cluster size is 800 genes.                                                |
+---------------------------------------------------------------------------+
|                                 Citation                                  |
|                                 ~~~~~~~~                                  |
| When publishing work that uses Clust, please include this citation:       |
| Basel Abu-Jamous and Steven Kelly (2018) Clust: automatic extraction of   |
| optimal co-expressed gene clusters from gene expression data. Genome      |
| Biology 19:172; doi: https://doi.org/10.1186/s13059-018-1536-8.           |
+---------------------------------------------------------------------------+
| For enquiries contact:                                                    |
| Dr. Basel Abu-Jamous                                                      |
| baselabujamous@gmail.com                                                  |
\===========================================================================/
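
The difference between the failed and working runs above can be turned into a small pre-flight check. This is a minimal sketch that assumes clust treats every file in the data directory as a dataset, which is inferred from the tracebacks in this thread rather than confirmed. The function name `replicates_file_in_data_dir` is made up for this example.

```python
from pathlib import Path


def replicates_file_in_data_dir(data_dir, replicates_file):
    """Return True if the replicates file sits directly inside the data directory."""
    data_dir = Path(data_dir).resolve()
    replicates_file = Path(replicates_file).resolve()
    return replicates_file.parent == data_dir


# Paths taken from the two runs shown above.
project = Path("/Users/pma37/Downloads/Clust_Example")

# Failed run: the data directory was the folder that also held Replicates.txt.
failed_layout = replicates_file_in_data_dir(project, project / "Replicates.txt")

# Working run: the data files lived in a Data subfolder.
working_layout = replicates_file_in_data_dir(project / "Data", project / "Replicates.txt")

print(failed_layout, working_layout)
```

If the check returns True, moving the data files into their own subfolder and passing that subfolder as the data path, as in the working run, avoids the problem.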