multiple datasets run into issues

Wen-Juan commented 5 years ago

Hi Basel, I am trying to cluster the same time points using either the female dataset or the male dataset. Each dataset runs successfully on its own, but Clust keeps throwing errors when I use one replicates file that includes both files.

My code: clust ~/input/TPM/ -n 101 3 4 -r ~/input/TPM/Ag_all_X0.txt -o ~/output/tpm/Ag_all/

The error messages:

/===========================================================================\
| Clust |
| (Optimised consensus clustering of multiple heterogenous datasets) |
| Python package version 1.8.10 (2018) Basel Abu-Jamous |
+---------------------------------------------------------------------------+
| Analysis started at: Friday 04 January 2019 (21:11:02) |
| 1. Reading dataset(s) |
Traceback (most recent call last):
  File "/anaconda/anaconda/envs/python2/bin/clust", line 11, in <module>
    sys.exit(main())
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/clust/main.py", line 98, in main
    args.cs, args.np, args.optimisation, args.q3s, args.basemethods, args.deterministic)
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/clust/clustpipeline.py", line 84, in clustpipeline
    returnSkipped=True)
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/clust/scripts/io.py", line 46, in readDatasetsFromDirectory
    datafilesread = readDataFromFiles(datafileswithpath, delimiter, float, skiprows, skipcolumns, returnSkipped)
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/clust/scripts/io.py", line 193, in readDataFromFiles
    usecols=range(skipcolumns, ncols), na_filter=True, comments=comm)
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/clust/scripts/io.py", line 228, in pdreadcsv_regexdelim
    delimiter='\t', dtype=dtype, header=-1, skiprows=skiprows, usecols=usecols, na_filter=na_filter, comment=comments).values
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, kwds)
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, self.options)
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1708, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 542, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

Thanks for pointing out solutions.

Best, Wen-Juan

BaselAbujamous commented 5 years ago

Hi Wen-Juan

Thank you for using clust and for your question.

I can see that your replicates file is in the same folder as your datasets, so Clust will treat it as a data file. Please keep nothing in ~/input/TPM/ except the two data files, and put the replicates file somewhere else, for example at ~/input/Ag_all_X0.txt

If this doesn't work, please let me know.

Best wishes Basel
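A quick way to check the layout described above before running Clust is sketched here. This is only an illustration and not part of Clust: the function names `files_in_data_dir` and `replicates_inside_data_dir` are hypothetical, and the sketch assumes, as the reply says, that any file sitting directly in the data directory may be picked up as a dataset.

```python
from pathlib import Path


def files_in_data_dir(data_dir):
    """List every regular file directly inside data_dir (hypothetical helper).

    Assumption taken from the reply above: any file found here may be
    treated as a dataset, so the list should contain data files only.
    """
    directory = Path(data_dir).expanduser()
    return sorted(p.name for p in directory.iterdir() if p.is_file())


def replicates_inside_data_dir(data_dir, replicates_file):
    """Return True if the replicates file sits directly inside data_dir."""
    directory = Path(data_dir).expanduser().resolve()
    replicates = Path(replicates_file).expanduser().resolve()
    return replicates.parent == directory
```

With the paths from this thread, `replicates_inside_data_dir("~/input/TPM/", "~/input/TPM/Ag_all_X0.txt")` would return True, which is the situation the reply advises against.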

Wen-Juan commented 5 years ago

Hi Basel, Thanks for your answer.

Your suggestion did not work. I had previously tried putting only the two files in a sub-folder within the /TPM directory, and that did not work either.

Sorry, I am not at my computer today and cannot report the error at the moment. Let me know if you need to see the error message and I will send it later.

Many thanks.

Wen-Juan

Wen-Juan commented 5 years ago

Hi Basel,

I had tried that option before, exactly as you suggested, and it did not work. I put the two data files in a subfolder, /TPM/Ag_mod (there are no other files in this folder).

Here is the code: clust ~/input/TPM/Ag_mod/ -n 101 3 4 -r ~/input/TPM/Ag_all_X0.txt -o ~/output/tpm/Ag_all/

The errors I got:

/===========================================================================\
| Clust |
| (Optimised consensus clustering of multiple heterogenous datasets) |
| Python package version 1.8.10 (2018) Basel Abu-Jamous |
+---------------------------------------------------------------------------+
| Analysis started at: Saturday 05 January 2019 (16:33:46) |
| 1. Reading dataset(s) |
Traceback (most recent call last):
  File "/anaconda/anaconda/envs/python2/bin/clust", line 11, in <module>
    sys.exit(main())
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/clust/main.py", line 98, in main
    args.cs, args.np, args.optimisation, args.q3s, args.basemethods, args.deterministic)
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/clust/clustpipeline.py", line 84, in clustpipeline
    returnSkipped=True)
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/clust/scripts/io.py", line 46, in readDatasetsFromDirectory
    datafilesread = readDataFromFiles(datafileswithpath, delimiter, float, skiprows, skipcolumns, returnSkipped)
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/clust/scripts/io.py", line 193, in readDataFromFiles
    usecols=range(skipcolumns, ncols), na_filter=True, comments=comm)
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/clust/scripts/io.py", line 228, in pdreadcsv_regexdelim
    delimiter='\t', dtype=dtype, header=-1, skiprows=skiprows, usecols=usecols, na_filter=na_filter, comment=comments).values
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, kwds)
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, self.options)
  File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1708, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 542, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

Thanks.

Best, Wen-Juan

BaselAbujamous commented 5 years ago

Hi Wen-Juan

Thanks for reporting this.

I guess, from the error that I see, that the first row or few rows of your data file(s) don't seem to be in the right format. Can you share the first few rows of your data file? You can replace any confidential names (e.g. gene names) with any anonymous labels if you like.
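One way to look at the first few rows without opening the whole file is sketched here. It is a minimal illustration rather than anything Clust provides; `preview_rows` is a hypothetical helper, and it simply assumes the file is whitespace-delimited text.

```python
from itertools import islice


def preview_rows(path, n=3):
    """Return (field_count, repr_of_line_start) for the first n lines.

    Hypothetical helper: repr() makes tabs and other invisible characters
    visible as escape sequences, and the field count shows whether the
    header and the data rows have the same number of columns.
    """
    rows = []
    with open(path, "r", newline="") as handle:
        for line in islice(handle, n):
            fields = line.split()  # split on any run of whitespace
            rows.append((len(fields), repr(line[:60])))
    return rows
```

A header whose field count differs from that of the data rows, or a file that yields no rows at all, would be worth a closer look.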

Wen-Juan commented 5 years ago

Hi Basel,

Sure! Ideally here I would like to generate clusters with time series of gene expression for both female and male samples. For example, I would like to see whether in a given cluster of gene expression, there are differences between sexes. Eventually, I want to include several populations (each with 5 time points and each time point with both male and female samples).

In the directory /TPM/Ag_mod/ I have two files: Ag_tpm_female.txt and Ag_tpm_male.txt.

Top 3 lines of Ag_tpm_female.txt:

id TT_FemaleG232 TT_FemaleG234 TT_FemaleG235 TT_FemaleG236 TT_FemaleG238 TT1_Female271 TT1_Female272 TT1_Female273 TT1_Female274 TT1_Female275 TT2_Female311 TT2_Female312 TT2_Female313 TT2_Female314 TT2_Female315 TT3_Female431 TT3_Female432 TT3_Female433 TT3_Female434 TT_Female232 TT_Female233 TT_Female234 TT_Female235 TT_Female236 TT_Female237all TT1_Female271 TT1_Female273 TT1_Female274 TT1_Female275 TT1_Female276 TT2_Female311 TT2_Female312 TT2_Female313 TT2_Female314 TT2_Female315 TT2_Female316 TT3_Female431 TT3_Female432 TT3_Female433all TT3_Female434 TT4_Female462 TT_Female231 TT_Female232 TT_Female233 TT_Female234 TT_Female235 TT1_Female271 TT1_Female272 TT1_Female274 TT1_Female275 TT1_Female276 TT2_Female311 TT2_Female312 TT2_Female314 TT2_Female315 TT2_Female319 TT3_Female431 TT3_Female432 TT3_Female433 TT3_Female435 TT3_Female436 TT3_Female437 TT4_Female462 TT4_Female463 TT4_Female464 TT4_Female465 TT4_Female466 TT_Female231 TT_Female232 TT_Female233 TT_Female234 TT_Female236 TT1_Female272 TT1_Female273 TT1_Female274 TT1_Female275 TT1_Female276 TT2_Female311 TT2_Female312 TT2_Female313 TT2_Female314 TT2_Female315 TT3_Female431 TT3_Female434 TT4_Female464
TRINITY_DN100_c0_g2_i1 1.08259 0 0 0 0.547524 1.24369 0.91497 1.43161 0.739718 2.84145 0.586869 1.83045 1.96698 2.37758 3.6263 4.27313 1.60983 5.75609 5.94444 8.4615 4.20354 3.23867 5.88534 15.5491 4.73365 6.79381 11.488 10.1209 8.35174 3.00226 9.05169 9.96282 8.91953 11.9191 13.986 11.0725 11.1143 6.15106 11.4093 9.5178 2.56684 0.35178 0 0 2.6477 1.37995 2.03958 3.20233 7.84325 2.99876 0.538925 8.78078 5.88428 2.93337 2.56948 5.60748 7.94872 11.0992 8.73068 5.80076 6.1802 6.37798 10.9065 5.41789 8.2359 5.959 8.35464 7.30541 0.805146 0.779957 0.851122 1.80337 3.25727 2.77207 2.20637 7.35828 1.30126 2.68661 1.33293 1.82267 2.79375 3.73391 2.27035 4.88231 3.06928
TRINITY_DN100010_c0_g3_i1 0.197994 0.322517 0 0.135897 0.434116 0.352975 0.402469 0.322835 0.469841 0.353083 1.20521 1.06715 0.360563 0.923507 0 0 0.488354 0 0 0.868854 0.387772 1.44637 0.755963 0.969054 0.332735 0.848172 0.490811 2.03651 0.941856 0.740636 0.431545 1.30117 0.720113 0.74403 1.55787 1.64629 0.12553 0.416696 0.546588 0 0.345468 0.398135 0.137103 0.570803 0.123617 0.127285 0 0 0.366113 0 0 0.554507 0.301362 0.146943 0.856108 0.78581 0 0.434423 0.13842 0.102971 0 0 0.137515 0.139651 0 0.14632 0.124779 0.182378 0.266746 0.130605 0.257209 0.172001 1.03458 0 0 0.55922 1.66813 0.362139 0.709337 0.211435 0.151319 0 0.17487 0.294148 0.309522

Top 3 lines of Ag_tpm_male.txt:

id Time_Male232 Time_Male234 Time_Male235 Time_Male236 Time_Male238 Time1_Male271 Time1_Male272 Time1_Male273 Time1_Male274 Time1_Male275 Time2_Male311 Time2_Male312 Time2_Male313 Time2_Male314 Time2_Male315 Time3_Male41 Time3_Male42 Time3_Male43 Time3_Male44 Time_Male232 Time_Male233 Time_Male234 Time_Male235 Time_Male236 Time_Male237all Time1_Male271 Time1_Male273 Time1_Male274 Time1_Male275 Time1_Male276 Time2_Male311 Time2_Male312 Time2_Male313 Time2_Male314 Time2_Male315 Time2_Male316 Time3_Male431 Time3_Male432 Time3_Male433all Time3_Male434 Time4_Male461 Time4_Male463 Time4_Male464 Time_Male231 Time_Male232 Time_Male233 Time_Male234 Time_Male235 Time1_Male271 Time1_Male272 Time1_Male274 Time1_Male275 Time1_Male276 Time2_Male311 Time2_Male312 Time2_Male314 Time2_Male315 Time2_Male319 Time3_Male431 Time3_Male432 Time3_Male433 Time3_Male435 Time3_Male436 Time3_Male437 Time_Male231 Time_Male232 Time_Male233 Time_Male234 Time_Male236 Time1_Male272 Time1_Male273 Time1_Male274 Time1_Male275 Time1_Male276 Time2_Male311 Time2_Male312 Time2_Male313 Time2_Male314 Time2_Male315 Time3_Male431 Time3_Male434 Time4_Male461 Time4_Male462
TRINITY_DN100_0_g2_i1 1.08259 0 0 0 0.547524 1.24369 0.91497 1.43161 0.739718 2.84145 0.586869 1.83045 1.96698 2.37758 3.6263 4.27313 1.60983 5.75609 5.94444 8.4615 4.20354 3.23867 5.88534 15.5491 4.73365 6.79381 11.488 10.1209 8.35174 3.00226 9.05169 9.96282 8.91953 11.9191 13.986 11.0725 11.1143 6.15106 11.4093 9.5178 12.1547 11.1879 7.41823 0.35178 0 0 2.6477 1.37995 2.03958 3.20233 7.84325 2.99876 0.538925 8.78078 5.88428 2.93337 2.56948 5.60748 7.94872 11.0992 8.73068 5.80076 6.1802 6.37798 7.30541 0.805146 0.779957 0.851122 1.80337 3.25727 2.77207 2.20637 7.35828 1.30126 2.68661 1.33293 1.82267 2.79375 3.73391 2.27035 4.88231 5.87353 4.43214
TRINITY_DN100010_0_g3_i1 0.197994 0.322517 0 0.135897 0.434116 0.352975 0.402469 0.322835 0.469841 0.353083 1.20521 1.06715 0.360563 0.923507 0 0 0.488354 0 0 0.868854 0.387772 1.44637 0.755963 0.969054 0.332735 0.848172 0.490811 2.03651 0.941856 0.740636 0.431545 1.30117 0.720113 0.74403 1.55787 1.64629 0.12553 0.416696 0.546588 0 0 0.260565 0 0.398135 0.137103 0.570803 0.123617 0.127285 0 0 0.366113 0 0 0.554507 0.301362 0.146943 0.856108 0.78581 0 0.434423 0.13842 0.102971 0 0 0.182378 0.266746 0.130605 0.257209 0.172001 1.03458 0 0 0.55922 1.66813 0.362139 0.709337 0.211435 0.151319 0 0.17487 0.294148 0.277197 0.121345

Then the replicates file is Ag_all_X0.txt, and the file looks like this:

Ag_tpm_male.txt Time Time_Male231 Time_Male231 Time_Male232 Time_Male232 Time_Male232 Time_Male232 Time_Male233 Time_Male233 Time_Male233 Time_Male234 Time_Male234 Time_Male234 Time_Male234 Time_Male235 Time_Male235 Time_Male235 Time_Male236 Time_Male236 Time_Male236 Time_Male237all Time_Male238
Ag_tpm_male.txt Time1 Time1_Male271 Time1_Male271 Time1_Male271 Time1_Male272 Time1_Male272 Time1_Male272 Time1_Male273 Time1_Male273 Time1_Male273 Time1_Male274 Time1_Male274 Time1_Male274 Time1_Male274 Time1_Male275 Time1_Male275 Time1_Male275 Time1_Male275 Time1_Male276 Time1_Male276 Time1_Male276
... ... ...
Ag_tpm_female.txt TT TT_Female231 TT_Female231 TT_Female232 TT_Female232 TT_Female232 TT_Female233 TT_Female233 TT_Female233 TT_Female234 TT_Female234 TT_Female234 TT_Female235 TT_Female235 TT_Female236 TT_Female236 TT_Female237all TT_FemaleG232 TT_FemaleG234 TT_FemaleG235 TT_FemaleG236 TT_FemaleG238
Ag_tpm_female.txt TT1 TT1_Female271 TT1_Female271 TT1_Female271 TT1_Female272 TT1_Female272 TT1_Female272 TT1_Female273 TT1_Female273 TT1_Female273 TT1_Female274 TT1_Female274 TT1_Female274 TT1_Female274 TT1_Female275 TT1_Female275 TT1_Female275 TT1_Female275 TT1_Female276 TT1_Female276 TT1_Female276
... ... ...

If you spot anything in the format, please let me know.

Many thanks.

Best, Wen-Juan

BaselAbujamous commented 5 years ago

Hi Wen-Juan

An interesting research problem and clust is indeed the tool to tackle it. I am keen to make sure that it works for you!

I have noticed that the column names are sometimes split by spaces and sometimes by tabs. Also, some column names are redundant, which causes issues. If sorting out these bits doesn't solve the problem, there might be other issues in the format of the files, and I would be happy to look into the problem fully if you would like to share the data files with me confidentially (you can email me directly at basel.abujamous@plants.ox.ac.uk).

Best wishes Basel
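The two header checks mentioned above (mixed space and tab separators, repeated column names) can be sketched as follows. This is an illustrative helper only; `header_report` is a hypothetical name, and the function just reports what it finds without judging whether repeated names are a problem for a given tool.

```python
from collections import Counter


def header_report(header_line):
    """Report separators and repeated names in one header line.

    Hypothetical helper: a header that contains both tabs and spaces uses
    mixed separators, and any name listed under "duplicates" appears more
    than once.
    """
    line = header_line.rstrip("\r\n")
    names = line.split()  # split on any run of whitespace
    counts = Counter(names)
    return {
        "has_tabs": "\t" in line,
        "has_spaces": " " in line,
        "duplicates": sorted(name for name, n in counts.items() if n > 1),
    }
```

For the header line of a data file, `has_tabs` and `has_spaces` both being True would point to the mixed-separator case.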

Wen-Juan commented 5 years ago

Hi Basel,

Thanks for your suggestions.

However, I am not sure about the issues you found.

When I ran the two files separately (exactly the same files), they both worked, so I am not sure about the comment that the column names are split by both spaces and tabs. I double-checked the files and they seem fine.
Redundant column names are allowed: the manual says that samples with the same name are averaged automatically. This also proved fine in practice, as each file ran successfully on its own with the repeated names.

I will double check all these points you mentioned. If it still does not work, I will send you the files to your email address.

Many thanks.

cheers, Wen-Juan

Wen-Juan commented 5 years ago

Hi Basel,

As you suggested, I have sent you an email with my datasets. Have you received them yet?

Many thanks.

cheers, Wen-Juan

BaselAbujamous commented 5 years ago

Hi Wen-Juan

I received your email and looked into the datasets. Clust runs successfully on them (sending you the results confidentially by email). However:

1. The error is because the datasets folder includes a system file called .DS_Store which Clust tries to read as a third dataset. This file should be removed from the folder so that Clust runs successfully. Maybe in future versions I will try to let Clust detect such system files to avoid reading them as data files.
2. The gene names in the two datasets' files are not identical. The female dataset has gene names in the format ...c0... while the male dataset has gene names in the format ...0...; so I replaced all c0 strings in the female dataset with 0 strings before running Clust. The same applies for c1, c2, etc.
3. I also included the parameter -d 2 in the running command to include genes that appear in both datasets in the clustering step.

I hope this helps. Basel
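The first two points above can be checked with a short script before running Clust. This is a rough sketch under stated assumptions, not the procedure actually used in the reply: the helper names are hypothetical, and the regular expression assumes the only difference between the two naming schemes is a `_cN_` component in the female file versus `_N_` in the male file, as in the example IDs quoted earlier in the thread.

```python
import re
from pathlib import Path

# Assumption: female IDs look like TRINITY_DN100_c0_g2_i1 and male IDs like
# TRINITY_DN100_0_g2_i1, so only the "_c<digits>_" component differs.
_C_COMPONENT = re.compile(r"_c(\d+)_")


def hidden_files(data_dir):
    """List hidden files (such as .DS_Store) directly inside data_dir."""
    directory = Path(data_dir).expanduser()
    return sorted(
        p.name for p in directory.iterdir()
        if p.is_file() and p.name.startswith(".")
    )


def normalise_gene_id(gene_id):
    """Rewrite the first "_cN_" component to "_N_" (hypothetical helper)."""
    return _C_COMPONENT.sub(r"_\1_", gene_id, count=1)


def shared_genes(female_ids, male_ids):
    """Return the gene IDs common to both datasets after normalisation."""
    return {normalise_gene_id(g) for g in female_ids} & set(male_ids)
```

An empty result from `shared_genes` on the raw IDs of the two files would indicate that the names still do not match.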

Wen-Juan commented 5 years ago

Hi Basel,

Many thanks for pointing out the errors.

I look forward to receiving these files and will try running Clust again following your suggestions.

Thanks again.

Best, Wen-Juan

Wen-Juan commented 5 years ago

Hi Basel,

Clust worked well for me now. I think we can close this issue.

Many thanks.

Best, Wen-Juan

BaselAbujamous / clust

multiple datasets run into issues #23