Kinggerm / GetOrganelle

Organelle Genome Assembly Toolkit (Chloroplast/Mitochondrial/ITS)
GNU General Public License v3.0
267 stars · 51 forks

Speedup suggestion during initial FASTQ decompression #53

Open edgardomortiz opened 4 years ago

edgardomortiz commented 4 years ago

Thanks for developing GetOrganelle, it seems very complete and thorough. I am trying it for species of Ericaceae, hopefully it will handle the small repeats better than other software I tried in the past (any tips to improve these assemblies are welcome).

However, during my initial tests on a Mac I noticed that it takes an excessive amount of time just to decompress the FASTQ files at the beginning (a file of ~5 GB is taking more than 1.5 hours). My guess is that the combination of Mac's head + gunzip is the reason; I have found that many of macOS's standard programs are really slow compared to their Linux counterparts. My suggestion would be to use Python's own gzip library to decompress and compress reads more quickly. Alternatively, the BBTools suite (https://jgi.doe.gov/data-and-tools/bbtools/) also handles FASTQ files very quickly, and random subsampling could be performed with its reformat.sh program.
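A minimal sketch of the first suggestion (the helper name is illustrative, not GetOrganelle's actual API): stream-decompress with Python's built-in gzip module instead of shelling out to the platform's gunzip:

```python
import gzip
import shutil

def decompress_fastq_gz(gz_path, out_path, chunk_size=1 << 20):
    """Stream a gzipped FASTQ to a plain file using Python's gzip module,
    avoiding a subprocess call to the platform's gunzip.
    Illustrative helper, not GetOrganelle's actual code."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # copyfileobj reads/writes in chunks, so memory use stays flat
        # even for multi-GB FASTQ files.
        shutil.copyfileobj(src, dst, chunk_size)
```

Because the decompression happens in-process, it is unaffected by slow macOS coreutils, though a compiled tool like pigz may still be faster on large files.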

Edgardo

Kinggerm commented 4 years ago

Hi Edgardo,

Thanks for using GetOrganelle and for the kind suggestion. I will carefully consider and test it.

As for Ericaceae, it will still be difficult with only Illumina data. I am developing another tool/function that uses long sequencing reads for this. Hopefully it will be helpful if you have that kind of data.

Best, Jianjun

Sh1ne111 commented 4 years ago

Hi, I ran GetOrganelle and encountered the following error:

GetOrganelle v1.7.1

get_organelle_from_reads.py assembles organelle genomes from genome skimming data. Find updates in https://github.com/Kinggerm/GetOrganelle and see README.md for more information.

Python 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0]
PYTHON LIBS: GetOrganelleLib 1.7.1; numpy 1.19.1; sympy 1.6.2; scipy 1.3.0; psutil 5.4.7
DEPENDENCIES: Bowtie2 /public/home/aaa/anaconda3/bin/bowtie2-align-s; SPAdes 3.13.0; Blast 2.9.0
LABEL DB: embplant_mt customized; embplant_pt customized
WORKING DIR: /public/home/aaa/project/01_tea/DASZ_mt/assemble
/public/home/aaa/anaconda3/bin/get_organelle_from_reads.py -s tea.mt.fasta -1 DASZ.R1.fastq.gz -2 DASZ.R2.fastq.gz -o DASZ_mt -R 50 -k 55,85,115,125,135 -F embplant_mt -t 6

2020-09-29 12:59:33,138 - INFO: Pre-reading fastq ...
2020-09-29 12:59:33,139 - INFO: Estimating reads to use ... (to use all reads, set '--reduce-reads-for-coverage inf')
2020-09-29 12:59:33,365 - INFO: Tasting 100000+100000 reads ...
2020-09-29 12:59:34,205 - ERROR:
Traceback (most recent call last):
  File "/public/home/fafu_chenshuai/anaconda3/bin/get_organelle_from_reads.py", line 3750, in main
    random_seed=options.random_seed, verbose_log=options.verbose_log, log_handler=log_handler)
  File "/public/home/fafu_chenshuai/anaconda3/bin/get_organelle_from_reads.py", line 1014, in estimate_maximum_n_reads_using_mapping
    which_bowtie2=which_bowtie2)
  File "/public/home/fafu_chenshuai/anaconda3/lib/python3.7/site-packages/GetOrganelleLib/pipe_control_func.py", line 373, in map_with_bowtie2
    raise Exception("")
Exception

Total cost 26.55 s
Please email jinjianjun@mail.kib.ac.cn or jianjun.jin@columbia.edu if you find bugs!

Kinggerm commented 4 years ago


@Sh1ne111 I'm sorry, but your question is unrelated to this issue. Please open a separate issue; I will have to delete your question here soon.

harish0201 commented 3 years ago

@Kinggerm Does GetOrganelle pull in pigz as well when installing via conda? If so, that would be a lot better, as pigz is foolishly fast!

Kinggerm commented 3 years ago

@harish0201 That's true, but currently pigz is not required for non-conda installations. Incorporating it further needs more testing in different environments; it is on my plan, though. Thanks for the suggestion.
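A minimal sketch of how such an optional dependency could be handled (hypothetical helper, not GetOrganelle's actual code): prefer pigz when it is on PATH, and fall back to Python's gzip module so non-conda installations keep working:

```python
import gzip
import shutil
import subprocess

def open_fastq_gz(path):
    """Open a gzipped FASTQ for reading, preferring pigz if installed.
    Hypothetical helper, not part of GetOrganelle's code base.
    pigz mainly parallelizes compression; its decompression is largely
    single-threaded, but typically still faster than stock gunzip."""
    pigz = shutil.which("pigz")
    if pigz:
        # Stream decompressed bytes from a pigz subprocess ("-dc" =
        # decompress to stdout).
        proc = subprocess.Popen([pigz, "-dc", path], stdout=subprocess.PIPE)
        return proc.stdout
    # Portable fallback: Python's built-in gzip module, no external
    # binary required.
    return gzip.open(path, "rb")
```

Keeping the gzip fallback means pigz can stay an optional accelerator rather than a hard dependency.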