czbiohub-sf / orpheum

Orpheum (previously called, and published as, sencha) is a Python package for directly translating RNA-seq reads into coding protein sequence.
MIT License

add multiprocessing to sencha translate #84

Closed pranathivemuri closed 4 years ago

pranathivemuri commented 4 years ago

Many thanks for contributing to czbiohub/sencha!

Please fill in the appropriate checklist below (delete whatever is not relevant). These are the most common things requested on pull requests (PRs).

PR checklist

pranathivemuri commented 4 years ago
(sencha) ➜  sencha git:(master) ✗ sencha translate tests/data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22.fq.gz tests/data/index/Homo_sapiens.GRCh38.pep.subset.fa.gz --save-peptide-bloom-filter --verbose

# Without parallel processing
22it [00:00, 5273.44it/s]
2662it [00:04, 533.92it/s]
time taken is 5.2929160594940186 seconds
(sencha) ➜  sencha git:(master) ✗ sencha translate tests/data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22.fq.gz tests/data/index/Homo_sapiens.GRCh38.pep.subset.fa.gz --save-peptide-bloom-filter --verbose

22it [00:00, 6247.86it/s]
2662it [00:05, 525.78it/s]
time taken is 5.3656840324401855 seconds
(sencha) ➜  sencha git:(master) ✗ sencha translate tests/data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22.fq.gz tests/data/index/Homo_sapiens.GRCh38.pep.subset.fa.gz --save-peptide-bloom-filter --verbose

22it [00:00, 6183.39it/s]
2662it [00:04, 538.34it/s]
time taken is 5.239797830581665 seconds
(sencha) ➜  sencha git:(master) ✗ sencha translate tests/data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22.fq.gz tests/data/index/Homo_sapiens.GRCh38.pep.subset.fa.gz --save-peptide-bloom-filter --verbose

22it [00:00, 5382.96it/s]
2662it [00:04, 540.03it/s]
time taken is 5.250756740570068 seconds
(sencha) ➜  sencha git:(master) ✗ sencha translate tests/data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22.fq.gz tests/data/index/Homo_sapiens.GRCh38.pep.subset.fa.gz --save-peptide-bloom-filter --verbose

22it [00:00, 5457.78it/s]
2662it [00:04, 534.72it/s]
time taken is 5.285132884979248 seconds

# Parallel processing
(sencha) ➜  sencha git:(pranathi-translate) ✗ sencha translate tests/data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22.fq.gz tests/data/index/Homo_sapiens.GRCh38.pep.subset.fa.gz --save-peptide-bloom-filter --verbose

22it [00:00, 5962.05it/s]
time taken to translate is 3.96411 seconds
(sencha) ➜  sencha git:(pranathi-translate) ✗ sencha translate tests/data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22.fq.gz tests/data/index/Homo_sapiens.GRCh38.pep.subset.fa.gz --save-peptide-bloom-filter --verbose

22it [00:00, 5541.36it/s]
time taken to translate is 3.90832 seconds
(sencha) ➜  sencha git:(pranathi-translate) ✗ sencha translate tests/data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22.fq.gz tests/data/index/Homo_sapiens.GRCh38.pep.subset.fa.gz --save-peptide-bloom-filter --verbose

22it [00:00, 6440.17it/s]
time taken to translate is 3.90263 seconds
(sencha) ➜  sencha git:(pranathi-translate) ✗ sencha translate tests/data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22.fq.gz tests/data/index/Homo_sapiens.GRCh38.pep.subset.fa.gz --save-peptide-bloom-filter --verbose

22it [00:00, 6106.46it/s]
time taken to translate is 3.87282 seconds
(sencha) ➜  sencha git:(pranathi-translate) ✗ sencha translate tests/data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22.fq.gz tests/data/index/Homo_sapiens.GRCh38.pep.subset.fa.gz --save-peptide-bloom-filter --verbose

22it [00:00, 6410.19it/s]
time taken to translate is 3.92853 seconds
pranathivemuri commented 4 years ago

Roughly a 26% decrease in time (a mean of about 5.29 s without multiprocessing vs. about 3.92 s with it, over the five runs each above), which may help. I am not sure whether this PR can also fix the recent memory error that occurs while writing the FASTA sequence to the file, though.
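
As a quick check of the numbers, the timings pasted above can be averaged and compared. This is only a sketch: the values are copied from the logs in this thread, and it assumes that "time taken" and "time taken to translate" measure comparable spans, which the logs do not confirm.

```python
from statistics import mean

# Wall-clock times (seconds) copied from the runs on master (no multiprocessing).
serial = [
    5.2929160594940186,
    5.3656840324401855,
    5.239797830581665,
    5.250756740570068,
    5.285132884979248,
]
# Times (seconds) copied from the runs on the pranathi-translate branch.
parallel = [3.96411, 3.90832, 3.90263, 3.87282, 3.92853]

serial_mean = mean(serial)
parallel_mean = mean(parallel)

# Fractional reduction in run time, and the equivalent speedup factor.
decrease = 1 - parallel_mean / serial_mean
speedup = serial_mean / parallel_mean

print(f"mean without multiprocessing: {serial_mean:.2f} s")
print(f"mean with multiprocessing:    {parallel_mean:.2f} s")
print(f"decrease in time: {decrease:.1%}, speedup: {speedup:.2f}x")
```

Note that the test input is a 22-read subsample, so these figures say little about how the change scales to full-size files.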

olgabot commented 4 years ago

Wow this is working!?!?? Amazing work!

pranathivemuri commented 4 years ago

Yes! I just had to declare the node graph as a global variable instead of keeping it on the class, so that it doesn't get serialized as an attribute of the class. I tried it on the human makefile, but unfortunately I'm hitting errors with sambamba dedup timing out. I'm going to rerun it today, and will probably try one BAM file at a time if that doesn't work.
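
For anyone curious why the global helps, here is a minimal sketch of the pickling difference. It is not the code from this PR: `BIG_FILTER`, `TranslatorWithAttribute`, and `lookup_with_global` are hypothetical names, and a plain `bytes` object stands in for the node graph. `multiprocessing` pickles the callable it sends to worker processes, so a bound method drags its whole instance along, while a module-level function is pickled by name only.

```python
import pickle

# Hypothetical stand-in for a large node graph / bloom filter (~1 MB).
BIG_FILTER = bytes(1_000_000)


class TranslatorWithAttribute:
    """Keeps the filter as an instance attribute (illustrative only)."""

    def __init__(self, peptide_filter):
        self.peptide_filter = peptide_filter

    def lookup(self, index):
        return self.peptide_filter[index]


def lookup_with_global(index):
    """Module-level worker that reads the filter from a global instead."""
    return BIG_FILTER[index]


translator = TranslatorWithAttribute(BIG_FILTER)

# Pickling a bound method serializes the instance, including the filter.
bound_method_size = len(pickle.dumps(translator.lookup))
# Pickling a module-level function stores only a reference to its name.
global_function_size = len(pickle.dumps(lookup_with_global))

print(f"bound method pickle:    {bound_method_size} bytes")
print(f"global function pickle: {global_function_size} bytes")
```

With the fork start method, workers inherit the global from the parent process; with spawn, each worker rebuilds it when the module is re-imported, so the cost moves rather than disappears.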

pranathivemuri commented 4 years ago

@olgabot let me know if you want to merge now or wait for one of the big pipeline runs to finish on this branch.

olgabot commented 4 years ago

This is awesome! Let's make sure at least one of the pipelines finishes successfully before merging it in, in case there are some edge cases we run into with really big files.

phoenixAja commented 4 years ago

@pranathivemuri @olga mentioned that you might have a docker container for nf-predictorthologs with this new change added? I would be happy to try it out on the bat data!

pranathivemuri commented 4 years ago

@lekhakaranam @phoenixAja here are the multiprocessing-related changes to the Dockerfile and main.nf: https://github.com/czbiohub/nf-predictorthologs/pull/78/files (note that both must be changed). The container is on ndnd.