gbouras13 / plassembler

Program to quickly and accurately assemble plasmids in hybrid and long-only sequenced bacterial isolates
MIT License

Optimize sam_to_fastq.py performance on large datasets #30

Closed: fanvanf closed this 10 months ago

fanvanf commented 11 months ago

This PR aims to improve the runtime performance of sam_to_fastq.py on large datasets (see #29). Key changes include:

Remove unused output files chromosome_mapped_long.fastq and multimap_plasmid_chromosome_long.fastq to avoid unnecessary I/O

Replace nested loops in extract_bin_long_fastqs with samtools/awk filtering to reduce processing time
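To illustrate why a single filtering pass is cheaper than nested loops, here is a minimal pure-Python sketch of the same idea. It is not the code from this PR, which delegates the filtering to samtools and awk. The function name `iter_fastq_for_refs` and the contig name `plasmid_1` are hypothetical, and the choice to skip unmapped, secondary and supplementary alignments is an assumption made for the example.

```python
from typing import Iterable, Iterator, Set

# Translation table used to complement a DNA sequence.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

# SAM FLAG bits: 0x4 unmapped, 0x100 secondary, 0x800 supplementary.
_SKIP_MASK = 0x4 | 0x100 | 0x800
_REVERSE_BIT = 0x10


def iter_fastq_for_refs(sam_lines: Iterable[str], wanted_refs: Set[str]) -> Iterator[str]:
    """Yield FASTQ records for primary alignments to any contig in wanted_refs.

    Hypothetical helper: each SAM line is inspected once and contig
    membership is a set lookup, so the cost is linear in the number of
    alignments rather than reads x contigs.
    """
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # malformed record, ignore
            continue
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        seq, qual = fields[9], fields[10]
        if flag & _SKIP_MASK or rname not in wanted_refs:
            continue
        if flag & _REVERSE_BIT:
            # SAM stores reverse-strand reads reverse-complemented;
            # restore the original read orientation.
            seq = seq.translate(_COMPLEMENT)[::-1]
            qual = qual[::-1]
        yield f"@{qname}\n{seq}\n+\n{qual}\n"


if __name__ == "__main__":
    example = [
        "@SQ\tSN:chromosome\tLN:100",
        "r1\t0\tplasmid_1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t0\tchromosome\t1\t60\t4M\t*\t0\t0\tGGGG\tIIII",
    ]
    for record in iter_fastq_for_refs(example, {"plasmid_1"}):
        print(record, end="")
```

The nested-loop version has to compare every read against every contig name, whereas this version touches each alignment once. A samtools/awk pipeline gets the same effect with the work done in compiled tools.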

Benchmarking on large datasets shows significant speedups:

Vibrio campbellii DS40M4: 204 secs -> 5 secs

Streptomyces clavuligerus DSM 738: 50+ mins -> 2 mins

These optimizations will help scale sam_to_fastq.py to handle large real-world datasets more efficiently. Please review and consider merging these performance improvements.

gbouras13 commented 10 months ago

Hi @fanvanf.

Thanks for this. I have decided to incorporate your code as the default, but I will also give users the option to use the original code, as it provides more functionality (if they want chromosome_mapped_long.fastq or multimap_plasmid_chromosome_long.fastq).

Therefore, I won't merge this PR directly, but I will take the bulk of the code and acknowledge you in its comments.

If you want to be formally acknowledged as a contributor to the repo, please make a cosmetic PR (for example, add yourself to CONTRIBUTING.md) and I will accept that.

George