Closed fanvanf closed 10 months ago
Hi @fanvanf.
Thanks for this. I have decided to incorporate your code as the default, but will also give users the option to use the original code, as it provides more functionality (in case they want chromosome_mapped_long.fastq or multimap_plasmid_chromosome_long.fastq).
Therefore, I won't merge this PR directly, but I will take the bulk of the code and acknowledge you in the code comments.
If you want to be formally acknowledged as a contributor to the repo, please make a cosmetic PR (for example, add yourself to CONTRIBUTING.md) and I will accept that.
George
This PR aims to improve the runtime performance of sam_to_fastq.py when handling large datasets (#29). Key changes include:
Remove unused output files chromosome_mapped_long.fastq and multimap_plasmid_chromosome_long.fastq to avoid unnecessary I/O
Replace nested loops in extract_bin_long_fastqs with samtools/awk filtering to reduce processing time
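To illustrate the second change: the samtools/awk commands used in the PR are not reproduced here, so the following is only a minimal pure-Python sketch of the same idea, namely a single pass over the SAM records with a set lookup on the reference name (column 3) instead of a nested loop over reads and contigs. The function name `sam_to_fastq_records`, the contig names, and the demo records are hypothetical and not taken from sam_to_fastq.py.

```python
def sam_to_fastq_records(sam_lines, wanted_contigs):
    """Yield FASTQ records for primary alignments whose RNAME is in wanted_contigs.

    Hypothetical helper for illustration only; not the code from the PR.
    Simplification: reads aligned to the reverse strand (flag 0x10) are emitted
    as stored in the SAM record, without reverse-complementing them back.
    """
    wanted = set(wanted_contigs)  # set membership is O(1) per read
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM record has 11 mandatory columns
            continue
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        seq, qual = fields[9], fields[10]
        if flag & 0x900:  # skip secondary (0x100) and supplementary (0x800)
            continue
        if rname in wanted:
            yield f"@{qname}\n{seq}\n+\n{qual}\n"


# Made-up demo input: one header line and three alignment records.
demo_sam = [
    "@SQ\tSN:plasmid_1\tLN:5000\n",
    "read1\t0\tplasmid_1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "read2\t0\tchromosome\t1\t60\t4M\t*\t0\t0\tTTTT\tIIII\n",
    "read3\t256\tplasmid_1\t1\t0\t4M\t*\t0\t0\tGGGG\tIIII\n",
]
records = list(sam_to_fastq_records(demo_sam, ["plasmid_1"]))
```

Here only read1 is kept: read2 maps to a different contig, and read3 is a secondary alignment. Each read is examined once, so the cost grows with the number of alignments rather than with the number of alignments times the number of contigs.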
Benchmarking on large datasets shows significant speedups:
Vibrio campbellii DS40M4: 204 secs -> 5 secs
Streptomyces clavuligerus DSM 738: 50+ mins -> 2 mins
These optimizations will help scale sam_to_fastq.py to handle large real-world datasets more efficiently. Please review and consider merging these performance improvements.