Optimize sam_to_fastq.py performance on large datasets

gbouras13 / plassembler

Program to quickly and accurately assemble plasmids in hybrid and long-only sequenced bacterial isolates

MIT License


This PR improves the runtime performance of sam_to_fastq.py on large datasets (#29). Key changes include:

Remove unused output files chromosome_mapped_long.fastq and multimap_plasmid_chromosome_long.fastq to avoid unnecessary I/O

Replace nested loops in extract_bin_long_fastqs with samtools/awk filtering to reduce processing time
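The PR's actual samtools/awk commands are not shown here, so the following is only a minimal pure-Python sketch of the same idea: filter the alignments in a single pass with a constant-time contig lookup instead of nested loops over reads and contig names. The function name `sam_to_fastq_records` and the parameter `wanted_contigs` are hypothetical and do not come from plassembler. The column positions (QNAME, FLAG, RNAME, SEQ, QUAL) and flag bits follow the SAM format.

```python
from typing import Iterable, Iterator, Set

# Translation table used to reverse-complement reads aligned to the reverse strand.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_to_fastq_records(
    sam_lines: Iterable[str], wanted_contigs: Set[str]
) -> Iterator[str]:
    """Yield FASTQ records for primary alignments to any contig in wanted_contigs.

    Hypothetical helper for illustration only; not the function used in the PR.
    """
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM record has 11 mandatory fields
            continue
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        seq, qual = fields[9], fields[10]
        # 0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary
        if flag & 0x904:
            continue
        # Set membership is O(1), so the whole filter is one pass over the file.
        if rname not in wanted_contigs:
            continue
        if flag & 0x10:
            # SAM stores reverse-strand reads reverse-complemented; restore the original read.
            seq = seq.translate(_COMPLEMENT)[::-1]
            qual = qual[::-1]
        yield f"@{qname}\n{seq}\n+\n{qual}\n"
```

Because the function is a generator, a caller can stream records straight to an output file without holding the whole SAM file in memory, which matters for datasets of the size benchmarked below.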

Benchmarking on large datasets shows significant speedups:

Vibrio campbellii DS40M4: 204 secs -> 5 secs

Streptomyces clavuligerus DSM 738: 50+ mins -> 2 mins

These optimizations will help scale sam_to_fastq.py to handle large real-world datasets more efficiently. Please review and consider merging these performance improvements.


Optimize sam_to_fastq.py performance on large datasets #30