bede / hostile

Precise host read removal
MIT License

Automatically generate and cache minimap2 indexes to eliminate redundant indexing overhead #39

Open bede opened 3 months ago

bede commented 3 months ago

When I first implemented long read support using Minimap2, I was unable to demonstrate a significant reduction in execution time from using a prebuilt index versus recreating the index from scratch every time hostile clean is called. A prebuilt index was only marginally quicker and frankly not worth the complexity of managing indexes. However, I recently retested this and observed that running hostile clean on a small long read FASTQ drops from ~45s to ~7s with a precomputed index.

This behaviour should first be characterised and verified on Linux and macOS. Assuming the performance benefit is replicated on both operating systems, invisible (but suitably logged) index caching and reuse should be added unless a good reason not to do so becomes apparent.

This will dramatically reduce execution time when processing many long read samples, where the redundant indexing overhead is painful.
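
A rough sketch of how the caching could look, not a description of how hostile currently works. The function names, the cache directory layout, the cache key (reference path, size, modification time and preset) and the default `map-ont` preset are all assumptions for illustration. The only Minimap2 behaviour relied on is the `-d` option, which writes the index to a file, and `-x`, which selects a preset. The builder is passed in as a parameter so the caching logic can be exercised without Minimap2 installed.

```python
import hashlib
import logging
import subprocess
from pathlib import Path

logger = logging.getLogger(__name__)


def build_with_minimap2(reference: Path, index: Path, preset: str) -> None:
    """Write a minimap2 index for `reference` to `index` using `minimap2 -d`."""
    subprocess.run(
        ["minimap2", "-x", preset, "-d", str(index), str(reference)],
        check=True,
    )


def cached_index_path(reference: Path, cache_dir: Path, preset: str) -> Path:
    """Hypothetical cache key: reference path, size, mtime and preset."""
    stat = reference.stat()
    key = f"{reference.resolve()}|{stat.st_size}|{stat.st_mtime_ns}|{preset}"
    digest = hashlib.sha256(key.encode()).hexdigest()[:16]
    return cache_dir / f"{reference.stem}.{preset}.{digest}.mmi"


def get_or_build_index(reference, cache_dir, preset="map-ont",
                       builder=build_with_minimap2) -> Path:
    """Return a cached index if one exists, otherwise build and cache it."""
    reference, cache_dir = Path(reference), Path(cache_dir)
    index = cached_index_path(reference, cache_dir, preset)
    if index.exists():
        logger.info("Reusing cached minimap2 index %s", index)
        return index
    cache_dir.mkdir(parents=True, exist_ok=True)
    tmp = index.with_name(index.name + ".tmp")
    logger.info("Building minimap2 index %s", index)
    builder(reference, tmp, preset)
    # Rename only after a successful build so a partial index is never reused.
    tmp.replace(index)
    return index
```

The preset is part of the cache key because a Minimap2 index is built with fixed indexing parameters, so an index made for one preset should not be silently reused for another. Including the reference's size and modification time means a changed reference triggers a rebuild rather than reuse of a stale index.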