blachlylab / fade

Fragmentase Artifact Detection and Elimination
MIT License

fade out runs extremely slow on a name sorted bam #26

Closed rhalperin closed 2 years ago

rhalperin commented 2 years ago

Using a trivially sized 0.5G bam for testing purposes, I found that it took ~10hrs to run fade out on the output of samtools sort -n. Watching the process in top, it appears to have low CPU usage and spends a lot of time in the 'D' state. In comparison, running fade out on the same bam without sorting took about 20sec. I am seeing the same behavior running fade on a cloud workstation with Ubuntu 18.04 and fade installed via conda, as well as running in the blachlylab/fad docker image on my Mac.

jblachly commented 2 years ago

Thanks for reporting this.

"D" state is uninterruptible sleep, often in kernel space, typically IO. We will look into this.

charlesgregory commented 2 years ago

Looks like the bottleneck has to do with my use of D's std.algorithm : chunkBy together with dhtslib's SAMReader.all_records range. I have experienced issues with this combination before, though I didn't realize it affected fade. Should have a fix out soon; I just need to replace D's chunkBy algorithm.
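For illustration, the grouping that chunkBy does on a name-sorted stream (collecting consecutive records that share a read name) can also be written by hand as a single forward pass. The following is a minimal Python sketch of that idea, not fade's actual D code or the fix that was shipped; the `Record` tuple and the function name `group_by_read_name` are hypothetical stand-ins for real SAM records and whatever the replacement looks like.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-in for a SAM record: (read name, payload).
Record = Tuple[str, str]


def group_by_read_name(records: Iterable[Record]) -> Iterator[List[Record]]:
    """Yield lists of consecutive records that share a read name.

    Assumes the input is name sorted, so all records for one read
    are adjacent. The input is consumed exactly once, front to back.
    """
    group: List[Record] = []
    for rec in records:
        # A change of read name closes the current group.
        if group and rec[0] != group[0][0]:
            yield group
            group = []
        group.append(rec)
    # Emit the final group, if any records were seen.
    if group:
        yield group


records = [("r1", "a"), ("r1", "b"), ("r2", "c"), ("r3", "d"), ("r3", "e")]
groups = list(group_by_read_name(records))
```

The point of the manual loop is that each record is read once and never revisited, so the cost does not depend on how the underlying reader handles being saved or re-read.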

jblachly commented 2 years ago

Interesting/surprising that it would manifest as uninterruptible sleep, which again I believe is usually IO related.

charlesgregory commented 2 years ago

@rhalperin Can you try v0.5.7? This should be fixed now.

rhalperin commented 2 years ago

That worked, it now runs in 30sec on the 0.5G bam, thanks!