blachlylab / fade

Fragmentase Artifact Detection and Elimination
MIT License

Bam sorting #18

Closed charlesgregory closed 3 years ago

charlesgregory commented 3 years ago

fade can be cumbersome and unintuitive when it comes to sorting.

Ideally, the input bam to fade out is queryname sorted. This allows fade out to eject an artifact mate pair when there is evidence that a read or its mate is an enzymatic fragmentation artifact. The logic is that we cannot trust the mate of a read that contains an artifact, because the whole insert is expected to be artifactual.

If the -c flag is provided, so that artifact reads are only clipped rather than ejected, nothing is done to the mate of an artifact read.

If the bam is not queryname sorted, fade out will only eject the reads individually and not consider them as paired.
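Since the pairing behaviour depends on the sort order, it can help to check what the bam header declares before running fade out. The following is a minimal sketch, not part of fade: it reads SAM header text on stdin and prints the SO value from the @HD line. The function name sort_order_from_header is made up for illustration, and the header only records what the writing tool declared, so a stale tag (for example after a step that reorders reads) is possible.

```shell
# Print the SO (sort order) value from the @HD line of SAM header text on stdin.
# Prints "unknown" when there is no @HD line or the line has no SO tag.
# NOTE: sort_order_from_header is a hypothetical helper, not a fade or samtools command.
sort_order_from_header() {
    awk -F'\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SO:/) { print substr($i, 4); found = 1; exit }
            }
        }
        END { if (!found) print "unknown" }
    '
}

# Example usage (assumes samtools is installed and input.anno.bam exists):
#   order=$(samtools view -H input.anno.bam | sort_order_from_header)
#   if [ "$order" != "queryname" ]; then
#       echo "WARNING: header says '$order'; fade out will treat reads individually" >&2
#   fi
```

The SAM header values for SO are unknown, unsorted, queryname and coordinate, so a check like this only needs a string comparison against queryname.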

Additionally, because fade annotate works in parallel, its output bam will NOT be sorted, even if the input bam is coordinate or queryname sorted.

fade out may also NOT result in a coordinate-sorted bam, even if samtools sort is run after fade annotate. If the -c flag is provided, fade out hard-clips the artifact regions of artifactual reads, which changes their alignment positions. In the end, fade's overall execution in a pipeline may look like:

fade annotate -b input.bam ref.fa > input.anno.bam
samtools sort -n input.anno.bam > input.qns.bam
fade out -b input.qns.bam > input.out.bam
samtools sort input.out.bam > final.bam
samtools index final.bam

It may be a significant amount of work, but we should consider supporting resorting of input and output bam files internally.

# output is either coordinate sorted by fade
# or retains the original sorting
fade annotate -b input.bam ref.fa > input.anno.bam 

# fade internally resorts input based on queryname 
# if -c flag is not used
# output is always resorted to coordinate sorting 
fade out -b input.qns.bam > input.out.bam

Alternatively, we could warn the user when their input bam appears unsorted, and also warn that fade's output is not sorted.

@jblachly thoughts?

jblachly commented 3 years ago

Suggest adding WARNING class logging calls to any operation (fade annotate; fade out -c) that could potentially shuffle (desort) user input
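As a rough illustration of that suggestion, here is a wrapper-level sketch in shell. It is not fade's implementation (the warning would presumably live in fade's own logging), and the function names may_desort and warn_if_desorting are hypothetical. It only encodes the two cases named above: fade annotate, and fade out with -c.

```shell
# Hypothetical check: succeeds (exit 0) when the given fade invocation is one
# of the operations named in this thread as able to desort its input.
# Only a standalone "-c" argument is recognised; combined short flags are not.
may_desort() {
    [ "$#" -gt 0 ] || return 1
    subcommand=$1
    shift
    case "$subcommand" in
        annotate)
            return 0 ;;
        out)
            for arg in "$@"; do
                [ "$arg" = "-c" ] && return 0
            done
            return 1 ;;
        *)
            return 1 ;;
    esac
}

# Print a WARNING to stderr before running a fade command that may desort.
warn_if_desorting() {
    if may_desort "$@"; then
        echo "WARNING: fade $1 may not preserve the sort order of the input bam" >&2
    fi
}

# Example usage:
#   warn_if_desorting annotate -b input.bam ref.fa
#   fade annotate -b input.bam ref.fa > input.anno.bam
```

Keeping the decision in a separate function from the message makes it easy to add further cases later without touching the warning text.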