GregoryFaust / samblaster

samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.
MIT License
225 stars 30 forks source link

Filter out soft/hard clipped reads in a pipe SAM -> SAM ? #35

Open tseemann opened 6 years ago

tseemann commented 6 years ago

I was looking for a fast tool to put into a Unix pipe between bwa and samtools which would remove any alignments with S or H CIGAR clipping, as opposed to extracting said reads.

It doesn't seem samblaster can do this?

GregoryFaust commented 6 years ago

Correct, samblaster does this calculation to determine which clipped reads to put into the unmapped/clipped fastq file, but not to filter the output SAM file. You are already aware that neither samtools nor sambamba have any straight forward filter for this either. As you state in your other post, one could use awk to do this, but if you want a robust way to determine the length on either end (not combined) it is not so easy. I have even seen alignments with CIGAR strings that looks like 2H4S96M which require adding the two clips to get the right answer (samblaster properly handles these in its clipping length calculations).

Clearly this would be an easy feature to add to samblaster as it is already doing the calculation (although it is not directly related to the tasks it is trying to achieve). Can I ask your goal is removing alignments with a "significant" clip? I have another unpublished tools that also calculates starting and ending query offsets from CIGAR strings written in both Python (slow) and C (fast) but they don't work on SAM files. If I know your goal, it would help me decide what if anything I would want to publish vs. just give you on the side.

tseemann commented 6 years ago

I want to remove them from the alignment if the clips exceed some length, just like samblaster does. This is essentially a hack to get around the fact that bwa and minimap[2] still don't support global alignment. I could use bowtie2 --end-to-end but reasons.

The awk solution is simple (in the simple cases!) and I can write my own version in C (not C++), but if you already have something that would be good.

I see it being in a pipe as follows:

[aligner REF R1 R2 -> SAM]-> samblaster2 -> sort-n -> fixmate -> sort-p -> markdup -> BAM

tseemann commented 6 years ago

I now realise it is slightly more complicated. I need to allow soft-clipped reads through the filter when the hit the end of a contig, as we get low coverage there and we expect it to be clipped anyway.

I now have a purePerl script (only core modules) to filter H/S clips to a threshold --max. I;ve added a pass-through check for the contig-end cases. I will use the ref.fasta.fai index for the lengths.

See https://github.com/tseemann/samclip

GregoryFaust commented 6 years ago

Ok. I will still consider the general issue of filtering clipped reads from the output file, so will leave this issue open for now even though it seems you have solved your specific issue.

tseemann commented 6 years ago

Yes it seems to be working well. Thanks for answering my query. I have an -v option which outputs the reads that would be removed instead. This can be used to approximately do what samblaster does. eg. samclip -v < samtools.sam | samtools fastq > clipped.fq