Open tseemann opened 6 years ago
Correct, samblaster does this calculation to determine which clipped reads to put into the unmapped/clipped fastq file, but not to filter the output SAM file. You are already aware that neither samtools nor sambamba have any straight forward filter for this either. As you state in your other post, one could use awk to do this, but if you want a robust way to determine the length on either end (not combined) it is not so easy. I have even seen alignments with CIGAR strings that looks like 2H4S96M which require adding the two clips to get the right answer (samblaster properly handles these in its clipping length calculations).
Clearly this would be an easy feature to add to samblaster as it is already doing the calculation (although it is not directly related to the tasks it is trying to achieve). Can I ask your goal is removing alignments with a "significant" clip? I have another unpublished tools that also calculates starting and ending query offsets from CIGAR strings written in both Python (slow) and C (fast) but they don't work on SAM files. If I know your goal, it would help me decide what if anything I would want to publish vs. just give you on the side.
I want to remove them from the alignment if the clips exceed some length, just like samblaster does. This is essentially a hack to get around the fact that bwa
and minimap[2]
still don't support global alignment. I could use bowtie2 --end-to-end
but reasons.
The awk solution is simple (in the simple cases!) and I can write my own version in C (not C++), but if you already have something that would be good.
I see it being in a pipe as follows:
[aligner REF R1 R2 -> SAM]-> samblaster2 -> sort-n -> fixmate -> sort-p -> markdup -> BAM
I now realise it is slightly more complicated. I need to allow soft-clipped reads through the filter when the hit the end of a contig, as we get low coverage there and we expect it to be clipped anyway.
I now have a purePerl script (only core modules) to filter H
/S
clips to a threshold --max
. I;ve added a pass-through check for the contig-end cases. I will use the ref.fasta.fai
index for the lengths.
Ok. I will still consider the general issue of filtering clipped reads from the output file, so will leave this issue open for now even though it seems you have solved your specific issue.
Yes it seems to be working well. Thanks for answering my query. I have an -v
option which outputs the reads that would be removed instead. This can be used to approximately do what samblaster
does. eg. samclip -v < samtools.sam | samtools fastq > clipped.fq
I was looking for a fast tool to put into a Unix pipe between bwa and samtools which would remove any alignments with
S
orH
CIGAR clipping, as opposed to extracting said reads.It doesn't seem
samblaster
can do this?