FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0
52 stars 20 forks source link

Supporting soft-clipping #28

Closed Shians closed 3 years ago

Shians commented 4 years ago

Would it be particularly difficult to support soft-clipping in reads? This would be very useful for use with alternative aligners, in my case I'm using minimap2 for Nanopore long reads.

FelixKrueger commented 4 years ago

I don't think it would be particularly hard, it is nevertheless probably a fair bit of work. Given that we have recently added support for soft-clipped reads in Bismark, we could probably justify this for SNPsplit as well.

Even if we were to enable this, I am not sure whether the whole approach would scale to long Nanopore reads though (with tens of thousands of operations in the CIGAR and MD:Z: fields etc). In case I would start looking at this next week, would you have a file with a few lines of Nanopore alignments produced by minimap2 for me as a test case?

FelixKrueger commented 4 years ago

I just got hold of an ONT BAM file mapped with minimap2. The first thing I noticed is that the file does not have an MD:Z:... string by default, and hence SNPsplit died. This string can be re-generated with the genome reference file, but I ideally we would like to have it included in the file to start with (as this step is quite slow). For reference, this samtools command allows one to regenerate it:

samtools calmd ONT_minimap2.bam GRCh38.fa | samtools view -bS - > ONT_minimap2_MDZ.bam

While looking at the options we have here, I found on the Minimap2 page that it can optionally also print a CS-tag: https://github.com/lh3/minimap2#the-cs-optional-tag. In contrast to the MD:Z string, which cannot be parsed on its own but requires parallel processing of the MD:Z:, CIGAR and sequence fields, the CS-tag encodes all necessary information in a single string (i.e. matches, mismatches, deletions, insertions). It might be a nice addition to support reading this flag instead of the conventional processing, e.g. with an otion --cs. But first things first, lets see if we can get soft-clipping to be recognised.

@Shians, if and when you are able to supply a test file, could it ideally please be:

Many thanks!

FelixKrueger commented 4 years ago

Hi @Shians

I've got some good news, and some not quite yet good news.

The good news: Soft-clipping support in SNPsplit seems to be working now!

The not quite there yet good news: The Nanopore data I have at hand is quite funky.

MD:Z:1T2G0C0T0T0C1G0A0G0G0G0T0G0G0C0T0T0T0G0T0C0T2G0T1G1A0G0T0G0G0A1G0C0A0A0T2
T0C0C0C0C0A1C1T0C1T1C0^CC0A0A0A0A0G1G0G0C0A0T1A0G0A2C0A0A2C0T1C0T0G1T0C0T2G0C
0C0C0A0A0A0C0C0A0T0T0G1C0C0T0G1T0T0C1G0^C2C2G2C0T0G2G0G1G0C0T0T2G0G0C1G0G0G0G
0A0A0G0A0G0C0T0G0C1C0A2G1^GG0G0G1T1G0G0T2A0^GCCT0C0T0G0C0T1T0G0T0G0A0C1T2G
0A0T0G0A0G0T1A0C1C0A0C0C1C0T0C1G0A0C0C0A0T0C0T0A0C0A1A1T0G0T2G0C0A0C0C0A1G0A0
C0C0A1C0G0G1G0^A1C1^T2G0G0T0T1G0G1A1T1G0G1G0G0C0T0T0C0C0T0T0G0G0A0G0A2T0G1G0
A0G0T0G0G0G0G0A0G0G0G2G0G0G0G0C0T0G0A1T0C0T0G1G1C0C1G0G0A1T1G0G0A0T0C2C0C0A2
A2A0G0A0G0C0A1G0C0A4G0A0A0A0T0G0C0T2T0G0G0C0A0G0C0G0G0C1G0C0A0G0T0G0G0G0C0G0
T0G0G0C2C0A0G1^C0T0T0T1C0A0G0C0T0C0C0C0T0T0C1G0C1G2A1A0C1C0C1C0T0C0A0T0G0C0C0C
0C0G0C1C0C0C0T0G0G0G0T0C0C1T0C0A4^CCT0C1C0C0T0C0^A1A0C0C0^CT1G0G0C0T2^C0T0C0G
0G0C1C0A0G2G0A1A0G0C0C0T1G0A0G0G0A0G0G0G0G0A0T0G0G0G0G0C1^C0A0T1C0T0^TG0T0T1
T0C0A0C0T0T0C0A0G3C0G0C0T0T0T1G0G1A0T0A0G0C0^CAC0C2C1G2G0G0C0G1C0G1G0G0C0G0G
0G0G1G0G0G0T1G0T0T0G0G0G1G0C1T0G0G0^TT0G0G0G1G0G0T0G0C0T1A0G0A0G0G1T1C0A1G0A
0T0A0A2T0T1C0A0C0A0G0A0C0T0T1G0G0T0T0T0G1A0A1C0C0A0C0T1G0C0C3T0G0G1C0T0C0T0C4C
0G1A0G0C2C1T0C0C0T1T0G1A1^AA0T0G0G0G0G0A0T0C1A0A0C0C0G0T0T0C1C1C0C0C1A1A0G0G0
G0A0G2G0G0C0G0C0A0C0A0C0C2C0A0T0T0C1A0G0G0C0T0G0G1T0G1A0T0C0C0C0T1G0G0C1G0C0
G0G1G1C0C1C0C0C0C2C1A1G1C0G1A0A0G1C0C0T0G1C0C0T0C0C1C0T0T0C0T0G0T0C0T1T0C0C0C0
T0G0T0C0C0G1G0C1G0T1G0G0A1A0G0^C0A0G0G1C0A0G0A0T0G0G0C0T1A0G0G0G0A0G0G0A0G0
G0C2G0A0T0G0G0G0G0G0A0G1G0G0G2C0T0A0C0A1C0C0A0C5G0G0T0G0G2G0T0G2G1C2A0C0C0T0
G1C0A0T0C0A0G0T0G0T1G0C1C1C0A0G0G0C0G0T0C0^C0T0G0T1C0A3T0C1A1G1C2G0T0C0T0T0C0
C0C0A0C0T0T0C1C0^T0G0T1C0G0G0G0A0T1A2G0G0A0G0G0G0G6T0G0C0G0G0C0C0^AC0C1G0C0G
0G0G0G0C0C0T1C0A1A0T0T0C0C0G0G0C0T0C0C0T0^G0G0C0A0G1C1T0C0A0A0C0A1C0G0A1C0A0G
0G0G0T1A0G0C0C0C0C0C1G0G3G0C0C0T1A0C0C0C0T0G0G0C0C0C0T1C0C0T0G0G0G0G0G0C0C0G0
G0G0G0C2G0G0G0A0G1C1T0C1C2C0T0C0C1C1G1G0T0A0C0A0T1T0C0T0T0G0C0^T0C1T1C0C1T0C0C
0C0^TC0T0C0T0C2C2C0T0T0T2C0C1C0C0T1C1^T0G1T1C0C0A0C2C0C0T0T0C0A1G0G0A0G0G0A0T0
G0G0A0A0G0G0G1C0T1A0C2G1G0T1G0A0G1C0C0C0T0G1C2G0T0G0T0T0G0A0G0G0G0G2A0G0G0G0
G0C5G0G0G0T0A0G0A0A0^G0G0G0A1G1G0G0C2G0G0C0A0G0G0G0A0G0G0C0A0A0G0A0G1C0T0G0
G0T0T0G0T0G0G0G0G0G0C0C0T0G0^G0A0A0T1C0C0A0G0G3C0A0G1T0T0C1G0A0G0T0C0A0G0G6C
0A0C0C0C0C0C1C0C0A0C1C4C2T0T1G0T0C0T0G3C0T0T0G0A0G2T0G0G0G0T0C0G0G0^GT0G0G0G0
A0^G0C0G0C0C0A3T0G0G0C1C0C0C0C0A1T0G0C0C0C0T0C0C0T0T0C0C0C0C0A0G0A0^G0A0C0C0A
3C0C0T2C0T2A0C0A0A0G0A0C0C0A0G2T0C0C0G1G0T0G0G0A0C1T1C0C2T0C0G2C1G0C0C1G0A0G0
A1C2C1T0T0T2G2G0C0G0C1G0G0G0G1G0A2G0G0G0T0G0C1T0T0G0G1C0C1C0T0G3G0T0C0C0A0A1A
1G0C1T1C0C0C0C0C0C1A0G1C0C0T0G0G0A0C0T0G0C1G0C0C0C0C0T0G0G0T1C0T0G0G0G2C1G1C1
C0T0G0G0T3A0C0C1C0C0C0T1C0A0C0C0C0C2A0G0T0G0G0C0A0T0C0T0G0C0G0G0C0G0T0C0C1G0A
0^GC1G1G0C2A0C0C1C0T0T0C2T0C0A0G0C0G2C0C2C0C0T0C0A0G0C0T0T0C0A0T2A0G0A0C0C0A0
A1C0G0G0T1C0A3C0C1A0A0C0T0G0C0T0G0^G0C1A0C1A0G0G0T2G0C0T0C3G0C0T0A0G0G0G0G0C
1G0G0G0G0A1A0T0C0T0C0A0G0T1A0G1C1C0T0T0G0T0G0G0A0C0T1C1G0A0C1T1A0T0G0T0A0G0A0
A1G0C0C0C0C0A0C0C0C0C3

Currently, reads that have deletions of more than 1 annotated N positions are skipped, as they are quite difficult to process, and occur hardly ever. In 'normal' data. I would imagine that in reads this long, multi-N deleted reads could occur quite regularly, in which case we would need to revisit how those reads are treated

To address this last point, I would really like to get some clean hybrid test data along with an annotated list of SNP positions so we could make some informed decisions. I will for the moment hold back with any further modifications if that's OK.

The current version of SNPsplit that supports soft-clipping can be cloned from development branch. Let me know how you are getting on.

FelixKrueger commented 4 years ago

https://github.com/FelixKrueger/SNPsplit/pull/30

FelixKrueger commented 3 years ago

I will now close this issue as I would like to make a new release for SNPsplit.

If the Nanopore issue comes back up and someone can provide some clean ONT hybrid data we can always revisit.