PeteHaitch / methtuple

methtuple is a methylation caller for methylation events that co-occur on the same DNA fragment from high-throughput bisulfite sequencing data, such as methylC-seq.
MIT License
12 stars 3 forks source link

Supporting other aligners #87

Closed PeteHaitch closed 8 years ago

PeteHaitch commented 10 years ago

comethylation only supports reads aligned with Bismark. It should be relatively straightforward to support other aligners but I won't be doing this until I have motivation to do so, i.e. I need to analyse data aligned with another aligner or someone using comethylation requests support. Pull requests adding support for additional aligners are most welcome.

Other aligners should be supported via a command line option --aligner flag, e.g. --aligner bwa-meth. I prefer this approach to my previous idea of having a separate script to "Bismark-ify" the SAM/BAM file.

Aligners I would like to support:

PeteHaitch commented 10 years ago

Notes on BSmooth

These notes are copied from the now-closed https://github.com/PeteHaitch/comethylation/issues/19

SAM/BAM files created by BSmooth contain the read sequence in the YO tag rather than in the SEQ field. In fact, the SEQ field contains the C-to-T converted sequence.

More generally, it appears the SAM/BAM files created by bsmooth may not comply with SAM specifications.

When using the --fr flag to indicate that reads should be in FR orientation (the default for Illumina's paired-end sequencing protocol) the FLAG is set to either 67 or 131. This is not what I expect. I would expect 99 or 147. Need to check my reasoning by eyeballing some reads in IGV.

PeteHaitch commented 10 years ago

General notes on "Bismark-ifying" SAM/BAM files

These notes are copied from the now-closed https://github.com/PeteHaitch/comethylation/issues/30

The XM-tag should be of the same length as SEQ (and QUAL). The SEQ field includes soft-clipped bases and so the corresponding entry of the XM-tag should be .. For inserted bases, the corresponding entry of the XM-tag should also be . because we can't infer it from the reference sequence alone. Technically, we could try to infer this from the read sequence, e.g. a C insertion followed by a G reference match could be evidence of a methylated CG, however, Bismark sets the XM-tag entry to . (based on setting the corresponding genomic sequence to X, which ensures no match between the read and the "reference" at an insertion).

These notes are copied from the now-closed https://github.com/PeteHaitch/comethylation/issues/73

Each read will need an XM-tag, XR-tag and XG-tag. The XM-tag must be as long as the SEQ field. Therefore, soft-clipped positions, as identified by the CIGAR string, should have the corresponding entry in the XM-tag set to .. Bases that are hard-clipped should never appear in the SEQ field.

May need to check the orientation of paired-end reads because Bismark only allows FR orientation (and also permits for a certain amount of overlap), e.g..:

-----> 
    <---- 

Other aligners might allow FF, RF or RR orientation; will these need to be handled separately?

PeteHaitch commented 8 years ago

Bismark remains the dominant aligner in the field. Supporting other aligners is a low priority for me, but pull requests are welcome.