Closed PeteHaitch closed 10 years ago
XM
-tag is of the wrong typeI just found another bug. Some of the files -- ADS
, ADS-adipose
and ADS-iPSC
contain readpairs where either read_1
or read_2
is of length 1. This means the XM
-tag is also a single character, e.g. XM:A:.
. At some point in the conversion process (or afterwards during duplicate marking, sorting, etc.) the XM
-tag gets stored as XM:A:.
rather than XM:Z:.
. A
stands for character and Z
stands for character string in the SAMtools spec (http://samtools.github.io/hts-specs/SAMv1.pdf).
The problem is that the bismark_methylation_extractor
expects it to be of the form XM:Z:<tag>
and so it breaks on these XM:A:<tag>
values.
This patch to bismark_methylation_extractor
(v0.10.1
) will fix these issues and generally make bismark_methylation_extractor
more robust.
1779c1779
< if (/(^\@[A-Za-z][A-Za-z](\t+[A-Za-z][A-Za-z0-9]:[ -~]+)+$)|(^\@CO\t.*)/) { # skipping header lines (REGEX from SAM specifications document (v1) http://samtools.github.io/hts-specs/SAMv1.pdf)
---
> if (/^\@/) { # skipping header lines (starting with @)
1803c1803
< while ( /(XM|XR|XG):[Z|A]:([^\t]+)/g ) {
---
> while ( /(XM|XR|XG):Z:([^\t]+)/g ) {
1821d1820
< chomp $read_conversion;
2019c2018
< if (/(^\@[A-Za-z][A-Za-z](\t+[A-Za-z][A-Za-z0-9]:[ -~]+)+$)|(^\@CO\t.*)/) { # skipping header lines (REGEX from SAM specifications document (v1) http://samtools.github.io/hts-specs/SAMv1.pdf)
---
> if (/^\@/) { # skipping header lines (starting with @)
2038c2037
< while ( /(XM|XR|XG):[Z|A]:([^\t]+)/g ) {
---
> while ( /(XM|XR|XG):Z:([^\t]+)/g ) {
2040a2040
>
2066,2067c2066,2067
< while ( /(XM|XR):[Z|A]:([^\t]+)/g ) {
< my $tag = $1;
---
> while ( /(XM|XR):Z:([^\t]+)/g ) {
> my $tag = $1;
2068a2069
>
2080,2081d2080
< chomp $first_read_conversion; # in case it captured a new line character
< chomp $second_read_conversion; # in case it captured a new line character
There are two separate issues at play:
Failure to chomp whitespace from
XR
tagbismark_methylation_extractor
will fail on some files, e.g.ADS
,ADS-adipose
andADS-iPSC
with the errorUnexpected combination of read and genome conversion:
.This is because the whitespace at the end of the
XR
string is not being chomped off.QNAME
beginning with@
bismark_methylation_extractor
will fail if theQNAME
begins with the@
character and the input file is aSAM/BAM
. This is becausebismark_methylation_extractor
skips all lines beginning with@
under the assumption that these correspond to the header lines of theSAM/BAM
file.When data are aligned with
bismark
this isn't a problem because it removes@
characters from the start of theQNAME
field on-the-fly. However, some samples processed with theLister2BAM
scripts, e.g.FF
,FF-iPSC-19.11+BMP4
,HSF1
andH9_Laurent
, will retain@
characters at the beginning of theQNAME
field because these data weren't actually aligned withbismark
but were rather converted to be bismark-like files.