PeteHaitch / Lister2BAM

A collection of Python scripts to convert Lister-style alignment files from Lister et al. Nature (2009 and 2011) to BAM format.

MIT License

0 stars 0 forks source link

There are two separate issues at play:

Failure to chomp whitespace from `XR` tag

bismark_methylation_extractor will fail on some files, e.g. ADS, ADS-adipose and ADS-iPSC with the error Unexpected combination of read and genome conversion:.

This is because the whitespace at the end of the XR string is not being chomped off.

`QNAME` beginning with `@`

bismark_methylation_extractor will fail if the QNAME begins with the @ character and the input file is a SAM/BAM. This is because bismark_methylation_extractor skips all lines beginning with @ under the assumption that these correspond to the header lines of the SAM/BAM file.

When data are aligned with bismark this isn't a problem because it removes @ characters from the start of theQNAME field on-the-fly. However, some samples processed with the Lister2BAM scripts, e.g. FF, FF-iPSC-19.11+BMP4, HSF1 and H9_Laurent, will retain @ characters at the beginning of the QNAME field because these data weren't actually aligned with bismark but were rather converted to be bismark-like files.

XM-tag is of the wrong type

I just found another bug. Some of the files -- ADS, ADS-adipose and ADS-iPSC contain readpairs where either read_1 or read_2 is of length 1. This means the XM-tag is also a single character, e.g. XM:A:.. At some point in the conversion process (or afterwards during duplicate marking, sorting, etc.) the XM-tag gets stored as XM:A:. rather than XM:Z:.. A stands for character and Z stands for character string in the SAMtools spec (http://samtools.github.io/hts-specs/SAMv1.pdf).

The problem is that the bismark_methylation_extractor expects it to be of the form XM:Z:<tag> and so it breaks on these XM:A:<tag> values.

Patch to fix these issues

This patch to bismark_methylation_extractor (v0.10.1) will fix these issues and generally make bismark_methylation_extractor more robust.

1779c1779 < if (/(^\@[A-Za-z][A-Za-z](\t+[A-Za-z][A-Za-z0-9]:[ -~]+)+$)|(^\@CO\t.*)/) { # skipping header lines (REGEX from SAM specifications document (v1) http://samtools.github.io/hts-specs/SAMv1.pdf) --- > if (/^\@/) { # skipping header lines (starting with @) 1803c1803 < while ( /(XM|XR|XG):[Z|A]:([^\t]+)/g ) { --- > while ( /(XM|XR|XG):Z:([^\t]+)/g ) { 1821d1820 < chomp $read_conversion; 2019c2018 < if (/(^\@[A-Za-z][A-Za-z](\t+[A-Za-z][A-Za-z0-9]:[ -~]+)+$)|(^\@CO\t.*)/) { # skipping header lines (REGEX from SAM specifications document (v1) http://samtools.github.io/hts-specs/SAMv1.pdf) --- > if (/^\@/) { # skipping header lines (starting with @) 2038c2037 < while ( /(XM|XR|XG):[Z|A]:([^\t]+)/g ) { --- > while ( /(XM|XR|XG):Z:([^\t]+)/g ) { 2040a2040 > 2066,2067c2066,2067 < while ( /(XM|XR):[Z|A]:([^\t]+)/g ) { < my $tag = $1; --- > while ( /(XM|XR):Z:([^\t]+)/g ) { > my $tag = $1; 2068a2069 > 2080,2081d2080 < chomp $first_read_conversion; # in case it captured a new line character < chomp $second_read_conversion; # in case it captured a new line character

PeteHaitch / Lister2BAM

bismark_methylation_extractor can break on some files #1

Failure to chomp whitespace from `XR` tag

`QNAME` beginning with `@`

`XM`-tag is of the wrong type

Patch to fix these issues

PeteHaitch / Lister2BAM

bismark_methylation_extractor can break on some files #1

Failure to chomp whitespace from XR tag

QNAME beginning with @

XM-tag is of the wrong type

Patch to fix these issues

Failure to chomp whitespace from `XR` tag

`QNAME` beginning with `@`

`XM`-tag is of the wrong type