PeteHaitch / Lister2BAM

A collection of Python scripts to convert Lister-style alignment files from Lister et al. Nature (2009 and 2011) to BAM format.
MIT License
0 stars 0 forks source link

bismark_methylation_extractor can break on some files #1

Closed PeteHaitch closed 10 years ago

PeteHaitch commented 10 years ago

There are two separate issues at play:

Failure to chomp whitespace from XR tag

bismark_methylation_extractor will fail on some files, e.g. ADS, ADS-adipose and ADS-iPSC with the error Unexpected combination of read and genome conversion:.

This is because the whitespace at the end of the XR string is not being chomped off.

QNAME beginning with @

bismark_methylation_extractor will fail if the QNAME begins with the @ character and the input file is a SAM/BAM. This is because bismark_methylation_extractor skips all lines beginning with @ under the assumption that these correspond to the header lines of the SAM/BAM file.

When data are aligned with bismark this isn't a problem because it removes @ characters from the start of theQNAME field on-the-fly. However, some samples processed with the Lister2BAM scripts, e.g. FF, FF-iPSC-19.11+BMP4, HSF1 and H9_Laurent, will retain @ characters at the beginning of the QNAME field because these data weren't actually aligned with bismark but were rather converted to be bismark-like files.

PeteHaitch commented 10 years ago

XM-tag is of the wrong type

I just found another bug. Some of the files -- ADS, ADS-adipose and ADS-iPSC contain readpairs where either read_1 or read_2 is of length 1. This means the XM-tag is also a single character, e.g. XM:A:.. At some point in the conversion process (or afterwards during duplicate marking, sorting, etc.) the XM-tag gets stored as XM:A:. rather than XM:Z:.. A stands for character and Z stands for character string in the SAMtools spec (http://samtools.github.io/hts-specs/SAMv1.pdf).

The problem is that the bismark_methylation_extractor expects it to be of the form XM:Z:<tag> and so it breaks on these XM:A:<tag> values.

PeteHaitch commented 10 years ago

Patch to fix these issues

This patch to bismark_methylation_extractor (v0.10.1) will fix these issues and generally make bismark_methylation_extractor more robust.

1779c1779
<       if (/(^\@[A-Za-z][A-Za-z](\t+[A-Za-z][A-Za-z0-9]:[ -~]+)+$)|(^\@CO\t.*)/) {          # skipping header lines (REGEX from SAM specifications document (v1) http://samtools.github.io/hts-specs/SAMv1.pdf)
---
>       if (/^\@/) {         # skipping header lines (starting with @)
1803c1803
<   while ( /(XM|XR|XG):[Z|A]:([^\t]+)/g ) {
---
>       while ( /(XM|XR|XG):Z:([^\t]+)/g ) {
1821d1820
<   chomp $read_conversion;
2019c2018
<   if (/(^\@[A-Za-z][A-Za-z](\t+[A-Za-z][A-Za-z0-9]:[ -~]+)+$)|(^\@CO\t.*)/) {      # skipping header lines (REGEX from SAM specifications document (v1) http://samtools.github.io/hts-specs/SAMv1.pdf)
---
>       if (/^\@/) {         # skipping header lines (starting with @)
2038c2037
<       while ( /(XM|XR|XG):[Z|A]:([^\t]+)/g ) {
---
>       while ( /(XM|XR|XG):Z:([^\t]+)/g ) {
2040a2040
> 
2066,2067c2066,2067
<       while ( /(XM|XR):[Z|A]:([^\t]+)/g ) {    
<         my $tag = $1;
---
>       while ( /(XM|XR):Z:([^\t]+)/g ) {
>         my $tag = $1;
2068a2069
> 
2080,2081d2080
<   chomp $first_read_conversion; # in case it captured a new line character
<   chomp $second_read_conversion; # in case it captured a new line character