Hello

I've identified an issue when reference genomes have chromosomes named as integers (e.g. 2), rather than with the usual character+number formulation (e.g. Herato02)

EXAMPLE ISSUE

Running wrath on chromosome 1 and chromosome 26 with the following command: wrath -l -g ${REF} -c ${chr} -w ${winSize} -a ${BAM_LIST} -t ${threads}

chr 26 Chromosome 26 run until the end with no issues. Head of barcodes extracted, looks fine:

head -n1  26/wrath_out/beds/barcodes_26_x_x_sorted_x.bed.gz
26 5124966 5124966 BX:Z:A64C78B74D01

chr 1 Chromosome 1 run until the jaccard matrix step, then throws an ugly tabix error (but wrath does not crash, it remains stuck not doing anything on that step). The error's last line is " File "pysam/libctabix.pyx", line 507, in pysam.libctabix.TabixFile.fetch. ValueError: could not create iterator for region '1:40001-50000'"

But if you go a step back and look at the barcodes extracted:

head -n1  1/wrath_out/beds/barcodes_1_x_x_sorted_x.bed.gz
1641:37:HGLHMDSX2:4:1101:10420:13996 163 163 BX:Z:A96C20B95D71

🚩looks wrong, first field should be the chromosome but it's the (partial) QNAME field tag of the bam file.

WHY IS THE ISSUE HAPPENING

The wrath line in trouble is:

samtools view -q 20 -@ ${threads} ${sample} ${chromosome}:${start}-${end} | 
grep -o -P "${chromosome}.*BX:Z:[^\t\n]*" | 
awk '{print $1"\t"$2"\t"$2"\t"$NF}'

What is it doing? grep is looking for a match in each row with chromosome name, then keeping everything after that match and before BX:Z.

For chromosome 26 that works, because there arent any 26 ahead of the chr column. But for chr 1 it takes the wrong column, as there are 1s before the chr column. In bold are the strings kept for the barcode file for chr 26 and chr 1

chr26 - success

26 5124966 5124966 BX:Z:A64C78B74D01

A01641:37:HGLHMDSX2:4:1568:27398:4664 99 ➡️26⬅️ ➡️5124966⬅️ 60 3S118M1D28M = 5125190 341 CACCCAGACTGCGGGCTGCTTTGTGGAAGTTTCTAAAACCCACAAAGTGTTTTCATCCCGACCCGGAAATCGAACCCGAGACCTCGTGCTCAGCAGCCAAACTTGCGACAACTAGATCAACAGGCAGTCTAATTACAGATTACCTAATA FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFF,:FFFFFFFF MD:Z:7T55G1G30C21^G2A23A1 PG:Z:MarkDuplicates RG:Z:CAM046087 NM:i:7 AS:i:112 XS:i:68 ➡️BX:Z:A54C03B89D71⬅️ PX:Z:GTCGAACAATCTCGTAT QX:Z:FFFF:FFFF:FFF:FFFFFFFFFFFFFF RX:Z:CGAGATCGAAGAGACGGAGACTCTCGAA

chr1 - fail, keeping QNAME field instead of chr name RNAME field

1641:37:HGLHMDSX2:4:1101:10420:13996 5124892 5124892 BX:Z:A96C20B95D71

A0➡️1641:37:HGLHMDSX2:4:1110:27109:5478⬅️ ➡️163⬅️ 1 5124892 60 111M5S = 5124974 237 CACTTAGCTCGTTCATACAAACTGTAAACATTAGGCCGTGACTAAGGACTTCCTACGTAAATATTTTCTATAGGTACTACTAACGTAATTAGTGATCTCACAACGATATTTCCCCA FFF:FFFFFFFFFFFFFFFFFF:FFFFFFF,FFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFF,F:FFF,FFFFFFFFFFFFFFF,FFFFFFFFFFFFFFF,FFF,FF,F MD:Z:1G88G0T19 PG:Z:MarkDuplicates RG:Z:CAM046087 NM:i:3 AS:i:99 XS:i:0 ➡️BX:Z:A96C20B95D71 ⬅️:Z:GTCGAACAATCTCGGTT QX:Z:FFFF:FFFFFFFFFFFFFFFFFFFFFFF RX:Z:CGAGATCCCGTAAAAGAGCTCCACATGG

And then it goes on to make windows for jaccard and freaks out because the positions it expects are not there (well, they are there, but labelled incorrectly)

SOLUTION

Change from:

    samtools view -q 20 -@ ${threads} ${sample} ${chromosome}:${start}-${end} |
    grep -o -P "${chromosome}.*BX:Z:[^\t\n]*" |
    awk '{print $1"\t"$2"\t"$2"\t"$NF}' > wrath_out/beds/barcodes_${chromosome}_${start}_${end}_$(basename "$group" .txt)_$(basename $sample .bam).bed

To:

    samtools view -q 20 -@ ${threads} ${sample} ${chromosome}:${start}-${end} |
    grep -o -P ".*BX:Z:[^\t\n]*" |
    awk '{print $3"\t"$4"\t"$4"\t"$NF}' > wrath_out/beds/barcodes_${chromosome}_${start}_${end}_$(basename "$group" .txt)_$(basename $sample .bam).bed

I've made a pull request that edits these lines as bove to avoid using a match to chromosome with grep, and instead use awk to take the relevant columns. Note that samtools view is already subsetting if wrath is to be run in only one chromosome or region "samtools view -q 20 -@ ${threads} ${sample} ${chromosome}:${start}-${end} | " so not necessary to find matches by chromosome name. https://github.com/annaorteu/wrath/pull/11

hope this makes sense. thank you!

PS: I dont fully understand why keep the bam position column twice for the barcodes table (original line: awk '{print $1"\t"$2"\t"$2"\t"$NF}'). perhaps this could be simplified to one copy of the column only, but havent tested how it would affect subsequent scripts

annaorteu / wrath

Wrath likely to fail if reference genome chromosome names are integers #12

EXAMPLE ISSUE

WHY IS THE ISSUE HAPPENING

chr26 - success

chr1 - fail, keeping QNAME field instead of chr name RNAME field

SOLUTION