annaorteu / wrath

Wrath: WRapped Analysis of Tagged Haplotypes
GNU General Public License v3.0

Update wrath to allow for species with integer chromosome names #11

Open gmkov opened 2 months ago

gmkov commented 2 months ago

Wrath fails at the Jaccard matrix step because BX tag extraction is done incorrectly when the species' reference genome has chromosome names that are just numbers (e.g. 1 to 31) rather than characters plus numbers (e.g. Herato0204).

To extract barcodes, wrath used grep to find the chromosome name in each BAM record and keep everything from that match up to the BX tag: grep -o -P "${chromosome}.*BX:Z:[^\t\n]*". However, if "chromosome" is just a number, the match can start in the QNAME field (the first field of a BAM record) instead of in RNAME, and the barcode extraction then picks up the wrong columns.
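A minimal sketch of the failure, assuming GNU grep (for -P) and a made-up, tab-separated SAM record; the read name, barcode, and variable names are hypothetical and only chosen so that the read name contains the digit 1:

```shell
# Hypothetical record: QNAME "read1_7" contains the chromosome name "1".
chromosome=1
record=$(printf 'read1_7\t99\t1\t1500\t60\t10M\t=\t1600\t110\tACGTACGTAC\tFFFFFFFFFF\tBX:Z:A01C02B03D04')

# Old-style pattern: start at the first occurrence of the chromosome name.
old_match=$(printf '%s\n' "$record" | grep -o -P "${chromosome}.*BX:Z:[^\t\n]*")

# The match begins inside QNAME, so its first column is not the chromosome.
first_field=$(printf '%s\n' "$old_match" | cut -f1)
printf 'first column of old match: %s\n' "$first_field"   # prints 1_7, not 1
```

With a name such as Herato0204 the same pattern would normally first match in RNAME, which is why the problem only shows up for numeric chromosome names.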

Changed so that grep keeps everything from the start of the record up to and including the BX tag field, and awk then takes only the relevant columns (chr, pos, pos, BX tag). This should generalise to any chromosome name format, as long as the BAM files all have the same layout (third field = RNAME).
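The idea can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: extract_bx is a hypothetical helper name, the input is assumed to be tab-separated SAM text (e.g. from samtools view), GNU grep is assumed for -P, and the sample record is made up.

```shell
# Hypothetical helper: read SAM text on stdin, print chr, pos, pos, BX tag.
extract_bx() {
  # Keep each record from its start through the end of the BX tag;
  # records without a BX tag produce no match and are dropped.
  grep -o -P '^.*BX:Z:[^\t]*' |
    # RNAME is field 3, POS is field 4; the BX tag is now the last field.
    awk 'BEGIN { FS = OFS = "\t" } { print $3, $4, $4, $NF }'
}

# Made-up record with a numeric chromosome name and a "1" in the read name.
out=$(printf 'read1_7\t99\t1\t1500\t60\t10M\t=\t1600\t110\tACGTACGTAC\tFFFFFFFFFF\tNM:i:0\tBX:Z:A01C02B03D04\tRG:Z:s1\n' | extract_bx)
printf '%s\n' "$out"
```

Because the columns are selected by position rather than by matching the chromosome name, the read name no longer affects which fields are extracted.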