ComparativeGenomicsToolkit / taffy

This is a C/Python/CLI library for working with TAF (.taf, .taf.gz) and MAF (.maf) alignment files
MIT License

Support q-lines in MAF #39

Open glennhickey opened 7 months ago

glennhickey commented 7 months ago

MAF q-lines (news to me) are defined here

 Lines starting with "q" -- information about the quality of each aligned base for the species

 s hg18.chr1                  32741 26 + 247249719 TTTTTGAAAAACAAACAACAAGTTGG
 s panTro2.chrUn            9697231 26 +  58616431 TTTTTGAAAAACAAACAACAAGTTGG
 q panTro2.chrUn                                   99999999999999999999999999
 s dasNov1.scaffold_179265     1474  7 +      4584 TT----------AAGCA---------
 q dasNov1.scaffold_179265                         99----------32239--------- 

This PR adds support for them in TAF, using column tags with key == "q" whose value is a string of ASCII Phred values (min=0=!, max=93=~). This full range can't be represented in MAF, which only supports 10 different values.
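To make the encoding concrete, here is a minimal Python sketch of the ASCII Phred mapping described above (offset 33, so 0 is "!" and 93 is "~"). The function names are hypothetical and are not part of taffy. The `maf_digit` helper follows my reading of the UCSC MAF description, min(floor(q/5), 9); treat that formula as an assumption rather than what this PR implements.

```python
PHRED_OFFSET = 33  # "!" is ASCII 33, so Phred 0 maps to "!"
MAX_PHRED = 93     # "~" is ASCII 126 = 33 + 93


def phred_to_ascii(quals):
    """Encode a list of Phred scores (0..93) as an ASCII string."""
    for q in quals:
        if not 0 <= q <= MAX_PHRED:
            raise ValueError(f"Phred score out of range: {q}")
    return "".join(chr(q + PHRED_OFFSET) for q in quals)


def ascii_to_phred(text):
    """Decode an ASCII Phred string back into a list of scores."""
    return [ord(c) - PHRED_OFFSET for c in text]


def maf_digit(q):
    """Assumed MAF quantization: collapse a Phred score to one digit 0-9."""
    return str(min(q // 5, 9))


quals = [0, 12, 45, 93]
encoded = phred_to_ascii(quals)
print(encoded)                              # full-resolution TAF-style string
print("".join(maf_digit(q) for q in quals)) # lossy MAF-style digits
```

Under this assumed mapping, every score of 45 or above becomes the digit 9, which is why a round trip through MAF loses information.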

Also, since everything needs to be transposed between MAF rows and TAF columns all the time, it assumes there's a score for every position. Where there isn't one, the max score is used as a default.
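A rough sketch of that transposition with default filling, in Python. The data layout is an assumption for illustration only: each row is either `None` (no q line for that sequence) or a list with one score per column, using `None` at gap positions. How taffy actually stores rows and treats gaps is not shown in this thread.

```python
MAX_PHRED = 93  # assumed default, the top of the ASCII Phred range


def transpose_qualities(rows, width, default=MAX_PHRED):
    """Turn per-row quality lists into per-column lists.

    Any position without a score (missing q line or gap) gets `default`.
    """
    columns = []
    for col in range(width):
        column = []
        for row in rows:
            q = None if row is None else row[col]
            column.append(default if q is None else q)
        columns.append(column)
    return columns


rows = [
    None,            # a sequence with no q line at all
    [45, 45, 45],    # a sequence with a score at every position
    [45, None, 10],  # a sequence with a gap in the middle column
]
columns = transpose_qualities(rows, width=3)
print(columns)  # [[93, 45, 45], [93, 45, 93], [93, 45, 10]]
```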

glennhickey commented 6 months ago

The ASCII Phred values weren't working in practice because they could include ":" (Phred 25), which is a reserved character in the TAF tag format. So I've switched to using vg's base64 encoding of the raw 1-byte quality values.
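The problem and the fix can be illustrated with a short Python sketch. It uses Python's standard `base64` module as a stand-in; vg's encoder may differ in alphabet or padding, so this is an assumption and not the code used in the PR. The helper names are hypothetical.

```python
import base64


def encode_qualities(quals):
    """Pack Phred scores as raw bytes (one per score), then base64-encode."""
    return base64.b64encode(bytes(quals)).decode("ascii")


def decode_qualities(text):
    """Reverse of encode_qualities: base64 text back to a list of scores."""
    return list(base64.b64decode(text))


# Phred 25 is the problem case for the ASCII form: 25 + 33 = 58 = ":".
assert chr(25 + 33) == ":"

quals = [25, 0, 93, 45]
encoded = encode_qualities(quals)

# The standard base64 alphabet is A-Z, a-z, 0-9, "+", "/" and "=" padding,
# so the encoded tag value never contains ":".
assert ":" not in encoded
assert decode_qualities(encoded) == quals
```

The trade-off is that the tag value is no longer human-readable and is roughly a third longer than the one-character-per-score ASCII form.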