ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs
Other
529 stars 111 forks source link

hal2maf always aligns blocks of reference gaps to the wrong strand #83

Closed iljungr closed 4 years ago

iljungr commented 5 years ago

When hal2maf is used with the --maxRefGap option, in blocks in which the reference species has only gaps the aligned sequences in the other species are on the wrong strand, i.e., the opposite strand from the sequences in the adjacent blocks. I show an example below. I'm guessing this is a problem with hal2maf rather than the original hal file. This happens in all the cases I looked at for the hal file that I'm using, making the --maxRefGap option pretty much useless. I need this option because I am using the alignment to investigate protein-coding potential, and insertions in other species could contain stop codons, which are currently hidden by having the wrong strand.

Here's an example. These are the AMELLIFERA and BTERRESTRIS sequences of three consecutive blocks in the maf file produced by "hal2maf --refGenome AMELLIFERA --maxRefGap 50 --noAncestors --onlyOrthologs bboutput.hal amel.maf". The other species have the same problem, but have been removed for clarity.

a
s       AMELLIFERA.CM009939.2   543957  11      +       12354651        tttatttaatt
s       BTERRESTRIS.NC_015777.1 3855818 11      +       5274633 attgtttaatt

a
s       AMELLIFERA.CM009939.2   543968  0       +       12354651        -------
s       BTERRESTRIS.NC_015777.1 1418797 7       -       5274633 Cagcaat

a
s       AMELLIFERA.CM009939.2   543968  7       +       12354651        aaagctA
s       BTERRESTRIS.NC_015777.1 3855836 7       +       5274633 AAAGATA

The BTERRESTRIS sequence in the middle block is in the interval (1418797, 1418803) relative to the '-' strand, which is the same chromosome position as positions (5274633-1418797-7, 5274633-1418797-1) = (3855829, 3855835) relative to the '+' strand. This fits snuggly between the two blocks around it, just as it should, but it should be on the '+' strand instead of the '-' strand.

If the three blocks are combined, we get this alignment:

AMELLIFERA  tttatttaatt-------aaagctA
BTERRESTRIS attgtttaattCagcaatAAAGATA

However, the BTERRESTRIS genome sequence is attgtttaatt attgctG AAAGATA. The middle section is the reverse complement of what the maf file has (attgctG instead of Cagcaat).

If the problem is with hal2maf then I would expect it to be reproducible using a simple example constructed for the purpose. If not, and my input file bboutput.hal is needed, I will have to get permission to make it available because the alignment hasn't been published yet. The version of hal2maf I am using is 531758b3553c881b296c82b659654feb50a4c773, last commit May 19, 2019.

iljungr commented 5 years ago

Here's another example of weird behavior for a reference gap. This is from a maf file extracted from the same hal file as the original issue using the same parameters except with BTERRESTRIS as the reference. Species other than BTERRESTRIS and AMELLIFERA are not shown here and the final block has been truncated, but a file with the three blocks in full is attached.

s       BTERRESTRIS.NC_015773.1 4273841 8       +       12868931        attaacaa
s       AMELLIFERA.CM009942.2   3993723 8       +       11514234        ATTAATGA

a
s       BTERRESTRIS.NC_015773.1 4273849 0       +       12868931        ---
s       AMELLIFERA.CM009942.2   3993720 3       +       11514234        AAT

a
s       BTERRESTRIS.NC_015773.1 4273849 908     +       12868931        aacaaaacataGAGAGAAGTAA...
s       AMELLIFERA.CM009942.2   3993731 931     +       11514234        CACCAAATACAGGGGAAAGTAA...

The first block covers positions 3993723-3993730 in AMELLIFERA. The third block starts off at the next position, 3993731, exactly where it should if there were nothing intervening, but somehow the aligner decided to add a reference gap aligned to three AMELLIFERA bases from 11 positions earlier. This makes no sense. The middle block only includes BTERRESTRIS and the four Apis species, all of which have AAT 11 bases offset. temp.txt