Closed iljungr closed 4 years ago
Here's another example of weird behavior for a reference gap. This is from a maf file extracted from the same hal file as the original issue using the same parameters except with BTERRESTRIS as the reference. Species other than BTERRESTRIS and AMELLIFERA are not shown here and the final block has been truncated, but a file with the three blocks in full is attached.
s BTERRESTRIS.NC_015773.1 4273841 8 + 12868931 attaacaa
s AMELLIFERA.CM009942.2 3993723 8 + 11514234 ATTAATGA
a
s BTERRESTRIS.NC_015773.1 4273849 0 + 12868931 ---
s AMELLIFERA.CM009942.2 3993720 3 + 11514234 AAT
a
s BTERRESTRIS.NC_015773.1 4273849 908 + 12868931 aacaaaacataGAGAGAAGTAA...
s AMELLIFERA.CM009942.2 3993731 931 + 11514234 CACCAAATACAGGGGAAAGTAA...
The first block covers positions 3993723-3993730 in AMELLIFERA. The third block starts off at the next position, 3993731, exactly where it should if there were nothing intervening, but somehow the aligner decided to add a reference gap aligned to three AMELLIFERA bases from 11 positions earlier. This makes no sense. The middle block only includes BTERRESTRIS and the four Apis species, all of which have AAT 11 bases offset. temp.txt
When hal2maf is used with the --maxRefGap option, in blocks in which the reference species has only gaps the aligned sequences in the other species are on the wrong strand, i.e., the opposite strand from the sequences in the adjacent blocks. I show an example below. I'm guessing this is a problem with hal2maf rather than the original hal file. This happens in all the cases I looked at for the hal file that I'm using, making the --maxRefGap option pretty much useless. I need this option because I am using the alignment to investigate protein-coding potential, and insertions in other species could contain stop codons, which are currently hidden by having the wrong strand.
Here's an example. These are the AMELLIFERA and BTERRESTRIS sequences of three consecutive blocks in the maf file produced by "
hal2maf --refGenome AMELLIFERA --maxRefGap 50 --noAncestors --onlyOrthologs bboutput.hal amel.maf
". The other species have the same problem, but have been removed for clarity.The BTERRESTRIS sequence in the middle block is in the interval (1418797, 1418803) relative to the '-' strand, which is the same chromosome position as positions (5274633-1418797-7, 5274633-1418797-1) = (3855829, 3855835) relative to the '+' strand. This fits snuggly between the two blocks around it, just as it should, but it should be on the '+' strand instead of the '-' strand.
If the three blocks are combined, we get this alignment:
However, the BTERRESTRIS genome sequence is
attgtttaatt attgctG AAAGATA
. The middle section is the reverse complement of what the maf file has (attgctG
instead ofCagcaat
).If the problem is with hal2maf then I would expect it to be reproducible using a simple example constructed for the purpose. If not, and my input file bboutput.hal is needed, I will have to get permission to make it available because the alignment hasn't been published yet. The version of hal2maf I am using is 531758b3553c881b296c82b659654feb50a4c773, last commit May 19, 2019.