Closed parsonsrachel closed 3 months ago
After looking more into this it appears I was interpreting the maf file incorrectly. The locations reported for the species matching the reverse strand (lines with -) were being reported from the end of the sequence (as described here). In order to get the coordinates in the format reported by halLiftover you have to take the (length of the sequence - reported location - length of aligned block) to get the start coordinate of the block. For human this would be 134,758,134 - 115,464,920 - 667 = 19,292,547. Using this transform gets the coordinates in a format to look from the start of the sequence forward instead of from the end of the sequence.
I have been using the cactus-hal2maf tool to look at orthologous regions in the genomes of a set of primates, but I noticed some interesting behavior that I don't really understand. I wanted to pull out all the regions that aligned to chromosome 10 in chimp and look through the alignment (specifically focusing on repetitive elements). I used this command to generate the maf file:
(cactus version 7.0.0 and singularity (apptainer) version 1.3.3-1.el8)
When I was looking at certain regions, I noticed that the location reported for some of the species did not match the sequence that was being aligned. Specifically, I was looking at the region in chimp from position 28234 with length 668:
At this location in chimp (specifically starting at position 28528), there is a transposable element insertion, and I was looking to see if there was an orthologous TE found in the other species. Since these regions are well aligned, it suggests that the other species should also have the TE at this location, but the annotation is missing from some of the other species. To try and figure out if there was an issue with my repeat annotations, I looked at the reported locations for each species to see what could be happening. When I did this I noticed that for 5 of the 6 target genomes, the sequence aligned in the maf file did not match the sequence present in the genome at the reported location. For example, in human the reported sequence from the maf file is "tttttaattttttgaggaatggtcatactgctttccataatggcta..." but the actual sequence at location 115,464,920 is "CTCAAGCAATCCTCCCG..." (shown below).
Additionally, since the maf file reported the sequence in human was the reverse complement (denoted by -), we also looked at the reported location with that in mind. To do this I looked at the end of the range reported in the maf file. The range given by the maf for the human genome was 115,464,920-115,465,587, so I looked from 115,465,487-115,465,587 and got the sequence "...CCCATTTATTT" which if reverse complemented is "...GGGTAAATAAA" and does not match the maf file either. Below is a screenshot of this sequence range.
The only species with the correct location of its reported sequence was GCA_029289425.2 (bonobo). The other species had the same issue as human where the sequence in the maf file was not the sequence at the reported location.
We also compared the location in the maf file to the location reported by the halLiftover tool (version 2.2). To do this I created a bed file of the region of interest with one line "CM054444.2 28234 28902". Then I used the command:
halLiftover <halfile> GCA_028858775.2 <bed file with location> hs1 hs1.bed
Then I used bedtools to sort and then merge:bedtools sort -i hs1.bed > hs1_sorted.bed
bedtools merge -i hs1_sorted.bed -d 20 > hs1_sorted_merge.bed
The resulting location from this tool was chr10 19,292,547-19,293,214.This is drastically different from the location reported by the maf file for human (chr10:115,464,920-115,465,587). When I look at the sequence at the halLiftover coordinate ("...CCTCAAAAAATTAAAAA"), I get the sequence that was reported under human in the maf file (if you take the reverse complement).
This makes it seem like the coordinate reported by halLiftover is the coordinate that should be reported in the maf file, and there may be some error with which location is reported (even when the sequence is reported correctly).
I was wondering if there is any user error that may be contributing to this confusing behavior.
Thanks in advance!