ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs
Other
505 stars 111 forks source link

Locations When Using cactus-hal2maf #1462

Closed parsonsrachel closed 1 month ago

parsonsrachel commented 1 month ago

I have been using the cactus-hal2maf tool to look at orthologous regions in the genomes of a set of primates, but I noticed some interesting behavior that I don't really understand. I wanted to pull out all the regions that aligned to chromosome 10 in chimp and look through the alignment (specifically focusing on repetitive elements). I used this command to generate the maf file:

cactus-hal2maf --chunkSize 500000 --refGenome GCA_028858775.2 --refSequence CM054444.2 --start 1 --length 140160334 --targetGenomes GCA_029281585.2,GCA_028885655.2,GCA_028885625.2,GCA_029289425.2,GCA_028878055.2,GCA_028858775.2,hs1 --noAncestors --coverage --filterGapCausingDupes --dupeMode single --binariesMode singularity file:jobstore.txt 7-t2t-apes-2023v2.hal chr10_PanTro.maf

(cactus version 7.0.0 and singularity (apptainer) version 1.3.3-1.el8)

When I was looking at certain regions, I noticed that the location reported for some of the species did not match the sequence that was being aligned. Specifically, I was looking at the region in chimp from position 28234 with length 668:

Screenshot 2024-08-13 at 9 38 15 AM

At this location in chimp (specifically starting at position 28528), there is a transposable element insertion, and I was looking to see if there was an orthologous TE found in the other species. Since these regions are well aligned, it suggests that the other species should also have the TE at this location, but the annotation is missing from some of the other species. To try and figure out if there was an issue with my repeat annotations, I looked at the reported locations for each species to see what could be happening. When I did this I noticed that for 5 of the 6 target genomes, the sequence aligned in the maf file did not match the sequence present in the genome at the reported location. For example, in human the reported sequence from the maf file is "tttttaattttttgaggaatggtcatactgctttccataatggcta..." but the actual sequence at location 115,464,920 is "CTCAAGCAATCCTCCCG..." (shown below).

Screenshot 2024-08-13 at 9 27 04 AM

Additionally, since the maf file reported the sequence in human was the reverse complement (denoted by -), we also looked at the reported location with that in mind. To do this I looked at the end of the range reported in the maf file. The range given by the maf for the human genome was 115,464,920-115,465,587, so I looked from 115,465,487-115,465,587 and got the sequence "...CCCATTTATTT" which if reverse complemented is "...GGGTAAATAAA" and does not match the maf file either. Below is a screenshot of this sequence range.

Screenshot 2024-08-13 at 1 39 03 PM

The only species with the correct location of its reported sequence was GCA_029289425.2 (bonobo). The other species had the same issue as human where the sequence in the maf file was not the sequence at the reported location.

We also compared the location in the maf file to the location reported by the halLiftover tool (version 2.2). To do this I created a bed file of the region of interest with one line "CM054444.2 28234 28902". Then I used the command: halLiftover <halfile> GCA_028858775.2 <bed file with location> hs1 hs1.bed Then I used bedtools to sort and then merge: bedtools sort -i hs1.bed > hs1_sorted.bed bedtools merge -i hs1_sorted.bed -d 20 > hs1_sorted_merge.bed The resulting location from this tool was chr10 19,292,547-19,293,214.

This is drastically different from the location reported by the maf file for human (chr10:115,464,920-115,465,587). When I look at the sequence at the halLiftover coordinate ("...CCTCAAAAAATTAAAAA"), I get the sequence that was reported under human in the maf file (if you take the reverse complement).

Screenshot 2024-08-13 at 1 44 33 PM

This makes it seem like the coordinate reported by halLiftover is the coordinate that should be reported in the maf file, and there may be some error with which location is reported (even when the sequence is reported correctly).

I was wondering if there is any user error that may be contributing to this confusing behavior.

Thanks in advance!

parsonsrachel commented 1 month ago

After looking more into this it appears I was interpreting the maf file incorrectly. The locations reported for the species matching the reverse strand (lines with -) were being reported from the end of the sequence (as described here). In order to get the coordinates in the format reported by halLiftover you have to take the (length of the sequence - reported location - length of aligned block) to get the start coordinate of the block. For human this would be 134,758,134 - 115,464,920 - 667 = 19,292,547. Using this transform gets the coordinates in a format to look from the start of the sequence forward instead of from the end of the sequence.