A question about halLiftover

Marh32 commented 6 months ago

Hi,

I'm sorry to bother you. I was a little confused when using halLiftover. When using halLiftover to locate the corresponding conserved element (e.g., Conserved Element 1) in the target genome based on annotations from the reference genome, I occasionally receive empty results. How can I ascertain whether this outcome is due to the genuine absence of this element in the corresponding region of the target genome (Case 1) or because of issues such as poor assembly quality of the target genome, leading to the entire region not aligning properly (Case 2)? Thank you so much for your help.

Best regards, Hao

glennhickey commented 6 months ago

There's no easy way to check this from the Cactus output. You could:

- Make a pairwise alignment, maybe with another tool. If the element is aligned there, that would be evidence it's a missed alignment (otherwise it would be more evidence of an assembly issue).
- Export a MAF of the region with cactus-hal2maf, using --maximumGapLength big enough to span your gap (note: this won't work for very big gaps). If you see a big insertion and deletion in the MAF, that'd be a sign of an under-alignment.
- Do a liftover of the element and its flanking regions to see if the flanking regions are present in the target genome. If they are, and there is some sequence in between, you can manually compare it to your missing element...
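For the flanking-region check, a minimal sketch of building the input BED might look like the following. The function name, the `_up`/`_down` labels, and the 1000 bp default flank are illustrative assumptions, not part of HAL or Cactus.

```python
def flanked_bed_lines(chrom, start, end, name, flank=1000, chrom_len=None):
    """Return BED lines for the upstream flank, the element, and the downstream flank.

    Coordinates are 0-based and half-open (BED convention).
    The "_up" / "_down" suffixes are a hypothetical naming scheme.
    """
    if not (0 <= start < end):
        raise ValueError("need 0 <= start < end")
    up_start = max(0, start - flank)
    down_end = end + flank
    if chrom_len is not None:
        # Clip the downstream flank to the sequence length when it is known.
        down_end = min(chrom_len, down_end)
    intervals = [
        (up_start, start, f"{name}_up"),
        (start, end, name),
        (end, down_end, f"{name}_down"),
    ]
    # Skip empty flanks (element at the very start or end of the sequence).
    return [f"{chrom}\t{s}\t{e}\t{label}" for s, e, label in intervals if e > s]
```

The returned lines can be written to a file (say `flanks.bed`, a placeholder name) and lifted with `halLiftover <halFile> <srcGenome> flanks.bed <tgtGenome> out.bed`. If both flanks come back but the element does not, the sequence between the lifted flanks in the target is the candidate to compare by hand.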

Marh32 commented 6 months ago

OK, thank you so much for your reply. Is there a tool that can extract a specific region of a HAL file to FASTA format (retaining the alignment information)?

Marh32 commented 6 months ago

In addition, why can a single line in a BED file correspond to multiple alignment results? Does this indicate that a single contig in the BED file aligns to multiple regions?

My understanding is that HAL files are built by constructing a homology map from anchors produced by tools like LASTZ during whole-genome alignment, which is then extended into a full-genome alignment. If an element in a BED file does not fall within an alignment block, liftover should return an empty result; if it does fall within a block, it should return a unique mapping. Why would multiple results be returned?

glennhickey commented 6 months ago

If there's one copy of gene A in species 1 and two copies in species 2, then all three copies will (probably) be aligned together in Cactus. Due to such paralogous relationships, you can expect a given query region to map to multiple reference regions. There's a tool, halSynteny, that tries to filter this somewhat. You can run it yourself or within cactus-hal2chains.
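To see how many distinct target loci each query element maps to, something like the sketch below could be run on the liftover output. It assumes the output BED keeps the input name in column 4, and the `max_gap` merge distance is an arbitrary illustrative threshold rather than anything HAL defines.

```python
from collections import defaultdict


def cluster_hits(bed_lines, max_gap=10000):
    """Group lifted intervals by BED name, then merge nearby ones into loci.

    Returns {name: [(chrom, start, end), ...]}. More than one locus for a
    name suggests the element maps to several places (e.g. paralogs).
    """
    by_name = defaultdict(list)
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # a name column is needed to tie hits back to the query
        chrom, start, end, name = fields[0], int(fields[1]), int(fields[2]), fields[3]
        by_name[name].append((chrom, start, end))

    loci = {}
    for name, hits in by_name.items():
        merged = []
        for chrom, start, end in sorted(hits):
            # Fragments on the same sequence within max_gap count as one locus.
            if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
                prev = merged[-1]
                merged[-1] = (chrom, prev[1], max(prev[2], end))
            else:
                merged.append((chrom, start, end))
        loci[name] = merged
    return loci
```

A single element split into several nearby fragments collapses into one locus here, while hits on other sequences or far away stay separate, which makes the paralog-like cases easier to spot.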

Marh32 commented 6 months ago

Thank you so much for your reply. I have tried using halSynteny to filter it, and I got the following results:

In this situation, should the alignment result on the third line be considered an error, or attributed to such paralogous relationships? When searching for orthologous genes or conserved elements, should I filter out alignment results like the third line? Also, I find that some alignment information is missing between blocks in the returned result (such as from 30317824 in the first row to 30323901 in the second row). Is there any way I can get this missing alignment information? Thank you for your help.
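As a rough sketch, the uncovered stretches between returned blocks can be listed like this, so they can be inspected separately. The function name is hypothetical, and the blocks are assumed to be already parsed into `(chrom, start, end)` tuples.

```python
def gaps_between_blocks(blocks):
    """Return (chrom, gap_start, gap_end) for uncovered stretches between blocks.

    `blocks` is an iterable of (chrom, start, end) tuples in BED-style
    0-based, half-open coordinates; overlapping blocks are handled.
    """
    gaps = []
    prev_chrom, prev_end = None, None
    for chrom, start, end in sorted(blocks):
        if chrom == prev_chrom:
            if start > prev_end:
                gaps.append((chrom, prev_end, start))
            prev_end = max(prev_end, end)
        else:
            # First block on a new sequence: nothing to compare against yet.
            prev_chrom, prev_end = chrom, end
    return gaps
```

For two blocks on the same sequence ending at 30317824 and starting at 30323901, this reports one gap of 6077 bases. Such gap intervals could then be written to BED and examined, for instance with the MAF export suggested earlier in the thread.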

ComparativeGenomicsToolkit / hal

A question about halLiftover #301