ComparativeGenomicsToolkit / hal

Hierarchical Alignment Format

A question about halLiftover #301

Open Marh32 opened 6 months ago

Marh32 commented 6 months ago

Hi,

I'm sorry to bother you. I was a little confused when using halLiftover. When using halLiftover to locate the corresponding conserved element (e.g., Conserved Element 1) in the target genome based on annotations from the reference genome, I occasionally receive empty results. How can I ascertain whether this outcome is due to the genuine absence of this element in the corresponding region of the target genome (Case 1) or because of issues such as poor assembly quality of the target genome, leading to the entire region not aligning properly (Case 2)? Thank you so much for your help.

Best regards, Hao

[Image: Picture1]

glennhickey commented 6 months ago

There's no easy way to check this from the cactus output. You could

Marh32 commented 6 months ago

OK. Thank you so much for your reply. Is there any tool that can extract a specific region of a HAL file to FASTA format (retaining the alignment information)?

Marh32 commented 6 months ago

In addition, why can a single line in a BED file correspond to multiple alignment results? Does this indicate that a single contig in the BED file aligns to multiple regions?

My understanding is that HAL files are derived by constructing a homology map from anchors produced by tools like LASTZ during whole-genome alignment, which eventually yields the full-genome comparison. If an element in a BED file does not reside within a block, it should return an empty result, whereas if it is within a block, it should return a unique mapping result. Why would multiple results be returned?

[Image: Picture1]

[Image: Screenshot 2024-04-27 at 22 47 41]

glennhickey commented 6 months ago

If there's one copy of gene A in species 1 and two copies in species 2, then all three copies will (probably) be aligned together in Cactus. Due to such paralogous relationships, you can expect a given query region to map to multiple reference regions. There's a tool, halSynteny, that tries to filter this somewhat. You can run it yourself or within cactus-hal2chains.
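To see which query features are affected by this kind of one-to-many mapping, the lifted-over records can be grouped by feature name. The following is only a minimal sketch, not part of the HAL tools: it assumes the liftover output is a tab-separated BED file whose fourth column carries the name of the source feature, and the function names `group_liftover_hits` and `multi_sequence_features` are hypothetical. Note that a feature split into several blocks on the same target sequence may simply be interrupted by indels, so the sketch only flags features that land on more than one target sequence.

```python
from collections import defaultdict


def group_liftover_hits(bed_lines):
    """Group lifted-over BED records by feature name.

    Assumption: column 4 holds the name of the source feature.
    Returns {name: [(target_seq, start, end), ...]} in input order.
    """
    hits = defaultdict(list)
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # no name column, so the record cannot be grouped
        hits[fields[3]].append((fields[0], int(fields[1]), int(fields[2])))
    return dict(hits)


def multi_sequence_features(hits):
    """Return the names of features whose hits span more than one target sequence."""
    return sorted(
        name
        for name, regions in hits.items()
        if len({seq for seq, _, _ in regions}) > 1
    )


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    example = [
        "chr1\t100\t200\tCE1",
        "chr1\t250\t300\tCE1",
        "chr7\t500\t600\tCE1",
        "chr2\t10\t50\tCE2",
    ]
    grouped = group_liftover_hits(example)
    for name in multi_sequence_features(grouped):
        print(name, grouped[name])
```

Features reported this way are candidates for paralogous mappings and may be worth checking against the halSynteny output.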

Marh32 commented 6 months ago

Thank you so much for your reply. I tried using halSynteny to filter it, and I got the following results:

[Image: Screenshot 2024-04-29 at 14 46 23]

In this situation, should the alignment result on the third line be considered an error, or attributed to such paralogous relationships? When searching for orthologous genes or conserved elements, should I filter out alignment results like the third line? Also, I find that some alignment information is missing between blocks in the returned result (for example, from 30317824 in the first row to 30323901 in the second row). Is there any way I can get this missing alignment information? Thank you for your help.
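The gaps between returned blocks can at least be listed programmatically, so that each uncovered interval can be inspected separately. The sketch below is an illustration under stated assumptions: blocks are given as half-open `(start, end)` intervals on a single sequence, the helper name `gaps_between_blocks` is hypothetical, and the outer coordinates in the example are made up; only 30317824 and 30323901 come from the result above.

```python
def gaps_between_blocks(blocks):
    """List the intervals left uncovered between aligned blocks.

    blocks: iterable of (start, end) half-open intervals on one sequence.
    Returns a list of (gap_start, gap_end) tuples in coordinate order.
    Overlapping or nested blocks are merged before gaps are reported.
    """
    gaps = []
    covered_end = None
    for start, end in sorted(blocks):
        if covered_end is None:
            covered_end = end
            continue
        if start > covered_end:
            gaps.append((covered_end, start))
        covered_end = max(covered_end, end)
    return gaps


if __name__ == "__main__":
    # The first block ends at 30317824 and the second starts at 30323901;
    # the other two coordinates are hypothetical placeholders.
    blocks = [(30310000, 30317824), (30323901, 30330000)]
    for gap_start, gap_end in gaps_between_blocks(blocks):
        print(gap_start, gap_end, gap_end - gap_start)  # 30317824 30323901 6077
```

Each reported gap could then be written out as its own BED record and lifted over on its own, to check whether anything in that interval aligns at all.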