ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs
Other
481 stars 106 forks source link

pangenome.vcf.gz seems to report SV coordinates relative to the END, not start, of the variant #1373

Closed adadiehl closed 1 month ago

adadiehl commented 1 month ago

Been experimenting with augmenting the HPRC MC pangenome with additional structural variants embedded into a host genome at known positions. We first build a MC graph using the HPRC samples plus our additional genomes. From there, we extract SVs of interest from the graph for further study.

Oddly, when I try to recover the known variants from the graph, using the procedure below, I only recover the expected sequence if I supply coordinates as chrom:(variant_position - varlant_len)-variant_position, whereas using chrom:variant_position-(variant_position + variant_len) yields a string that lacks the expected variant sequence.

Here's what I did, starting with VCF positions from pangenome.vcf.gz:

1) Translate variant position (minus 1 for zero-based) to coordinate frame of the host genome via liftOver. 2) Add/subtract the variant length to get the expected variant end/start position, respectively. 3) Extract the subgraph for coordinates obtained from 1-2 with odgi extract, retaining only the host path 4) Convert the odgi output to GFA1.0 5) Extract fasta sequence corresponding to the host path as the concatenation of the node sequences implied by the node IDs and orientations in the host "P" line. 6) Search the fasta sequence from 5 for a substring match to the known variant sequence.

Following this procedure, I only recover the expected substring in the output when I subtract the element length from the lifted VCF position. It seems like the reverse should be true. Is this a bug or am I missing something?

glennhickey commented 1 month ago

Sorry, I don't think I'm following exactly what you're doing. I'm pretty confident in the VCF positions of the variants (I think someone would have noticed by now if they are wrong). But if you have a specific example to go over, I'm happy to take a look.

One thing about insertions is that they only have one position -- you don't want to add length to them or you will get a different position. I don't know that can be affecting your analysis...

adadiehl commented 1 month ago

Hi Glenn,

Thank you for the response. With further investigation, this behavior seems to result from how insertion coordinates are handled in one of the upstream tools used to generate the assemblies with which we are augmenting the HPRC graph.

The issue you mention pulling out insertions based on the VCF coordinate does not affect us here as we are using coordinates in the frame of the assembly containing the insertion, so we need to account for the length. What was not immediately clear is that, to get the correct genomic range from liftOver, we actually need to lift over the positions immediately before and after the VCF coordinate (minus one, for zero-based, of course) to get the correct genomic range as a consequence of how the insertions are incorporated to the liftOver chains.

That said, sorry to have bothered you with this.