ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Incorporating short-read sequence assemblies for SV detection #1116

Open whyskyisgray opened 1 year ago

whyskyisgray commented 1 year ago

Hello Developers,

I've successfully built a pan-genome consisting of 12 genomes using MC without any problems. Thank you for the helpful documentation on GitHub.

Now, I'm planning to find more SVs by adding short-read sequence data.

In GitHub Issue #943, you recommended either calling structural variants that are already in the graph (using PanGenie or vg call) or creating a new VCF by surjecting.

However, I would like to explore relatively larger SVs using short-read sequences. My question is whether the method I'm planning is feasible. (The genome size of my species is about 200 Mbp.)

My proposed workflow is as follows:

  1. Assemble short-read sequences into contigs using assembly tools.
  2. Utilize these contig-level assemblies as input files for MC.
  3. Perform structural variant (SV) calling with PanGenie.
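
As a rough illustration of step 2, the contig-level assemblies would be listed in a seqfile that maps each sample name to its FASTA path. The sketch below assumes a two-column, tab-separated layout with one line per assembly; the sample names and file paths are hypothetical placeholders, and the helper `build_seqfile` is not part of any toolkit.

```python
def build_seqfile(samples):
    """Return seqfile text with one 'name<TAB>path' line per assembly.

    `samples` maps a sample name to the path of its contig-level FASTA.
    The two-column layout is an assumption about the expected input.
    """
    lines = []
    for name, fasta in samples.items():
        # Whitespace in a name would break the column layout, so reject it.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid sample name: {name!r}")
        lines.append(f"{name}\t{fasta}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical names and paths, for illustration only.
    text = build_seqfile({
        "reference": "assemblies/reference.fa",
        "shortread_sample_1": "assemblies/shortread_sample_1.contigs.fa",
    })
    print(text, end="")
```

Each short-read assembly adds one line, so the existing 12 genomes and the new contig sets can share a single file.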

Please let me know whether this approach is appropriate.

Thank you for your support, and I look forward to hearing from you soon!

Best Regards

glennhickey commented 1 year ago

I'm not aware of any way to call de novo SVs directly from short reads using a graph.

So your proposal of assembling your samples and then making a new graph seems reasonable. If you include all your samples, you can see your variants directly in the VCF that MC makes (no need for PanGenie, unless you are exploring additional samples).

You just need to keep in mind that the quality of your results will depend quite a bit on the quality of the short read assemblies...
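
One common way to get a quick read on assembly contiguity before building the graph is contig N50. The sketch below is only an illustration, not something prescribed in this thread: the helper names `contig_lengths` and `n50` are hypothetical, and N50 is just one proxy for assembly quality.

```python
def contig_lengths(fasta_text):
    """Return the length of each record in FASTA-formatted text."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the contig length at which half of the total bases are covered."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


if __name__ == "__main__":
    example = ">contig1\nACGTAC\n>contig2\nGG\n"
    print(n50(contig_lengths(example)))
```

A highly fragmented short-read assembly would show a low N50 relative to the expected size of the SVs of interest, which is a hint that large variants may be split across contigs.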

whyskyisgray commented 1 year ago

Thank you for your prompt and clear response :)

I thought that using PanGenie would enable calling in heterozygous regions.

Thank you for your answer! Have a great day!

glennhickey commented 1 year ago

You're right -- you'd need to re-genotype to get heterozygous calls. My previous message was written under the (obviously wrong for short reads) assumption that your input assemblies would be phased.
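
To make the distinction concrete, here is a minimal sketch that classifies a VCF `GT` string. It assumes that a sample represented by a single unphased assembly shows up with one allele per site in the graph VCF, while a re-genotyping step reports diploid genotypes; the function name `classify_genotype` is hypothetical.

```python
def classify_genotype(gt):
    """Classify a VCF GT string as 'missing', 'haploid', 'het', or 'hom'."""
    # Treat phased ('|') and unphased ('/') separators the same way.
    alleles = gt.replace("|", "/").split("/")
    if any(allele == "." for allele in alleles):
        return "missing"
    if len(alleles) == 1:
        # A single allele cannot express heterozygosity.
        return "haploid"
    return "het" if len(set(alleles)) > 1 else "hom"


if __name__ == "__main__":
    for gt in ["1", "0/1", "1|1", "./."]:
        print(gt, classify_genotype(gt))
```

Under that assumption, a haploid entry such as `1` only says the assembly carries the allele, whereas `0/1` from a genotyper records a heterozygous call.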