markcharder opened this issue 1 year ago
Good question. I've been spoiled by working with really high-quality assemblies while developing this pipeline, e.g. the HPRC. But when making an example mouse graph, I was surprised at just how many N's made it into the graph.
Anyway, handling these cases better is on the to-do list, and I'm very open to ideas on how best to go about it. As it stands, I'm planning on adding a filter option to cactus-graphmap-join that would remove nodes or subsegments above some threshold of N bases. My worry is that it may really cut up the graph paths, making them even more difficult to work with.
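As a rough illustration of the kind of check such a filter would run (this is not an existing Cactus option; the GFA S-line parsing and the 50% threshold below are just placeholders), a small script could scan the graph's segments and report the ones that are mostly N. It only reports rather than rewrites, since actually deleting nodes would also mean repairing the L-lines and path lines, which is exactly the fragmentation concern above:

```python
#!/usr/bin/env python3
"""Report GFA segments whose sequence is mostly N.

Illustrative sketch only -- not a Cactus feature. It flags segments
instead of removing them, because deleting S-lines would also require
fixing the L-lines and P/W-lines that reference them.
"""
import sys

MAX_N_FRACTION = 0.5  # hypothetical threshold

with open(sys.argv[1]) as gfa:
    for line in gfa:
        if not line.startswith("S\t"):
            continue  # only segment (S) lines carry sequence
        fields = line.rstrip("\n").split("\t")
        seg_id, seq = fields[1], fields[2]
        if seq == "*" or not seq:
            continue  # sequence not stored inline
        n_frac = (seq.count("N") + seq.count("n")) / len(seq)
        if n_frac >= MAX_N_FRACTION:
            print(f"{seg_id}\tlength={len(seq)}\tN_fraction={n_frac:.2f}")
```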
As a workaround to try now: maybe set the --clip option of cactus-graphmap-join to something way smaller than the default of 10kb. Otherwise, you'll have to explicitly filter the VCF yourself.
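If you do end up filtering the VCF yourself, one possible approach is to drop records whose REF or ALT alleles contain a long run of N. A minimal sketch, assuming a plain-text (uncompressed) VCF and an arbitrary cutoff of 10 consecutive N's:

```python
#!/usr/bin/env python3
"""Drop VCF records whose REF or ALT contains a long run of N.

Sketch only; the 10-N cutoff and uncompressed input are assumptions.
Usage: python filter_n_records.py in.vcf > out.vcf
"""
import re
import sys

N_RUN = re.compile(r"[Nn]{10,}")  # hypothetical cutoff

def keep(record):
    """Return True if neither REF nor any ALT allele has a long N run."""
    fields = record.split("\t")
    ref, alts = fields[3], fields[4].split(",")
    return not any(N_RUN.search(allele) for allele in [ref] + alts)

source = open(sys.argv[1]) if len(sys.argv) > 1 else sys.stdin
for line in source:
    if line.startswith("#") or keep(line):
        sys.stdout.write(line)
```

Dropping whole records is crude, but it avoids leaving half-N alleles in the VCF; a gentler variant of the same idea would keep the record and just mark it in the FILTER column.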
Hope this helps.
I have just used the Minigraph-Cactus pipeline to align ten ~1.1 Gb genomes with the commands:
Everything ran fine.
I am interested in a family of conserved genes in which I do not expect to see any major variation. However, in addition to smaller SVs and SNPs, there are some megabase-scale SVs affecting these genes in the final VCF output (which should be the filtered one, as I used '--giraffe'). Upon closer inspection, at least some of these variants contain long tracts of N's, which suggests they span regions where contigs have been joined into scaffolds based on a reference. For example, here is a section of one large indel:
I was wondering how Minigraph-Cactus handles N's in genome assemblies and whether these might be the cause of the large structural variants.
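For what it's worth, one quick way to cross-check this is to list the N-runs (scaffolding gaps) in each input assembly and compare them against the SV coordinates in the VCF. A minimal sketch, assuming uncompressed FASTA and an arbitrary 100 bp minimum run length:

```python
#!/usr/bin/env python3
"""List runs of N (putative scaffolding gaps) in a FASTA assembly.

Sketch only: assumes an uncompressed FASTA and a hypothetical minimum
gap length of 100 bp. Output is BED-like (0-based, half-open).
"""
import re
import sys

MIN_GAP = 100  # hypothetical minimum run length to report
GAP = re.compile(r"[Nn]{%d,}" % MIN_GAP)

def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    for line in open(path):
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

for name, seq in read_fasta(sys.argv[1]):
    for m in GAP.finditer(seq):
        print(f"{name}\t{m.start()}\t{m.end()}")
```

The resulting intervals can then be intersected with the SV positions from the VCF (for example with bedtools intersect) to see whether the large calls coincide with assembly gaps.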
Any advice would be greatly appreciated.
Mark