Handling of 'N's in input sequences

I have just used cactus minigraph to align 10 ~ 1.1 Gb genomes with the commands:

cactus-minigraph \
    ./jobstore \
    ../genome_seq_file.txt \
    ${prefix}.sv.gfa \
    --reference $reference_id \
    --workDir $TMPDIR 1> cactus-minigraph_log.txt 2>&1

cactus-graphmap \
    ./jobstore \
    ../genome_seq_file.txt \
    ${prefix}.sv.gfa \
    ${prefix}.paf \
    --outputFasta ${prefix}.sv.gfa.fa \
    --reference $reference_id \
    --workDir $TMPDIR 1> cactus-graphmap_log.txt 2>&1

cactus-graphmap-split \
    ./jobstore \
    ../genome_seq_file.txt \
    ${prefix}.sv.gfa \
    ${prefix}.paf \
    --outDir ${prefix}_chroms \
    --reference $reference_id \
    --workDir $TMPDIR 1> cactus-graphmap-split_log.txt 2>&1

cactus-align-batch \
    ./jobstore \
    ${prefix}_chroms/chromfile.txt \
    ${prefix}_chroms/chrom-alignments \
    --alignOptions "--pangenome --reference $reference_id --outVG" \
    --workDir $TMPDIR --maxCores 8 --maxMemory 40G 1> cactus-align-batch_log.txt 2>&1

cactus-graphmap-join \
    ./jobstore \
    --vg ${prefix}_chroms/chrom-alignments/*.vg \
    --hal ${prefix}_chroms/chrom-alignments/*.hal \
    --outDir ${prefix} \
    --outName ${prefix} \
    --reference $reference_id \
    --vcf \
    --giraffe \
    --workDir $TMPDIR 1> cactus-graphmap-join_log.txt 2>&1

Everything ran fine.

I am interested in a family of conserved genes that I do not expect to see any major variation in. However, in addition to smaller SVs and SNPs, there are some megabase scale SVs affecting these genes in the final VCF output (which should be the filtered one as I used '--giraffe'). Upon closer inspection, at least some of these variants have long tracts of 'N's in them, which indicates they span regions where contigs have been joined based on a reference. For example, here is a section of one large InDel:

Lee.Gm01_0      13375471        >398851>399156  TNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCCCAA

I was wondering how cactus minigraph handles 'N's in genome assemblies and whether these might be the cause of the large structural variants.

Any advice would be greatly appreciated.

Mark

Good question. I've been spoiled to be working with really high-quality assemblies while developing this pipeline, ex the hprc. But when making an example mouse graph, I was surprised at just how many N's made into the graph.

Anyway, handling these cases better is on the to-do list, and I'm very open to ideas of how to best go about it. As it stands, I'm planning on applying a filter option to cactus-graphmap-join that'd remove nodes or subsegments with some threshold of N bases. By worry is that it may really cut up the graph paths making them even more difficult to work with.

As workarounds to try now: maybe setting --clip xxxx in cactus-graphmap-join to something way smaller than the default of 10kb. Otherwise, You'll have to explicitly filter the VCF yourself.

Hope this helps.

ComparativeGenomicsToolkit / cactus

Handling of 'N's in input sequences #859