ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs
Other
511 stars 112 forks source link

append reference contigs lost after the pangenome pipeline? #1132

Closed jyj5558 closed 10 months ago

jyj5558 commented 1 year ago

Hi, just a quick question.

After running the pangenome pipeline, I found that several reference contigs were lost and not represented in the result files although I used "keep_contigs" as in the tutorial. As I used contig-level non-reference assemby samples, I guess this might happen because there were no non-reference contigs to be aligned and anchored on some reference contigs, which could not generate a graph assembly or graph alignment file then and are missing now.

I just want to use this graph genome as a resource (better one than linear reference) later to map short reads onto it, so I need these missing contigs to be represented in a same file, too. Is there any way I can append the missing contigs in the graph genome's gfa or vg file? Or is it the only way to merge vcf files (after read mapping), one generated against the graph genome and the other one generated against the missing reference contigs at downstream step? Hope to have your great insight on this.

Appreciate your advice,

glennhickey commented 1 year ago

What was your command line? By default, all reference contigs (ie those in the sample specified with --reference) will be present in the final output graph, whether or not anything else aligns to them.

jyj5558 commented 1 year ago

Thanks for your fast reply, I referenced "Keep_contig" names from the fasta index file of my reference genome (attached) as you see below. There are 206 contigs in the reference file but I got 200 result vg files. lepc_genomic.fna.fai.txt

my code is:

keep_contigs=$(awk '{print $1}' /scratch/LEPC/JY-test/LEPC_ref/lepc_genomic.fna.fai)

n=64
site=south
RefInd=south_panref2

# Preprocess seqFile
mkdir -p Temp/
cp ${site}Pangenome.txt ${site}Pangenome.mc.seqfile
cactus-preprocess ./jobstore ${site}Pangenome.txt ${site}Pangenome.mc.seqfile --pangenome --workDir Temp/
echo "preprocessing done"

# Make the SV graph with minigraph in GFA format 
cactus-minigraph ./jobstore ${site}Pangenome.mc.seqfile ${site}LEPC.gfa --reference ${RefInd} --maxCores ${n} --workDir Temp/
echo "minigraph done"

# Map each input assembly back to the graph (assembly-to-graph alignments) using minigraph
cactus-graphmap ./jobstore ${site}Pangenome.mc.seqfile ${site}LEPC.gfa ${site}LEPC.paf --outputGAFDir ./${site}-mc-gaf --outputFasta ${site}LEPC.gfa.fa --reference ${RefInd} --nodeStorage 1000 --delFilter 10000000 --workDir Temp/ --maxMemory 1000G --mapCores 16 --maxCores ${n} --maxNodes 25 --logFile ${site}LEPC.paf.log
echo "graphmap done"

# Split by chromosomes (scaffolds)
cactus-graphmap-split ./jobstore ${site}Pangenome.mc.seqfile ${site}LEPC.gfa ${site}LEPC.paf --outDir ./contigs --otherContig contigOther --refContigs $(for i in $keep_contigs; do echo ${i}; done) --reference ${RefInd} --nodeStorage 1000 --workDir Temp/ --maxMemory 1000G --maxCores ${n} --maxNodes 5 --logFile ${site}LEPC.split.log
echo "split done"

# Create the Cactus base alignment and "raw" pangenome graph
cactus-align --batch ./jobstore ./contigs/chromfile.txt ./align --consCores ${n} --maxCores ${n} --maxMemory 1000G --workDir Temp/ --logFile ${site}LEPC.align.log --maxNodes 20 --nodeStorage 1000 --pangenome --maxLen 10000 --reference ${RefInd} --outVG --noLinkImports --noMoveExports --cleanWorkDir=onSuccess
echo "align done"

# Create and index the final pangenome graph
cactus-graphmap-join ./jobstore --vg ./align/*.vg --hal ./align/*.hal --outDir ./${site}-pg --outName ${site}-pg --reference ${RefInd} --vcf --gbz --gfa --chrom-vg --giraffe {clip,full} --clip 10000 --nodeStorage 1000 --maxNodes 20 --workDir Temp/ --indexCores $((n-1)) --maxCores ${n} --maxMemory 1000G --cleanWorkDir=onSuccess --disableCaching --chrom-og --draw --viz --odgi --logFile ${site}LEPC.join.log
echo "join" done

# END

THanks!

glennhickey commented 1 year ago

I'm looking at the manual and the only reference to "keep_contigs" I see is in the manual patch of the HG02080.1 HPRC yr1 sample, which surely doesn't apply to your data.

Anyway, looking at your commands, your final graph should contain all contigs from RefInd=south_panref2. You can verify that by looking at the GFA or using vg paths.

jyj5558 commented 1 year ago

Oh I might have bookmarked an older version of the manual. I referred to something like for i in $keep_contigs; do echo ${i}; echo chrX chrY chrZ; done. Ok, I will check that and come back for further questions or closing this thread. Thanks!