ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs
Other
523 stars 111 forks source link

cactus-pangenome --reference with multiple reference genomes differs for small and large pangenomes? #1172

Closed dirkjanvw closed 1 year ago

dirkjanvw commented 1 year ago

Hi, I have been trying out the Minigraph-Cactus pangenome pipeline with the provided yeast data set with subsequent mapping of short-reads to it. vg surject against any genome specified as "reference" works perfectly! However, when I use the exact same commands with a larger dataset (6 chromosome-level assemblies of ±2.5 Gbp each), I can only run vg surject against the first specified "reference". I looked at the ${prefix}.d2.gbz files for both using vg paths -MHx ${gbz} and I noticed that for the yeast pangenome all genomes are indeed available as "reference" but that for the large pangenome, only the first assembly is available as "reference".

This is how I ran the MC pangenome pipeline:

cactus="singularity exec /path/to/singularity/cactus_v2.6.7-gpu.sif cactus"
export TMPDIR=/dev/shm/cactus-tmp
[ -d "${TMPDIR}" ] && rm -rd ${TMPDIR}
mkdir -p ${TMPDIR}

${cactus}-pangenome ./js ./examples/yeastPangenome.txt --reference S288C SK1 DBVPG6044 UWOPS034614 Y12 YPS128 --vcfReference S288C --outDir yeast-pg --outName yeast-pg --vcf --giraffe --gbz --gfa --odgi --chrom-vg --chrom-og --viz --draw
${cactus}-pangenome ./js ./examples/largePangenome.txt --reference genomeA genomeB genomeC genomeD genomeE genomeF --vcfReference genomeA --outDir large-pg --outName large-pg --vcf --giraffe --gbz --gfa --odgi --chrom-vg --chrom-og --viz --draw

Btw, I masked out the names of the large pangenome because the data is not (yet) publicly available. If you have a dataset of the same size that I should try to confirm, I'm happy to run it.

This is the output of vg paths:

$ for gbz in {yeast,large}-pg/*-pg.d2.gbz; do echo; ls ${gbz}; vg paths -Mx ${gbz} | head; done

yeast-pg/yeast-pg.d2.gbz
#NAME   SENSE   SAMPLE  HAPLOTYPE   LOCUS   PHASE_BLOCK SUBRANGE
DBVPG6044#0#chrI    REFERENCE   DBVPG6044   0   chrI    NO_PHASE_BLOCK  NO_SUBRANGE
S288C#0#chrI    REFERENCE   S288C   0   chrI    NO_PHASE_BLOCK  NO_SUBRANGE
SK1#0#chrI[17147]   REFERENCE   SK1 0   chrI    NO_PHASE_BLOCK  17147
UWOPS034614#0#chrI[9943]    REFERENCE   UWOPS034614 0   chrI    NO_PHASE_BLOCK  9943
Y12#0#chrI  REFERENCE   Y12 0   chrI    NO_PHASE_BLOCK  NO_SUBRANGE
YPS128#0#chrI[13]   REFERENCE   YPS128  0   chrI    NO_PHASE_BLOCK  13
DBVPG6044#0#chrII[20796]    REFERENCE   DBVPG6044   0   chrII   NO_PHASE_BLOCK  20796
S288C#0#chrII   REFERENCE   S288C   0   chrII   NO_PHASE_BLOCK  NO_SUBRANGE
SK1#0#chrII[33678]  REFERENCE   SK1 0   chrII   NO_PHASE_BLOCK  33678

large-pg/large-pg.d2.gbz
#NAME   SENSE   SAMPLE  HAPLOTYPE   LOCUS   PHASE_BLOCK SUBRANGE
genomeA#0#chr0  REFERENCE   genomeA 0   chr0    NO_PHASE_BLOCK  NO_SUBRANGE
genomeF#0#chr1#8686 HAPLOTYPE   genomeF 0   chr1    8686    NO_SUBRANGE
genomeF#0#chr1#4691928  HAPLOTYPE   genomeF 0   chr1    4691928 NO_SUBRANGE
genomeF#0#chr1#5348629  HAPLOTYPE   genomeF 0   chr1    5348629 NO_SUBRANGE
genomeF#0#chr1#11616551 HAPLOTYPE   genomeF 0   chr1    11616551    NO_SUBRANGE
genomeF#0#chr1#17160624 HAPLOTYPE   genomeF 0   chr1    17160624    NO_SUBRANGE
genomeF#0#chr1#18862116 HAPLOTYPE   genomeF 0   chr1    18862116    NO_SUBRANGE
genomeF#0#chr1#21012291 HAPLOTYPE   genomeF 0   chr1    21012291    NO_SUBRANGE
genomeF#0#chr1#21663157 HAPLOTYPE   genomeF 0   chr1    21663157    NO_SUBRANGE

From what I understand, only paths designated as "REFERENCE" in the second column can be used for vg surject? At least, that is my experience. This is how I would run vg surject for completeness' sake:

vg paths -LRx yeast-pg.d2.gbz | awk 'BEGIN{FS = "#";} $1=="Y12"' > Y12_paths.txt
vg surject -x yeast-pg.d2.gbz -t10 -s -F Y12_paths.txt SRR800844.gam > SRR800844_to_Y12.sam

However, this would never work for the large pangenome because the other genomes specified with --reference are not included as "reference" paths in the GBZ file (only genomeA is):

$ vg paths -LRx large-pg/large-pg.d2.gbz 
genomeA#0#chr0
genomeA#0#chr1
genomeA#0#chr2
genomeA#0#chr3
genomeA#0#chr4
genomeA#0#chr5
genomeA#0#chr6
genomeA#0#chr7
genomeA#0#chr8
genomeA#0#chr9

Is this because of the large size of my pangenome? If so, is there a specific flag I should set when working with pangenomes larger than yeast to make sure my other genomes are included as "reference" paths in the resulting graph?

glennhickey commented 1 year ago

There is no size-dependent behaviour here that I can think of: if you pass in a sample via --reference it should always end up as a surjectable REFERENCE-sense path in the final output, just like you are seeing for the yeast example.

I think a few versions ago, the multiple reference support may have been a bit buggy. So if you ran your larger data with an older version of cactus and yeast with the latest version, that would definitely explain it.

You might be able to see where it's going wrong in the log. Your reference genomes should appear in hal2vg --reference parameters.

They should also get added to the GFA header with a sed command

-e 1s/.*/H    VN:Z:1.1        RS:Z:YPS128 S288C _MINIGRAPH_/"

Finally, you can take the outpout GFA file, and add the genomes to the RS tag in the header by hand if you want. After that you can regenerate your GBZ file (look for command in log) and all paths will be REFERENCE -- it's really that header that controls it.

dirkjanvw commented 1 year ago

Thanks a lot for all the suggestions!

I double checked the version and the version i used was 2.6.7-gpu for the large one.

Diving into the log using your tips, I did find something strange. First off, the hal2vg commands looked fine. I see the exact same command (including all references specified) being run on a temporary hal file 10 times (probably corresponding to my 9 chromosomes + chromosome 0?). However, the sed command you mentioned is not identical for all times it is run. For my chromosome 0 it only mentions my genomeA whereas for the other 9 chromosomes it mentions all references. Also the resulting GFA file (large-pg.gfa.gz) has this header: H VN:Z:1.1 RS:Z:genomeA, without the other references. Could it be that the pipeline takes the smallest set of references it can find for all chromosomes? By the way, I did double check all input fasta files and all files have a chromosome 0 (but I would agree that I should have left out this chromosome 0 in the first place). In the output directory large-pg/chrom-subproblems/chr0/fasta/ I can see that indeed only genomeA has a non-empty file, whereas for the other chromosomes all references have non-empty files. (Probably because those chr0 couldn't be mapped to the minigraph graph?)

In conclusion, do you think I will solve the issue by removing the chromosome 0 from all input files? And is my hypothesis correct that the pipeline takes only those references that are present in all individual chromosome alignments? Personally, I would have expected the pipeline to still include all paths even if it cannot match a chromosome for all references.

glennhickey commented 1 year ago

Yeah, I think you've figured it out (thanks!!). I am able to reproduce it here on the yeast data by manually dropping a sample for the first chromosome. It doesn't get put in the GFA header for that chromosome even if it's specified with --reference, and the final merged GFA inherits this header leading to the issue you described.

Dropping your chromosome-0 sounds like it would indeed be a work-around. As would hand-editing your GFA to fix the header, then regenerating the .gbz.

This definitely looks like a bug. I don't see why it doesn't just put every reference in every header whether or not it's used -- so I will update it to do this in the next release.

dirkjanvw commented 1 year ago

Aha, thanks for the temporary workarounds! And looking forward to the next release!