ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs
Other
481 stars 106 forks source link

WGS bigger than phased reference genome on ODGI Viz plot. (How to use phased and unphased WGS for pangenome with Minigraph-cactus? #1369

Open Donandrade opened 2 months ago

Donandrade commented 2 months ago

Dear all,

I'm here for the first time in the community and have been trying for some time to obtain a pangenome to use as a reference for mapping short reads using vg giraffe and subsequent variant calling usingvg call (I'm considering using vg surject+ Deepvariant). Below, I will outline my questions regarding the main data used, the results obtained (Figure 1), and the command executed to obtain my pangenome using Minigraph-cactus:

  1. Data Used: I'm working with genomes of a plant species. I'm using data from two phased genomes (one diploid and one tetraploid) and 11 WGS from other individuals of the species (Figure 1).

  2. Command Executed: I executed the command below to run Minigraph-cactus and obtain the pangenome for the plant in question: cactus-pangenome jobStorePath14 seqFile --outDir pg14 --outName pg14 --reference REF --vcf --giraffe --gfa --gbz --viz --mgCores 30 --mgMemory 250Gi --mapCores 30 --consCores 30 --consMemory 250Gi --indexCores 30 --indexMemory 250Gi

As a reference (REF), I used one of the haplotypes of the tetraploid genome (Haplotype1 - see Figure 1 below) from "Phased Genome X" (Figure 1 below). The other haplotypes of "Phased Genome X" were used as samples to be incorporated into the pangenome.

  1. Questions:

3.1. Why are there large regions (yellow background on Figure 1) in both the WGS (1 to 11) and the haplotypes of "Phased Genome X" that do not align with the reference? I expected at least the haplotypes to align well with the reference (Haplotype1).

3.2. If the approach I used is correct (using one of the haplotypes as reference and the others as samples), is there any way to keep only the region aligned with the reference (blue background on Figure 1)?

3.4. Do I actually have a problem? Or is the profile found in Figure 1 common in pangenomes where phased genomes are used together with WGS?

3.5. How can this 'issue' I am reporting (Unaligned regions of the samples on the reference) affect the quality of the pangenome and subsequent mapping of short reads and variant calling using vg giraffe and vc call?

Page 19 Figure 1: 1D representation of the pangenome obtained from 11 WGS (WGS 1, WGS 2, WGS 3, WGS 4, WGS 5, WGS 6, WGS 7, WGS 8, WGS 9, WGS 10, WGS 11) + phased polyploid genome (Phased Genome X) + phased diploid genome (Phased Genome Y). The blue background highlights the region of the WGSs and haplotypes that map against the reference, while the yellow background corresponds to regions of the WGSs and haplotypes that did NOT map. The bolded genome (Haplotype1) corresponds to the haplotype of Phased Genome X used as a reference in Minigraph-cactus.

glennhickey commented 2 months ago

The reference has to be a single haplotype. So if you have a diploid sample you want to use as a reference, you have to choose one haplotype for your reference and, effectively, treat the other haplotype(s) as a different sample. This has been annoying me and, I'm sure, others, but I don't really have a solution in mind. It doesn't affect the alignmenet at all -- just the path names in your final graph. So I think what you've done for this part is fine.

I think I am also confused by your figure. Is it for a single chromosome? Is the reference really half the length of your other input genomes? It is possible to have a large insertion relative to the reference, but normally (due to how minigraph works) this would not be at the end of the sequence, nor would it be so large.

Donandrade commented 2 months ago

@glennhickey Thank you very much for your answer ad attention. You are really making a brilliant work here, with Minigraph-cactus.

Regarding your questions:

  1. I also find myself confused by your figure. Is it for a single chromosome?

Yes, in this case, I ran the cactus-pangenome command with the option --viz. From what I understood, with this option, a 1-dimensional visualization using odgi will be generated for each chromosome in separate figures. The figure I showed here before (Figure 1) is for one of the chromosomes, but the profile you seen is similar for all other plots, including all the rest of haplotypes on the same genome. Therefore, I understanding that this is not due to a large insertion.

I'm not displaying all results here, but in all the tests I conducted, my reference (single haplotype) ends up being smaller than the other genomes. Moreover, it seems that this difference increases as I increase the number of genomes (Figure 2), as seen in the example below with 10 more genomes (28 in total, including the reference). Page 21 Figure 2. 1-dimensional visualization using odgi with more genomes.

  1. "Is the reference really half the length of your other input genomes?"

It seems so, but this size increases with the addition of new genomes.

glennhickey commented 2 months ago

Very mysterious. I suspect it's a problem with how cactus-pangenome is making the plots. I definitely recommend checking your sequence lengths -- you can use odgi paths -lL. Also, if you can share a chromosome odgi file, I'll take a closer look here.

Donandrade commented 1 week ago

Hi @glennhickey !

Thank you for your suggestion. Sorry for the delay. I tried rerunning the MC using the --chrom-og option to get the .og files, but I failed twice, encountering an error. In the end, I tried to use the GFA file to convert it to an .og file, but I had a core dump error. I'm trying to solve these errors, but if the GFA file could be useful, I'm sending it here.

chr1.gfa.gz

Thank you for your time and help.