ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Questions/feature request: region specific clipping/filtering, pool-seq #925

Closed swomics closed 1 year ago

swomics commented 1 year ago

Hi,

Working with the cactus/vg programs has been really insightful and has produced some very interesting results so far. I'm currently genotyping some short-read samples based on known SVs present in some long-read genomes. The region I am interested in has lots of structural variation, which I think lends itself well to this pangenome framework. As my datasets have increased in size and overlapping complexity at the region of interest, I'm coming up against the limits of the data/programs. I would like to optimize the genotyping of this complex region as much as possible, and I have three questions:

  1. I understand from the wiki that filtering/clipping is important. Unfortunately, the dataset I have requires me to genotype known rare SVs present in a specific region. Is there a way to specify the level of filtering/clipping regionally when constructing graphs for giraffe? For example, preserving rare variants on a specific reference chromosome/window.
  2. Is it possible that a haplotype graph can decrease performance compared to an unphased graph? In an earlier analysis, I had some high-quality heterozygous genotype calls for an ~800bp deletion, with good depth for both alleles. In a subsequent run with a larger set of reference genotypes, the quality of these calls seemed to decrease significantly: they are now called as homozygous deletions, with much lower coverage. I should point out that this region of the graph became more complex, with many structural variants in close proximity.
  3. Is it possible to apply pool-seq data to the vg pipeline, particularly for genotype calling? I recall an earlier issue mentioning that polyploidy support would be introduced, but I'm not sure whether this has been implemented.

glennhickey commented 1 year ago

These are all excellent questions.

  1. We're working on this in vg. The idea will be to keep the full graph and then, when mapping with giraffe, dynamically clip down to the subgraph that best fits the reads. This way it can ignore common variants that aren't in the sample while keeping rare variants that are. As for targeted clipping, there is a bit of an interface in vg clip where you can pass in a BED file, but you'd need to do it manually outside of cactus-graphmap-join.
  2. Unfortunately yes, which is why we've kind of been stuck using the various allele-frequency filters. We're optimistic the logic described above will help work around this issue. In the meantime, PanGenie seems to be more robust to these issues since it doesn't do any mapping (it still requires some filtering, but not as much).
  3. This does ring a bell, but I don't think there's been much progress on it, unfortunately.
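
For the BED-based route in point 1, here is a rough sketch of how the regions could be prepared. It rests on assumptions not confirmed in this thread: that vg clip applies clipping only inside the BED intervals given with -b, that -d sets a minimum path-depth threshold, and that the clipped graph is written to standard output. Please check vg clip --help before relying on any of it. The path name REF#0#chr1, the coordinates, and the file names are hypothetical placeholders. The idea is to protect a window by clipping everywhere on the reference path except that window, so the BED file holds the complement of the protected region.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open coordinates, as in BED


def complement(path_length: int, protected: List[Interval]) -> List[Interval]:
    """Return the parts of [0, path_length) not covered by any protected interval."""
    result: List[Interval] = []
    cursor = 0
    for start, end in sorted(protected):
        # Clamp each interval to the bounds of the path.
        start = min(max(start, 0), path_length)
        end = min(max(end, 0), path_length)
        if start > cursor:
            result.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < path_length:
        result.append((cursor, path_length))
    return result


def bed_lines(path_name: str, intervals: List[Interval]) -> List[str]:
    """Format intervals as tab-separated BED records on one reference path."""
    return [f"{path_name}\t{start}\t{end}" for start, end in intervals]


def vg_clip_command(graph: str, bed: str, min_depth: int) -> List[str]:
    """Build (but do not run) a vg clip call; the -d and -b flags are assumptions."""
    return ["vg", "clip", "-d", str(min_depth), "-b", bed, graph]


# Hypothetical example: protect a 100 kb window on a 1 Mb reference path.
regions_to_clip = complement(1_000_000, [(400_000, 500_000)])
records = bed_lines("REF#0#chr1", regions_to_clip)
command = vg_clip_command("full.vg", "clip_regions.bed", 2)
```

Writing the records to clip_regions.bed and running the command, with its output redirected to a new graph file, would then be a manual step outside cactus-graphmap-join, and the resulting graph would still need to be indexed for giraffe.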

swomics commented 1 year ago

Thank you for the detailed reply!