samples aligned per chromosome not consistent

brianabernathy commented 1 year ago

Hello,

I'm working with the v2.6.2 docker image, aligning 10 samples with 10 chromosomes each. My output looks reasonable; however, only 3 chromosomes contain alignments/paths for all 10 samples. The remaining 7 chromosomes contain only the same 7 species. Dot plots show similarity among all samples for each respective chromosome.
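
One way to confirm which samples actually have paths in each chromosome graph is to scan the GFA path (P) and walk (W) lines. The sketch below is only an illustration: the function names are hypothetical, it assumes each chromosome's graph is available as GFA text, and it assumes P-line names follow the PanSN convention (sample#haplotype#contig). A P-line name without a "#" is treated as the sample name itself.

```python
def samples_in_gfa(lines):
    """Return the set of sample names found in GFA P and W lines."""
    samples = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        if fields[0] == "W":
            # W line layout: W <sample> <haplotype> <seq_id> <start> <end> <walk>
            samples.add(fields[1])
        elif fields[0] == "P":
            # Assumption: PanSN-style path name, sample#haplotype#contig
            samples.add(fields[1].split("#")[0])
    return samples


def missing_samples(gfa_by_chrom, expected):
    """Map each chromosome to the sorted expected samples absent from its graph.

    gfa_by_chrom: dict of chromosome name -> iterable of GFA lines
    expected: iterable of sample names that should appear in every graph
    """
    expected = set(expected)
    return {
        chrom: sorted(expected - samples_in_gfa(lines))
        for chrom, lines in gfa_by_chrom.items()
    }
```

Passing open file handles as the dictionary values works, since the functions only iterate over lines. An empty list for a chromosome means every expected sample has at least one path or walk in that graph.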

There are no errors generated. The one clue I was able to spot was that the "Species tree" lines in the stderr output are consistent with what is described above: 3 chrs with all 10 samples and 7 chrs with only 7 samples. All tree sample values = 0.025; the only difference is the chr sample counts. All fasta headers are "chrXX", etc... I did notice that the missing 3 samples are a bit more divergent than the others. They must be near a threshold and are included in some chrs and not others. Are there tree (or any other) parameters I should consider adjusting?

The input files are large and I haven't tried sub-setting the problem and checking for reproducibility. I could likely reduce the inputs to 2 chrs, 1 with 10 samples and 1 with 7, if you would like data to reproduce.

Brian

glennhickey commented 1 year ago

Is this with the pangenome pipeline? If so, you can usually get some kind of explanation out of chrom-subproblems/minigraph.split.log. It gives the chromosome assignment of each input contig, and mentions the ones it drops alongside the failing coverage thresholds. If you're seeing the dropout here, the fix is probably to run with --permissiveContigFilter, which will relax the coverage thresholds required for chromosome assignment. You might want to try experimenting with this with cactus-graphmap-split rather than cactus-pangenome for a shorter turnaround...
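
For a large run, the log can be narrowed down to the lines that mention the samples in question. The sketch below deliberately assumes nothing about the layout of minigraph.split.log: it does a plain substring match, so the function name and the sample names passed in are illustrative only, and a short name such as "chr1" would also match "chr10".

```python
def lines_mentioning(log_lines, names):
    """Collect, for each name, the log lines that contain it as a substring.

    log_lines: iterable of text lines (for example an open log file)
    names: sample or contig names to look for
    """
    names = list(names)
    hits = {name: [] for name in names}
    for line in log_lines:
        for name in names:
            if name in line:
                hits[name].append(line.rstrip("\n"))
    return hits
```

Names that end up with an empty list were never mentioned in the log, which by itself can be a useful hint about where a sample went missing.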

brianabernathy commented 1 year ago

Thanks, another option I missed! Yes, I was using cactus-pangenome. I can see the failing ambiguous query contigs in chrom-subproblems/minigraph.split.log. I'll try tweaking the --permissiveContigFilter and see if I can't pull in all sample chrs.

From a read alignment standpoint, it seems like you'd want to include all sample sequences for each chromosome in the full graph. (as singleton nodes?) This would prevent reads that should align to the unassigned/ambiguous contigs (or entire chrs in my case) from being considered unmapped.

brianabernathy commented 1 year ago

I wanted to confirm that adjusting the --permissiveContigFilter threshold did allow me to capture all sample chromosomes in the graph. The 3 more-divergent samples (that were being removed from several chrs in the original graph) are closely related to each other. I've noticed that the updated graph includes much of these 3 samples as singleton regions when, ideally, they would be aligned together. I haven't experimented with providing multiple --references. Would this attempt to align sequences to the primary reference first, followed by the secondary reference, etc...? Is this the preferred method to accomplish the secondary grouping/aligning described?

glennhickey commented 1 year ago

No, the multiple --references just lets references 2...N be tagged as REFERENCE paths in vg, which allows them to be easily used for coordinate projection (e.g. to make VCFs or BAMs on). These additional references do not get any special treatment in terms of alignment.

I think what's happening is that your samples are too diverged to be mapped effectively with minigraph. A tiny fraction maps, which is enough to get them into reference contigs (after adjusting the filter), but from there not much happens as cactus does not have enough anchors to work with.

In theory, you can play with the minigraph options to support more divergence but I've personally not had much luck doing this.

But I'm a bit wary of making very diverse pangenome graphs because, at least with vg, they can easily become too complex to usefully work with. At that point, I'd recommend switching to progressive cactus. Another idea is PGGB, whose alignment parameters, I believe, can be adjusted to support more distant species but, as mentioned above, I don't think you'd be able to use vg with the results, at least not to map with.

brianabernathy commented 1 year ago

Thanks, I suspected we were pushing the divergence limits of minigraph. I was providing soft-masked genomes, hoping this would restrict seed selection and alignment to conserved regions. I would expect genic regions to provide anchors spread across the chromosomes for all samples. I may be misunderstanding how masked sequence is used during processing.

Is masking the best approach to restrict the search space? Have you experimented with only aligning sample genes to a complete reference sequence? I'm trying to think of other options that may allow divergent samples to better align without causing the graph complexity to become unmanageable...

glennhickey commented 1 year ago

Softmasking plays a big role in progressive cactus, but is completely ignored by minigraph-cactus. All I can do at this point is to recommend the former for more diverged datasets.
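
Since soft-masking only matters for progressive cactus, it can be worth checking how much of each input is actually masked before switching. Soft-masked bases are conventionally written in lowercase, so a rough per-record fraction can be computed as in the sketch below. The function name is hypothetical and the code assumes plain, uncompressed FASTA text.

```python
def softmasked_fraction(fasta_lines):
    """Return {record_name: fraction of sequence characters that are lowercase}."""
    counts = {}  # record name -> [lowercase count, total count]
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name
            parts = line[1:].split()
            name = parts[0] if parts else ""
            counts[name] = [0, 0]
        elif name is not None:
            counts[name][0] += sum(1 for c in line if c.islower())
            counts[name][1] += len(line)
    return {
        n: (low / total if total else 0.0)
        for n, (low, total) in counts.items()
    }
```

Lowercase n characters count as masked here, and records with no sequence report 0.0.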

But speaking of gene anchors, Heng is working on gene-only graphs, which seems like a neat idea (very preliminary though): https://github.com/lh3/pangene

ComparativeGenomicsToolkit / cactus

samples aligned per chromosome not consistent #1097