eblerjana / pangenie

Pangenome-based genome inference
MIT License
102 stars 10 forks source link

Feature request: support kff input files #62

Open JosephLalli opened 7 months ago

JosephLalli commented 7 months ago

Hello,

Many users of PanGenie might wish to use vg giraffe to align reads for short-variant calling, but use PanGenie for structural variant calling.

Inspired by PanGenie, the new haplotype-based workflow in vg giraffe now utilizes a kmer-based approach to generate candidate large indels to align reads to. That workflow requires kmer counting before alignment, and it requires the kmer counts be in the kff format.

Currently, PanGenie only accepts pre-calculated kmer counts in .jf format. Is it possible to add support for kmer counts in kff format? In my initial experiments, PanGenie works remarkably quickly when working on a graph generated from the output of vg haplotypes. However, the majority of the running time is devoted to kmer counting. Being able to reuse kmer counts from the vg giraffe/vg haplotypes portion of the personalized pangenome pipeline would allow PanGenie to be a lightweight add-on SV caller.

Is it possible to have PanGenie accept pre-calculated kff input files?

Thanks, Joe Lalli

eblerjana commented 7 months ago

Hi,

I need to look more into this, but one problem I see right now with the KFF format is that random access is not supported (at least according to their paper). But in order to run efficiently, PanGenie needs to be able to lookup counts of specific kmers in a fast way, which would not be possible here. With the latest version of PanGenie and HPRC/HGSVC panels, most time is spent on the genotyping part anyways (assuming the PanGenie-index preprocessing step was run once on the panel), so I'm not sure how big the effect would be on the overall runtime even if a faster lookup was possible.

Best, Jana

JosephLalli commented 7 months ago

Hi Jana,

That does sound like a problem...

I'm specifically trying to implement the workflow presented in this paper: https://www.biorxiv.org/content/10.1101/2023.12.13.571553v2.full

which is a modified version of what is presented in the haplotype sampling section of vg's wiki page: https://github.com/vgteam/vg/wiki/Haplotype-Sampling

Their use of PanGenie is better described in the supplementary material, but a brief outline of what they do:

I've been trying this approach with some of my data. When employed in this fashion (ie with only two paths in the graphical genome), PanGenie is shockingly lightweight. A representative run's memory and time usage (24 threads) is after the body of this comment.

Pangenie's part of this workflow takes 35m 43s. Pangenie reports that 24m 44s of this time is spent "coutning kmers in reads". It seems to me that if we could use the counts generated in the first step, that would be a big time/resource savings.

Alternatively, vg haplotypes could accept jellyfish counts, or I could use jellyfish to begin with and convert via jellyfish dump -> txt -> kff. I will put in a feature request to have vg haplotypes accept jellyfish counts. However, I thought I should also post a request here.

Thanks, Joe


###### Summary PanGenie-index ######
time spent reading input files: 217.93 sec
time spent counting kmers in genome (wallclock):    66.7741 sec
time spent writing Graph objects to disk:   7.99445 sec
time spent determining unique kmers:    1495.91 sec
time spent writing UniqueKmersMap to disk:  15.9927 sec
total wallclock time PanGenie-index: 426.108 sec

Max RSS after reading input files:  5.52515 GB
Max RSS after counting kmers in genome:     29.7598 GB
Max RSS after determining unique kmers:     32.1154 GB
Max RSS:    32.1154 GB
############## Summary ##############
total wallclock time:   430.953 sec
Max RSS:    32.1154 GB
#####################################

###### Summary PanGenie-genotype ######
time spent reading UniqueKmersMap from disk:    2.9527 sec
time spent counting kmers in reads (wallclock):     1484.64 sec
time spent pre-computing probabilities:     0.00241776 sec
time spent updating unique kmers:   1625.8 sec
time spent selecting paths:     2.64403 sec
time spent genotyping chromosome chr1:  21.8248
.... smaller amounts of time for other chromosomes...
time spent genotyping (total):  298.693 sec
time spent writing output VCF:  109.424 sec
total wallclock time PanGenie-genotype: 1713.26 sec

Max RSS after reading UniqueKmersMap from disk:     2.50662 GB
Max RSS after counting kmers in reads:  26.7416 GB
Max RSS after pre-computing probabilities:  26.7416 GB
Max RSS after updating unique kmers:    26.7489 GB
Max RSS after selecting paths:  26.7489 GB
Max RSS after genotyping:   26.7489 GB
Max RSS:    26.7489 GB
############## Summary ##############
total wallclock time:   1713.82 sec
Max RSS:    26.7489 GB
#####################################

(Note that in hindsight this should probably be using pangenie's one-step method to save I/O time.)

JosephLalli commented 7 months ago

Hi @eblerjana,

@jltsiren at vg suggests that simply loading everything into a hash map might be a good idea to enable random access for kmer counts stored in kff files. (See https://github.com/vgteam/vg/issues/4215)

-Joe

eblerjana commented 7 months ago

Hi Joe,

ah I see! I've recently read the preprint. I haven't tried the pipeline myself, but I agree that it sounds quite interesting to try that strategy. You are right that in such a setting where the number of haplotypes is reduced, the runtime is mostly spent in the counting part. I think this haplotype sampling pipeline would be especially useful once we have even bigger graphs with more haplotypes, because at some point PanGenie would get too slow when run on all the paths.

I agree, I think creating a hashmap should work fine here as well and thinking about it, I think with the way PanGenie is implemented right now, it should be relatively easy to integrate this. I'll try to implement this feature. Right now I'm quite busy, so it'll probably take a while (sorry!). But I'll let you know once I have something ready!

Best, Jana