bacpop / unitig-caller

Methods to determine sequence element (unitig) presence/absence
Apache License 2.0

Long running times in population mode when using raw reads #22

Open · IdoBar opened 1 year ago

IdoBar commented 1 year ago

Hi,

I'm running unitig-caller in population mode to be used by pyseer for GWAS (following the documentation). I don't have assemblies for all of my ~250 samples, so I'm using the raw reads (2x 1-3 GB gzipped paired-end fastq files per sample), and processing takes a long time. I was wondering a couple of things:

  1. Is it possible to run each sample separately (in parallel) and later combine the graphs? Would that be more computationally efficient?
  2. The input file list doesn't contain a sample name column, and each of my samples has multiple input files (paired-end), so I don't understand at which point unitig presence is calculated/counted per sample (to serve as input for pyseer)?

Thanks, Ido

samhorsfield96 commented 1 year ago

Hi Ido,

Building with reads rather than assemblies is more computationally intensive, as Bifrost has to conduct an additional filtering step to remove low-coverage k-mers. There are also more sequences to include in the graph than when using assemblies, which further increases runtime. To answer your questions:

  1. Unitig-caller is already parallelized via Bifrost. You could alternatively build the graphs incrementally using Bifrost in 'update' mode (see https://github.com/pmelsted/bifrost), and then pass the .gfa and .bfg_colors files back to unitig-caller using either call or query mode, depending on whether you want the unitigs for your current population or want to check unitig presence/absence for a new population. However, this process only adds one sample at a time and does not merge graphs, so I'm unsure what speed-up you would get, if any.
  2. As Bifrost generates edge-induced graphs, any k-mers with a k-1 overlap are connected by an edge, so contiguity information is not taken into account. Paired-end reads from the same sample should therefore be merged into the same file, as the paired-end information is not used to generate unitigs.

I hope this helps. Please let me know if you have any further questions.

Sam

IdoBar commented 1 year ago

Thanks for your reply @samhorsfield96,

At the moment, trying to run all the samples together in call mode requires more than 100 GB of memory and ran for over 65 hours (before reaching the memory limit and failing). I understand that overall I won't save any time, but will splitting it into smaller jobs reduce memory consumption so that my HPC can handle them? Which step is time- and memory-intensive: the graph generation or the calling? I'll try to generate the graph first and see if it finishes successfully.

Thanks, Ido

samhorsfield96 commented 1 year ago

Hi Ido,

We haven't extensively tested the computational efficiency of each step, although I would predict it to be dataset-dependent (more variation will make graph generation take longer and use more memory; more genomes will affect calling). I'm also not 100% sure of the effect of incrementally adding genomes. If you are happy to do so, I would be interested to hear how your approach fares with regard to memory consumption.

All the best,

Sam