Open IdoBar opened 1 year ago
Hi Ido,
Building with reads rather than assemblies is more computationally intensive, as Bifrost has to conduct an additional filtering step to remove low coverage k-mers. Also, there are more sequences to include in the graph than using assemblies, which also increases runtime. To answer your questions:
call
or query
mode, depending on whether you want the unitigs for your current population, or to check unitig presence/absence for a new population. This process only adds one sample at a time and does not merge graphs however, so I'm unsure what speed-up you would get if any.I hope this helps. Please let me know if you have any further questions.
Sam
Thanks for your reply @samhorsfield96 ,
At the moment trying to run all the samples together in call
mode requires more than 100GB and ran for over 65 hours (before reaching the memory limit and failing).
I understand that overall I won't save any time, but will splitting it into smaller jobs will reduce memory consumption so my HPC could handle them?
What is the time and memory-intensive step? Is it the graph generation or the calling? I'll try to generate the graph first and see if it finishes successfully.
Thanks, Ido
Hi Ido,
We haven't extensively tested the computational efficiency of each step, although I would predict it will be dataset dependent (more variation will make graph generation take longer with more memory, more genomes will impact calling). I'm not 100% sure on the effect of incrementally adding genomes either. If you are happy to do so, I would be interested to hear how your approach fares with regards to memory consumption.
All the best,
Sam
Hi,
I'm running
unitig-caller
in population mode to by used bypyseer
for GWAS (following the documentation). I don't have assemblies for all my samples (~250 samples), so I'm using the raw reads (2x 1-3gb gzipped paired-end fastq files for each sample) and it takes a long time to process. I was wondering a couple of things:pyseer
)?Thanks, Ido