bacpop / unitig-caller

Methods to determine sequence element (unitig) presence/absence
Apache License 2.0

Memory issues #10

Closed martinastoycheva closed 2 years ago

martinastoycheva commented 2 years ago

Hello,

I am having trouble allocating memory to the program. I am using the following command:

unitig-caller --call \
--threads 10 \
--pyseer \
--rtab \
--refs refs.txt \
--reads reads.txt

I have tried running this command with 10 CPUs and 300 GB of memory. It runs for 15 h and is then killed due to insufficient memory allocation. The dataset is not large, so this seems odd to me. It's only 95 bacterial genomes, each ~3.5Gb long. I can see that the program is partly written in Java and I was wondering if that could be the issue? I have access to an HPC and increasing the memory is not an issue, but I am wondering if you have some practical advice as to how many threads is optimal to use. Also, the default of 31 bp for a k-mer seems a bit small to me, and I thought increasing it might save on memory, but increasing it generates an error saying it cannot be more than 31.

Cheers, Martina

johnlees commented 2 years ago

Could I ask for a little more information here:

None of the program is in Java though!

@samhorsfield96 is k=31 the maximum for bifrost?

martinastoycheva commented 2 years ago

Hello,

============================ Job utilisation efficiency

Job ID: 17331934
Cluster: viking
User/Group: mms565/clusterusers
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 20
CPU Utilized: 6-11:13:17
CPU Efficiency: 77.82% of 8-07:27:00 core-walltime
Job Wall-clock time: 09:58:21
Memory Utilized: 198.10 GB
Memory Efficiency: 99.05% of 200.00 GB
Requested wall clock time: 2-00:00:00
Actual wall clock time: 09:58:21
Wall clock time efficiency: 20.8%
Job queued time: 03:17:59


- Does it work on a subset of the data?
I will try this and report back, but the dataset is not large.

Cheers, Martina

samhorsfield96 commented 2 years ago

Hi Martina,

I'm sorry to hear you're running into issues.

The maximum k-mer size allowed by default in Bifrost, the de Bruijn graph builder used to generate the unitigs, is 31 base pairs. This is the largest k-mer whose hash value can fit in a 64-bit integer. The limit can be raised by compiling Bifrost with a larger maximum k-mer size if you wish (https://github.com/pmelsted/bifrost#:~:text=the%20Troubleshooting%20section.-,Large%20k%2Dmers,-The%20default%20maximum). However, we have found 31 base pairs to work well with most bacterial genomes, and increasing k may instead reduce the number of matching unitigs across each genome.
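
As a rough illustration of the 64-bit limit mentioned above, the sketch below packs a k-mer into an integer at 2 bits per base. The 2-bit encoding and the names `ENCODING` and `pack_kmer` are assumptions made for illustration; this is not Bifrost's actual representation or hashing code. It only shows that a 31-mer occupies at most 62 bits and so fits in a single 64-bit word.

```python
# Illustrative sketch only: a common 2-bit DNA encoding, NOT Bifrost's code.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed mapping, 2 bits per base


def pack_kmer(kmer: str) -> int:
    """Pack a DNA k-mer into an integer, 2 bits per base, first base highest."""
    value = 0
    for base in kmer.upper():
        value = (value << 2) | ENCODING[base]
    return value


# The all-T 31-mer is the largest value a 31-mer can take under this encoding.
largest_31mer = pack_kmer("T" * 31)
print(largest_31mer.bit_length())  # 62
print(largest_31mer < 2**64)       # True
```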

With regard to the memory issue, I would suggest placing the paths to all of your assemblies in the refs.txt file and not using --reads reads.txt. That option is meant for a list of raw read files, rather than assembled genomes.
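
A minimal sketch of that setup, assuming refs.txt is a plain list with one assembly path per line and that the assemblies sit in a single directory with common FASTA extensions. The helper names `write_refs_list` and `build_call_command` and the directory name `assemblies` are hypothetical; the flags are the ones from the command earlier in this thread, minus --reads.

```python
from pathlib import Path

# Assumed FASTA extensions; compressed files (e.g. .fasta.gz) are not matched.
FASTA_SUFFIXES = (".fa", ".fasta", ".fna")


def write_refs_list(assembly_dir, output="refs.txt", suffixes=FASTA_SUFFIXES):
    """Write the absolute path of each assembly in assembly_dir, one per line."""
    paths = sorted(
        p.resolve()
        for p in Path(assembly_dir).iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )
    Path(output).write_text("".join(f"{p}\n" for p in paths))
    return paths


def build_call_command(refs_file="refs.txt", threads=10):
    """Build the call command from this thread, with assemblies only (no --reads)."""
    return [
        "unitig-caller", "--call",
        "--threads", str(threads),
        "--pyseer", "--rtab",
        "--refs", str(refs_file),
    ]


# Example usage ("assemblies" is a hypothetical directory name):
#   write_refs_list("assemblies", "refs.txt")
#   print(" ".join(build_call_command("refs.txt", threads=10)))
```

The command is returned as an argument list so it can be passed straight to `subprocess.run` or printed for a job script.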

All the best,

Sam

martinastoycheva commented 2 years ago

Hello Sam,

I had made an error when describing what the refs.txt and reads.txt files contain. I have edited my answer now. However, did I understand correctly that I should be using the software either with assemblies or raw reads and not with both of them together?

Cheers, Martina

samhorsfield96 commented 2 years ago

Hi Martina,

That's right, you don't need to supply both the assemblies and reads for each isolate; you can supply either one or the other. As the assembly process will have removed a lot of erroneous k-mers and increased contiguity, I would always advise using the assemblies if you have them. I suspect that you are running out of memory because you are using both the assemblies and reads in this case.

Hopefully this sorts your issue, but please let us know if it persists.

Sam

martinastoycheva commented 2 years ago

Hello Sam,

Yes, using only the assemblies solved the issue! Thanks very much for your help!

Cheers, Martina