Open revinici opened 3 months ago
Hi,
I have not encountered such large memory usage before. Are you running very large genomes? It has primarily been developed for bacteria (i.e. <10 Mb genomes). We have run it on 25K genomes on only moderate hardware with no issues, so it is likely an issue of the parallelisation duplicating large files.
Your fix would be a sensible consideration, but not one I will be able to address in the short term. If you want to have a crack at it, please branch and push the relevant changes for me to test :)
All the best, Sion
Hello, I am running it with small bacterial genomes of about 6 Mb in size. When you ran it on 25K genomes, how many threads did you use? How long did it take? I would be interested in the exact PIRATE command and version if you have it. Thanks for the response!
Hi, usually it would have been run with 12-24 cores and ~64-128 GB RAM. I have transitioned to a new HPC setup so I cannot directly compare. If you have a smaller sample size and are running bacterial genomes, I wouldn't expect the analysis to take more than a few hours (10s-100s of samples) to a few days (1000s-10,000s of samples). It will depend both on the size of the genomes and the genetic diversity present in the collection. If you are comparing more distantly related bacteria it will take longer. I would also not align the genes (-a) if you are concerned about running time; that is very costly and can be performed afterwards. You can trial it on a smaller subsample of genomes to see how it scales.
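For the subsampling trial suggested above, here is a minimal sketch of one way to build a reproducible subsample. It assumes the genomes are annotation files matching `*.gff` in a single directory; the function name `subsample_gffs`, the directory names, and the file pattern are all hypothetical and not part of PIRATE itself.

```python
import random
from pathlib import Path


def subsample_gffs(src_dir, dest_dir, n, seed=0, pattern="*.gff"):
    """Symlink a random sample of n annotation files from src_dir into dest_dir.

    Hypothetical helper: the "*.gff" pattern and directory layout are assumptions.
    Returns the sorted list of chosen file names.
    """
    src = Path(src_dir)
    dest = Path(dest_dir)
    files = sorted(src.glob(pattern))
    if n > len(files):
        raise ValueError(f"asked for {n} files but only {len(files)} match {pattern}")

    rng = random.Random(seed)  # fixed seed so the same subsample can be rebuilt
    chosen = rng.sample(files, n)

    dest.mkdir(parents=True, exist_ok=True)
    for f in chosen:
        link = dest / f.name
        # Skip names that are already present so the function can be re-run safely.
        if not (link.is_symlink() or link.exists()):
            link.symlink_to(f.resolve())
    return sorted(p.name for p in chosen)
```

Running the analysis on subsamples of increasing size (for example 100, 250, 500 genomes) and recording wall time and peak memory for each should give a rough idea of how a full run would scale.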
I noticed that a lot of memory is used when link_clusters.pl is run in parallel. Can you explain what would determine the memory usage for each execution of this script? Depending on the response, I wonder if it would be worth it to provide a different threads flag to pass to link_clusters_runner.pl to control the amount of memory used. I've seen the memory used during this stage surpass 128 GB of RAM, causing out-of-memory errors when analyzing 2000 bacterial genomes. Analyzing 1000 genomes used about 3/4 of the RAM. Note, I was using 72 threads.
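To make the idea of a separate thread limit concrete, here is a rough back-of-the-envelope sketch. It assumes that memory during this stage grows roughly linearly with the number of parallel workers, which has not been confirmed for link_clusters.pl, and that "3/4 of the RAM" means about 96 GB of a 128 GB machine. The function name `max_threads_for_budget` is hypothetical.

```python
def max_threads_for_budget(observed_gb, observed_threads, budget_gb):
    """Estimate how many parallel workers fit in a memory budget.

    Assumption: memory use scales linearly with the number of workers,
    so each worker costs observed_gb / observed_threads.
    Always returns at least 1.
    """
    if observed_gb <= 0 or observed_threads <= 0:
        raise ValueError("observed_gb and observed_threads must be positive")
    estimate = int(budget_gb * observed_threads / observed_gb)
    return max(1, estimate)


# 1000 genomes: ~96 GB (assumed 3/4 of 128 GB) observed with 72 threads.
# How many workers would fit if this stage were capped at 64 GB?
threads = max_threads_for_budget(observed_gb=96, observed_threads=72, budget_gb=64)
print(threads)
```

The per-worker cost also appears to grow with the number of genomes (2000 genomes exceeded 128 GB at the same thread count), so an estimate taken from a smaller run would likely be optimistic for a larger one.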