Large number of genomes

larssnip commented 3 years ago

First, a suggestion: It would be very helpful to be able to turn off the screen output. We use fastANI with a single query genome against a long list (thousands) of reference genomes (--refList option) and listing thousands of filenames each time is annoying and rather useless.

But the main problem is that when we list 30 000+ files and provide the list as input using --refList, fastANI produces no output at all! There is no error message: it starts as usual, but then seems to just give up and finishes without producing output. By experimenting, I have found that 10 000 files works fine. I know several UNIX programs have a limit on how long a command line may be. Is this the reason? I run this on an HPC cluster and allocate 99 GB for this job. It doesn't look to me like a memory problem...?

cjain7 commented 3 years ago

For the first, you can easily turn off the screen output by redirecting the stderr log. Just append 2>/dev/null to the end of your FastANI command.
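As a minimal sketch of that redirection, the wrapper below passes its arguments straight to the program and discards only stderr. The names quiet_fastani and FASTANI_BIN are made up for this illustration and are not part of FastANI; the sketch assumes the binary is on the PATH as fastANI.

```shell
#!/usr/bin/env bash
# Hypothetical helper: FASTANI_BIN and quiet_fastani are illustrative names only.
# Assumes the executable is installed as "fastANI" unless FASTANI_BIN says otherwise.
FASTANI_BIN="${FASTANI_BIN:-fastANI}"

quiet_fastani() {
    # Forward every argument unchanged; send only stderr (the screen log) to /dev/null.
    # stdout and the exit status of the program are left untouched.
    "$FASTANI_BIN" "$@" 2>/dev/null
}
```

It could then be called as, for example, quiet_fastani -q query.fna --refList refs.txt -o result.txt (file names are placeholders). Redirecting to a file instead of /dev/null (2>fastani.log) would silence the screen while keeping any error messages for later inspection.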

For the second, the memory usage is proportional to the total size of the references provided. Your run could be running out of memory. To resolve this, you can run the job in batches (of, say, 5000 reference genomes) by using a bash script. You can use this script if you want. If you have a cluster, you could also parallelise these batches across multiple compute nodes. I don't think this is happening due to any UNIX limits.
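For illustration only (this is not the script referred to above), batching could be sketched roughly as follows: split the reference list into fixed-size chunks, run one query against each chunk, and concatenate the per-batch results. The function name batch_fastani, the FASTANI_BIN variable and the log file name are assumptions made for this example; -q, --refList and -o are used as the query, reference-list and output options.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: batch_fastani and FASTANI_BIN are illustrative names only.
FASTANI_BIN="${FASTANI_BIN:-fastANI}"

# Usage: batch_fastani QUERY REF_LIST BATCH_SIZE OUTPUT
batch_fastani() {
    local query="$1" ref_list="$2" batch_size="$3" out="$4"
    local tmpdir batch
    tmpdir="$(mktemp -d)" || return 1

    # Split the reference list into files of at most batch_size lines each.
    split -l "$batch_size" "$ref_list" "$tmpdir/batch_"

    : > "$out"        # start with an empty combined result file
    : > "$out.log"    # stderr of every batch is collected here

    for batch in "$tmpdir"/batch_*; do
        [ -e "$batch" ] || continue   # nothing to do for an empty list
        # Stop at the first failing batch so a crash does not go unnoticed.
        if ! "$FASTANI_BIN" -q "$query" --refList "$batch" -o "$batch.out" 2>>"$out.log"; then
            echo "batch $batch failed, see $out.log" >&2
            rm -rf "$tmpdir"
            return 1
        fi
        cat "$batch.out" >> "$out"
    done
    rm -rf "$tmpdir"
}
```

A call might look like batch_fastani query.fna refs.txt 5000 result.txt (placeholder file names). Checking the exit status of each batch is a deliberate choice here: if a run is killed for lack of memory, the failure shows up immediately instead of as a silently empty output file.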

larssnip commented 3 years ago

Thanks for this. I did the batching myself, actually, and it works. The reason I did not think of this as a memory problem is that the termination came with no "out of memory" message, which we usually do get on the cluster.

ParBLiSS / FastANI

Large number of genomes #76