
Database training - expected time, and resuming crashed runs? #8

Closed: zkstewart closed this issue 6 years ago

zkstewart commented 7 years ago

Hello,

These may not be issues so much as questions. Firstly, I am currently training a large database of 20,713 genomes, consisting of 691 Archaea, 12,732 Bacteria, 1,048 Fungi, 273 Protozoa, and 5,969 Viruses. The combined file size is approximately 31 GB. I have been running this job on an HPC with 24 cores and access to a (more or less) unlimited-size ramdisk. As of now, it has been performing the final bcalm stage (after making the 30/50-mer files) for roughly a month, and has used 21,416 CPU hours thus far (with an actual wall-clock time of 1,246 hours). Is this normal? Since I am working with metagenomic data from a region where I expect most species will be only distantly related to any known species, I wanted to get as representative a set of known species as possible to help with prediction. Was it a mistake to try to train such a large database? And for future users, is there any way to predict how long a job should take to complete? A feature such as this would be very useful, since my study has been held up by the metagenome taxonomy analysis.

Secondly, since the HPC I am using has a maximum time limit on job submissions (1344 hours), I anticipate that this run will not complete in time. Is there some way to resume it after the job is cancelled?

Thanks,
zkstewart

dkoslicki commented 7 years ago

Hi, and thanks for the questions!

For a database this large, a long run time is to be expected (though not as long as you are experiencing). I think the issue is the following: bcalm has quite a long run time for large genomes. This is actually why I excluded certain organisms from the pre-trained databases (you can see in the source code where I excluded a few organisms). What I would suggest is removing from your input file names (the argument to the -i flag in Train.py) those genomes that are bigger than, say, 150M when bz2 compressed (or 500M when uncompressed). Bacterial genomes are typically less than 5M when bz2 compressed, and a small compressed size like that indicates the genome will have a relatively short run time in bcalm.

So removing the organisms with very large genomes (which should be relatively easy to identify as ones that aren't in your sample) will significantly speed up your run time; a filter along the lines of the sketch below can do the pruning.
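For example, here is a minimal sketch of that filtering step, assuming the -i argument is a plain-text file listing one bz2-compressed genome path per line (the file names here are just placeholders, not actual MetaPalette conventions):

```python
import os

# Keep only genomes under a size cutoff; write the pruned list for Train.py -i.
input_list = "FileNames.txt"           # hypothetical name for your -i file list
output_list = "FileNamesFiltered.txt"  # pruned list to pass to Train.py -i
max_bytes = 150 * 1024 ** 2            # ~150M bz2-compressed cutoff, as suggested above

with open(input_list) as fin, open(output_list, "w") as fout:
    for line in fin:
        path = line.strip()
        if not path:
            continue
        size = os.path.getsize(path)
        if size <= max_bytes:
            fout.write(path + "\n")
        else:
            print("Skipping %s (%.1f MB compressed)" % (path, size / 1024.0 ** 2))
```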

As for your question about whether MetaPalette will resume training after an aborted job: yes. Almost all of the time-intensive steps (k-mer counting and bcalm formation) will be skipped for files on which the step has already been performed, as long as you have not deleted any of the output in the folder given by the argument to the -o flag. The only step that must be re-done is the formation of the common kmer matrices, since a different argument to the -i flag changes the basis for those matrices.
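If you want to gauge how far the run got before resubmitting, something along these lines gives a rough progress check (the glob pattern here is only a guess; match it to the file names you actually see in your -o folder):

```python
import glob
import os

# Count per-genome bcalm outputs already present in the -o folder.
# NOTE: the "*bcalm*" pattern is an assumption, not a documented naming
# convention; adjust it to whatever Train.py actually writes for you.
output_dir = "TrainingOutput"          # hypothetical; your -o argument
file_list = "FileNamesFiltered.txt"    # the list passed to -i

with open(file_list) as f:
    genomes = [line.strip() for line in f if line.strip()]
finished = glob.glob(os.path.join(output_dir, "*bcalm*"))
print("%d bcalm outputs found for %d input genomes" % (len(finished), len(genomes)))
```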

Hope this helps, and please let me know if you run into any other problems.

Thanks,

~David

zkstewart commented 7 years ago

Hi David,

Thanks for your response. I'll have a go at removing the large genomes to see if I can get an output in a reasonable time frame. If it isn't an issue, I do have one more question, which could help with speeding up the process in an alternative way. Is it possible to combine the results of multiple classification runs into a single result file in a way that still provides accurate relative abundances of the different types of organisms? For example, I thought it might be possible to use the pre-trained databases for each of bacteria, archaea, etc., since they are more in-depth than the comparison database, and then collate these results in some way. Would this be possible by not running the classification step with the -n flag, for example?

Thanks again, zkstewart

dkoslicki commented 7 years ago

It's technically possible to combine results in this way (by not using the -n flag), but I would be hesitant to trust the results. Inferring absolute abundances is much more difficult than inferring relative abundances, and simply combining and then normalizing could introduce a fair bit of systematic error. The "right" way to do it would be to combine the training files and use the resulting (large) common kmer matrices to do it all in one go. Sorry there doesn't appear to be a quick, easy fix for that!
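To make the concern concrete, here is a toy sketch of that combine-then-normalize approach. The numbers are made up, not MetaPalette output; they just show the mechanics and where the error creeps in:

```python
# Two per-database profiles (hypothetical unnormalized estimates) are merged
# and then renormalized into a single relative-abundance profile.
bacteria_run = {"Org_A": 120.0, "Org_B": 60.0}
virus_run = {"Org_C": 20.0}

combined = dict(bacteria_run)
combined.update(virus_run)
total = sum(combined.values())
relative = {org: val / total for org, val in combined.items()}
print(relative)  # {'Org_A': 0.6, 'Org_B': 0.3, 'Org_C': 0.1}

# If either run systematically over- or under-estimates its absolute scale
# (each run is conditioned on its own database), that error propagates into
# every organism's relative abundance here, which is the concern raised above.
```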