caporaso-lab / genome-sampler

https://caporasolab.us/genome-sampler/
BSD 3-Clause "New" or "Revised" License

add documentation section on running in parallel #64

Closed gregcaporaso closed 4 years ago

gregcaporaso commented 4 years ago

I've received questions from multiple users on how to run in parallel, so we should add a specific section to the docs on this. Here is some text copied from my replies that we can use in this section:

genome-sampler can be run in parallel to speed it up. How you do this depends on whether you're running the steps individually or through Snakemake.

If you're using Snakemake, edit the Snakefile and set the N_THREADS value to the number of threads you'd like genome-sampler to use.
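If you're not sure where N_THREADS is set, a small helper like the following can locate it before you edit. This is only a sketch: the function name show_n_threads is made up for illustration, and it assumes the Snakefile path is given as an argument (defaulting to ./Snakefile). It does not change the file; it only prints the matching lines with their line numbers.

```shell
# Hypothetical helper (not part of genome-sampler): print each line of the
# Snakefile that mentions N_THREADS, prefixed with its line number.
show_n_threads() {
    snakefile="${1:-Snakefile}"   # assumed default location
    if [ ! -f "$snakefile" ]; then
        echo "Snakefile not found: $snakefile" >&2
        return 1
    fi
    grep -n 'N_THREADS' "$snakefile"
}
```

Calling show_n_threads path/to/Snakefile shows which line to edit.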

If you're running the steps individually, you can pass the --p-n-threads option to several of the commands. For example, sample-diversity is the slowest step in the workflow, and you can run it in parallel by providing --p-n-threads:

qiime genome-sampler sample-diversity \
 --i-context-seqs filtered-context-seqs.qza \
 --p-percent-id 0.9995 \
 --o-selection diversity-selection.qza \
 --p-n-threads n

When running this command, set n to the number of processors (cores) available on a single node of your system. For example, I work on a cluster that has nodes with 28 cores, so when I submit this job I run:

qiime genome-sampler sample-diversity \
 --i-context-seqs filtered-context-seqs.qza \
 --p-percent-id 0.9995 \
 --o-selection diversity-selection.qza \
 --p-n-threads 28

This uses all of the cores on a single node of my cluster. We plan to add support for splitting workflows like this across multiple cluster nodes, but that isn't supported yet.
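If you'd rather not hard-code the thread count, one option is to detect it in the shell before calling the command. The sketch below assumes getconf _NPROCESSORS_ONLN is available (it is on most Linux and macOS systems) and falls back to 1 otherwise. The names detect_threads and N_THREADS are illustrative only. On a shared cluster the scheduler may give your job fewer cores than the node has, so treat the detected value as an upper bound.

```shell
# Hypothetical helper: report the number of online processors on this machine,
# falling back to 1 if detection fails or returns something non-numeric.
detect_threads() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    case "$n" in
        ''|*[!0-9]*) n=1 ;;
    esac
    echo "$n"
}

N_THREADS=$(detect_threads)
echo "Detected ${N_THREADS} processors"

# The value can then be passed to the command shown above, for example:
#   qiime genome-sampler sample-diversity \
#     --i-context-seqs filtered-context-seqs.qza \
#     --p-percent-id 0.9995 \
#     --o-selection diversity-selection.qza \
#     --p-n-threads "$N_THREADS"
```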