allind / EukDetect

MIT License

using snakemake directly #3

Closed by nick-youngblut 3 years ago

nick-youngblut commented 3 years ago

Thanks for making this pipeline! I've used snakemake for many years and have found it to be great. While going through the pipeline code, I wondered why you wrote a Python front-end instead of using snakemake directly. For instance, your runall.py script seems mainly to determine which parts of the pipeline to run, but this can be done within snakemake by providing functions as input (especially for the all rule), so that the user selects which parts of the pipeline to run in the config.yaml file. Is there a particular advantage to the Python script front-end? I'd like to know for my future snakemake projects.

Also, are you planning to add support for snakemake profiles and resources for cluster jobs? This is a bit trickier with a Python front-end that calls snakemake, but it's doable.
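
To illustrate the config-driven target selection suggested above, here is a minimal sketch. The config keys (`samples`, `output_dir`, `run_alignment`, `run_filter`) and the output file patterns are hypothetical and are not taken from EukDetect.

```python
def select_targets(config):
    """Return the files that `rule all` should request, based on config flags.

    All keys and file patterns here are hypothetical, for illustration only.
    """
    targets = []
    for sample in config["samples"]:
        # Each optional stage contributes its final output file per sample.
        if config.get("run_alignment", True):
            targets.append(f"{config['output_dir']}/aln/{sample}.bam")
        if config.get("run_filter", True):
            targets.append(f"{config['output_dir']}/filtered/{sample}_hits.txt")
    return targets


# In a Snakefile, the function would be wired into the default target:
#
#   rule all:
#       input: select_targets(config)
```

Snakemake then works out which upstream rules are needed to produce the requested files, so switching a flag in config.yaml changes which parts of the pipeline run.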

allind commented 3 years ago

Thanks for your interest. The logic behind the front-end part of the pipeline is that it checks that all required input files exist (some of which aren't specified in the rules file) and that they meet some requirements that aren't included in the snakemake rules. Before starting snakemake, it also tells users if an analysis already exists in the output folder and gives them the option to overwrite those files. The goal is to provide informative error messages.
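
As a rough sketch of the kind of pre-flight check described above (not the actual EukDetect code; the function name and arguments are hypothetical), a front-end could verify inputs and guard existing results before handing off to snakemake:

```python
from pathlib import Path


def preflight(input_files, output_dir, overwrite=False):
    """Fail early with a readable message before the workflow starts.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    # Report every missing input at once rather than stopping at the first.
    missing = [str(f) for f in input_files if not Path(f).is_file()]
    if missing:
        raise FileNotFoundError("Missing input files: " + ", ".join(missing))

    # Refuse to clobber an existing analysis unless explicitly allowed.
    out = Path(output_dir)
    if out.is_dir() and any(out.iterdir()) and not overwrite:
        raise FileExistsError(
            f"{out} already contains results; pass overwrite=True to replace them"
        )
```

An interactive front-end would typically catch `FileExistsError` and prompt the user instead of exiting.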

If this isn't desired, the snakemake pipeline can be run directly with eukdetect.rules by specifying different targets. Each of the eukdetect modes corresponds to an existing target in the eukdetect.rules file. The correspondence between the modes and the target rules is in the code, but I can provide more information in the README.

Currently there's no direct support for cluster jobs, but there is a workaround for now. The computation-heavy part of the pipeline is mainly in the alignment step. The mode alncmd prints alignment commands to a file that can be run as the user desires on a cluster, particularly as a job array. The rest of the pipeline, which uses fewer resources, can be run after the job/job array finishes with the filter command.
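
One way to run such a commands file as a job array is sketched below. It assumes a SLURM cluster, a commands file with one shell command per line, and an array submitted with zero-based indices (for example `--array=0-9`); the function names are hypothetical and are not part of EukDetect.

```python
import os
import subprocess


def command_for_task(commands_path, task_index):
    """Return the command on the task_index-th non-empty line (zero-based)."""
    with open(commands_path) as handle:
        commands = [line.strip() for line in handle if line.strip()]
    if not 0 <= task_index < len(commands):
        raise IndexError(
            f"task index {task_index} out of range for {len(commands)} commands"
        )
    return commands[task_index]


def run_task(commands_path):
    """Run the one command assigned to this array task.

    Assumes SLURM, which sets SLURM_ARRAY_TASK_ID for each array task.
    """
    task_index = int(os.environ["SLURM_ARRAY_TASK_ID"])
    command = command_for_task(commands_path, task_index)
    # shell=True because each line of the file is a complete shell command.
    subprocess.run(command, shell=True, check=True)
```

Each array task would call `run_task` on the same commands file, and the filter step can be started once every task has finished.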

nick-youngblut commented 3 years ago

Great! Thanks for taking the time to explain what's going on!