[x] add resources like threads: and resources: mem_mb to each rule. This will allow snakemake to parallelize within the limits of resources available on the cluster.
[x] add rules to run random forests with different seeds. The exact results should differ between seeds, but the biological interpretation should stay the same.
[x] remove custom gather databases and use new genbank databases.
[ ] Note that these databases are currently only available on farm, so the user would need to download them outside of the pipeline. This is unfortunate from a reproducibility standpoint; the origin, availability, and current location of the databases should be well documented in the methods.
[x] update sourmash gather --threshold-bp to 0
[x] remove rules that re-implemented --save-matches from gather
[x] update hash_genome_map rules to accommodate output from --save-matches instead of list of signatures
[x] this might still need to be updated for rule create_hash_genome_map_at_least_5_of_6_vita_vars_pangenome, but can be templated from rule create_hash_genome_map_at_least_5_of_6_vita_vars
[x] remove sourmash lca commands
[x] implement/integrate species-level summarization of gather results
[x] use genome-grist to download gather matches
[x] consider integrating charcoal before running spacegraphcats queries.
[x] update spacegraphcats environment
[x] figure out new spacegraphcats file endings and propagate throughout dependencies
[ ] add a rule to run extract-paired-reads.py on the sgc output
[x] integrate all the singlem rules, some of which were floating in other snakefiles/workflows/test dirs
[x] abundtrim: do marker genes recapitulate model success?
[x] sourmash sigs of abundtrim: are marker genes the only thing sourmash is picking up on?
[x] nbhds: do the nbhds contain all the marker genes?
[ ] update singlem of sgc to take advantage of paired end reads?
[x] remove PLASS rules
[ ] add megahit assemble, prokka annotate, cdhit cluster, salmon quantify, and corncob differential abundance. All of this is implemented in a test workflow (sandbox/test_megahit_diginorm_nocat/Snakefile) and no longer needs to live in a separate snakefile.
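
As a rough illustration of the first two items (per-rule threads: and resources: mem_mb, and one random forest run per seed), a rule could look something like the sketch below. The seed list, rule names, file paths, script name, and resource values are hypothetical placeholders and are not taken from this workflow.

```
# Hypothetical seeds; the real workflow may use different values or a different count.
SEEDS = [1, 2, 3]

rule all:
    input:
        expand("outputs/rf/seed{seed}/model.rds", seed=SEEDS)

rule random_forest:
    input:
        "outputs/hash_tables/filtered_hashes.csv"   # hypothetical input path
    output:
        "outputs/rf/seed{seed}/model.rds"
    # threads and mem_mb tell snakemake how much of the machine one job needs
    threads: 4
    resources:
        mem_mb=16000
    params:
        # passed to the script so each job is seeded differently
        seed=lambda wildcards: int(wildcards.seed)
    script:
        "scripts/random_forest.R"                   # hypothetical script
```

With declarations like these, an invocation such as `snakemake --cores 16 --resources mem_mb=64000` lets snakemake run as many seed jobs concurrently as fit within 16 cores and 64 GB of memory.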
To Do: