Closed taylorreiter closed 1 year ago
Some general thoughts: would it be feasible to have some kind of visual map of the inputs/outputs from a Snakemake workflow? Basically a graphic of the DAG? It would really help to be able to understand how each of the rules are relating to each other. I imagine it also might help external folks not as familiar with Snakemake, since these pipelines tend to be interconnected, and it can be a little tedious to search for input and output manually.
Yes! I'll include this in the readme. If you have snakemake and graphviz installed via conda:
snakemake --dag | dot -tpdf
I'll include this in the README and update it with each PR to the Snakefile
Also if possible, it would also be helpful to have an in-code description of the underlying functions of each rule. Perhaps just transferring the description from this PR to each rule would be good enough.
Will add!
Finally, for wildcards defined within the Snakefile, would it be possible to have a comment section that describes what they are? It seems like they're defined implicitly upon calling (or maybe I'm probably just missing something) which could make it a little difficult to understand their origin and contents.
Will add!
This PR starts a snakemake pipeline to detect recent HGT events in eukaryotic genomes (see README). This PR only includes genomes so far -- in future PRs I'll make the changes necessary for the pipeline to also run on eukaryotic transcriptomes.
This PR includes rules that:
checkpoint_accessions_to_genus
. This class takes the CSV output by thecheckpoint accessions_to_genus
and determines which genome accession numbers belong in a given genus, essentially creating a genus:genome accession map. The class is used on the input ofrule combine_cds_per_genus
to determine which genome accessions get combined together into a multifasta file that will then get clustered. I wrote this in the snakefile itself, but this approach is described in depth in this blog post: http://ivory.idyll.org/blog/2021-snakemake-checkpoints.html.tetramerfreq.sh
to get tetramernucleotide frequency.The PR also includes all of the environment and auxiliary scripts needed to run the snakefile.
In my next PR, I'll introduce rules/scripts to parse the output of these current rules to identify candidate HGT events.