start pipeline - Githubissues

This PR starts a snakemake pipeline to detect recent HGT events in eukaryotic genomes (see README). This PR only includes genomes so far -- in future PRs I'll make the changes necessary for the pipeline to also run on eukaryotic transcriptomes.

This PR includes rules that:

download the CDS from genomic and gff files from GenBank for eukaryotic genomes
parses each CDS name in each multifasta so that it becomes the protein ID. this rule was added because the DNA composition tool codonw truncates names at 20 or 25 characters. protein ids are unique identifiers and will allow me to look up a given CDS in a GFF file still.
generates a genus-level pseudo-pangenome by clustering the CDS sequences at 0.9 percent identity. this is the trickiest bit of the snakefile. To accomplish this task, I made the class checkpoint_accessions_to_genus. This class takes the CSV output by the checkpoint accessions_to_genus and determines which genome accession numbers belong in a given genus, essentially creating a genus:genome accession map. The class is used on the input of rule combine_cds_per_genus to determine which genome accessions get combined together into a multifasta file that will then get clustered. I wrote this in the snakefile itself, but this approach is described in depth in this blog post: http://ivory.idyll.org/blog/2021-snakemake-checkpoints.html.
After clustering, each CDS in the clustered pangenome is scanned with codonw to get all sorts of DNA composition and amino acid metrics, and with bbmap tetramerfreq.sh to get tetramernucleotide frequency.

The PR also includes all of the environment and auxiliary scripts needed to run the snakefile.

In my next PR, I'll introduce rules/scripts to parse the output of these current rules to identify candidate HGT events.

Some general thoughts: would it be feasible to have some kind of visual map of the inputs/outputs from a Snakemake workflow? Basically a graphic of the DAG? It would really help to be able to understand how each of the rules are relating to each other. I imagine it also might help external folks not as familiar with Snakemake, since these pipelines tend to be interconnected, and it can be a little tedious to search for input and output manually.

Yes! I'll include this in the readme. If you have snakemake and graphviz installed via conda:

snakemake --dag | dot -tpdf

I'll include this in the README and update it with each PR to the Snakefile

Also if possible, it would also be helpful to have an in-code description of the underlying functions of each rule. Perhaps just transferring the description from this PR to each rule would be good enough.

Will add!

Finally, for wildcards defined within the Snakefile, would it be possible to have a comment section that describes what they are? It seems like they're defined implicitly upon calling (or maybe I'm probably just missing something) which could make it a little difficult to understand their origin and contents.

Will add!

Arcadia-Science / prehgt

start pipeline #2