This PR is the first of many steps toward merging the nextflow rehgt repo with this one, so as to reduce duplication of code, especially of R scripts. I set out to only touch R scripts; alas, that was not sufficient to get a viable snakemake workflow, so I had to make a few more changes.
small changes I caught when converting to nextflow
changing from scripts to command line executables
To accommodate the last R script and to make it the same as in the nextflow workflow, I had to change the download script to match the one in nextflow as well. It's not a very snakemakey rule, in that it does multiple operations in one place, but doing it this way keeps the two workflows consistent and dramatically simplifies the wildcard logic. It is also more complete: previously I was only downloading from GenBank and required the user to provide genome accessions. Now it downloads from both RefSeq and GenBank (which, fun fact, are not mirrors of each other), and the user only needs to provide the genus/genera name(s) of interest.
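For reviewers unfamiliar with why pulling from both databases matters, here is a minimal Python sketch of one way two accession lists could be reconciled. It is not the code in this PR: the function name `merge_assemblies` is hypothetical, and the accessions are used only as placeholders. It relies on the NCBI convention that `GCF_` marks a RefSeq assembly and `GCA_` a GenBank assembly, and assumes that a paired assembly shares the numeric part of its accession across the two.

```python
def merge_assemblies(refseq, genbank):
    """Merge two accession lists, keeping the RefSeq copy when both exist.

    Hypothetical helper for illustration; the actual download rule may
    combine the two sources differently.
    """
    def key(accession):
        # "GCF_000005845.2" -> "000005845" (drop the prefix and the version)
        return accession.split("_", 1)[1].split(".", 1)[0]

    # Start from GenBank, then let RefSeq entries overwrite paired ones.
    merged = {key(acc): acc for acc in genbank}
    merged.update({key(acc): acc for acc in refseq})
    return sorted(merged.values())


if __name__ == "__main__":
    # Placeholder accessions: one paired assembly, one GenBank-only assembly.
    print(merge_assemblies(
        ["GCF_000005845.2"],
        ["GCA_000005845.2", "GCA_000001111.1"],
    ))
```

The GenBank-only assembly survives the merge, which is the case that a single-database download would miss.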
I've only done a dry run so far, but that all passes 🎉. I will run this on a test data set soon, but plan to do that in the next PR if that's ok.
PR roadmap:
update R scripts and sync snakefile (this PR)
copy over required nextflow files (still need to figure out how I want to do organization...a Snakemake folder & a nextflow folder? everything at root?)
(maybe) refactor snakemake file to take command line arguments for databases
(maybe) include an optional download snakefile/nextflow workflow to download all of the required databases (two files for clustered nr blast database, two eggnog database files, an hmm profile file)
update workflow:
improve labels for HGT candidates
check and make sure donor diversity index is correct