Arcadia-Science / prehgt

A pipeline for lightweight screening of Eukaryotic genomes and transcriptomes for recent HGT
MIT License

Bacterial Genera #56

Open Nimshika opened 10 months ago

Nimshika commented 10 months ago

Does the current version work to detect bacterial recipients?

taylorreiter commented 10 months ago

In principle yes, there is nothing in the algorithms that prevents it from working. However, this line of code does: https://github.com/Arcadia-Science/prehgt/blob/main/Snakefile#L56

ncbi-genome-download vertebrate_mammalian,vertebrate_other,invertebrate,plant,fungi,protozoa --output-folder {params.outdir} --flat-output --genera {wildcards.genus} -F gff,cds-fasta -s $section --retries 3

You can see that the groups ncbi-genome-download pulls from do NOT include bacteria at the moment. I was concerned about collisions in genus names, which could lead to both bacteria and, e.g., fungi being downloaded for the same genus and produce very weird results.
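One way to check for such a collision before changing anything is to ask ncbi-genome-download what it would fetch for a genus in each group, using its --dry-run option. The following is only a sketch: the function name preview_genus and the overridable NGD variable are assumptions made for illustration and are not part of prehgt.

```shell
# Assumption: NGD can be overridden (e.g. NGD=echo) to exercise the loop offline.
NGD="${NGD:-ncbi-genome-download}"

# Hypothetical helper: for one genus, show what each taxonomic group
# would download, without downloading anything.
preview_genus() {
  genus="$1"
  shift
  for group in "$@"; do
    echo "== $group =="
    # --dry-run only checks which files would be fetched.
    $NGD "$group" --genera "$genus" --dry-run \
      || echo "(no match or lookup failed for $group)"
  done
}
```

Calling preview_genus with a genus name followed by bacteria fungi would then show whether that genus name resolves to assemblies in more than one group.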

If you want to run it on bacteria, I can provide instructions for what to change in your own fork. Just let me know whether you're running the snakemake or nextflow pipeline.

Nimshika commented 10 months ago

Thanks for the quick response. I am running the nextflow pipeline.

Nimshika commented 10 months ago

Hi Taylor,

I thought I'd write a more detailed explanation of our current requirement. That makes sense regarding the potential conflict in genus names. Yes, I'd like to run the pipeline on bacteria (I have already tried it out with yeast, and we are also interested in testing a couple of specific, relevant bacterial genera). I have been using the Nextflow pipeline and am happy to fork the repo and modify the code if you are willing to provide instructions. However, I'm a little unfamiliar with Nextflow and am wondering how I would then point to my forked repo to run the correct version. (It makes more sense to me in the context of snakemake, where the GitHub repo is directly specified.)

One other idea (that may be too much work on your end, so if that is the case I understand!) would be to add a flag (-domain or similar…) that lets the user specify the target as euks or bacteria. I think this might be useful to others who are interested in running it across both.

Thanks again for your help!
Nimshika

Nimshika commented 10 months ago

Hi Taylor, I am following up on this - I’ve moved to the snakemake pipeline and forked the repo - can you please provide instructions for what to change in my own fork so that I can run bacteria? Thanks!

taylorreiter commented 10 months ago

Hi @Nimshika, sorry for the delay.

One other idea (that may be too much work on your end, so if that is the case I understand!) would be to add a flag (-domain or similar…) that lets the user specify the target as euks or bacteria. I think this might be useful to others who are interested in running it across both.

This is a great idea for the long run. When I revisit prehgt development, I'll add this in!
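As a rough illustration of what such a flag could do internally, the sketch below maps a domain name to the group string passed to ncbi-genome-download. Nothing like this exists in prehgt today: the function name groups_for_domain and the domain values eukaryote and bacteria are hypothetical.

```shell
# Hypothetical helper, not part of prehgt: translate a user-facing domain
# into the taxonomic group string given to ncbi-genome-download.
groups_for_domain() {
  case "$1" in
    eukaryote)
      # The group list the pipeline currently hard-codes.
      echo "vertebrate_mammalian,vertebrate_other,invertebrate,plant,fungi,protozoa"
      ;;
    bacteria)
      echo "bacteria"
      ;;
    *)
      echo "unknown domain: $1" >&2
      return 1
      ;;
  esac
}
```

The download command could then use "$(groups_for_domain bacteria)" in place of the hard-coded list, which keeps the two sets separate and avoids mixing bacterial and eukaryotic genomes for a shared genus name.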

In the meantime, the only things you need to change are lines 48 and 56 of the Snakefile: https://github.com/Arcadia-Science/prehgt/blob/main/Snakefile#L48 https://github.com/Arcadia-Science/prehgt/blob/main/Snakefile#L56

You need to replace vertebrate_mammalian,vertebrate_other,invertebrate,plant,fungi,protozoa with bacteria

For example, line 56 currently reads:

ncbi-genome-download vertebrate_mammalian,vertebrate_other,invertebrate,plant,fungi,protozoa --output-folder {params.outdir} --flat-output --genera {wildcards.genus} -F gff,cds-fasta -s $section --retries 3

and should instead read:

ncbi-genome-download bacteria --output-folder {params.outdir} --flat-output --genera {wildcards.genus} -F gff,cds-fasta -s $section --retries 3

Similarly, if you were to switch to the nextflow workflow, you would change lines 36 and 44 of the modules/download_reference_genomes.nf file:

https://github.com/Arcadia-Science/prehgt/blob/main/modules/download_reference_genomes.nf#L36 https://github.com/Arcadia-Science/prehgt/blob/main/modules/download_reference_genomes.nf#L44

In both places, the string vertebrate_mammalian,vertebrate_other,invertebrate,plant,fungi,protozoa needs to be replaced with bacteria.
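If you would rather not edit by hand, the same replacement can be scripted. This is a minimal sketch, assuming the group string appears in the file exactly as written above; the function name swap_groups_to_bacteria is made up for illustration. It matches on the string rather than on line numbers, so it works for either the Snakefile or the nextflow module, and it keeps a .bak copy of the original.

```shell
# The eukaryotic group list currently hard-coded in the pipeline.
EUK_GROUPS="vertebrate_mammalian,vertebrate_other,invertebrate,plant,fungi,protozoa"

# Hypothetical helper: replace every occurrence of the eukaryotic group
# list in a file with "bacteria", leaving a backup at <file>.bak.
swap_groups_to_bacteria() {
  file="$1"
  # Stop early if there is nothing to replace.
  if ! grep -q "$EUK_GROUPS" "$file"; then
    echo "group string not found in $file" >&2
    return 1
  fi
  # Copy first, then rewrite from the copy, so the original is preserved.
  cp "$file" "$file.bak" && sed "s/$EUK_GROUPS/bacteria/g" "$file.bak" > "$file"
}
```

For example, swap_groups_to_bacteria Snakefile (or the path to modules/download_reference_genomes.nf) run from the root of your fork, followed by git diff to confirm that only the two expected lines changed.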

To run the nextflow pipeline, you can clone your fork locally, cd into the directory, and use nextflow run . with all of the other pipeline parameters you would usually use and it should work.

Nimshika commented 10 months ago

Hey Taylor, thanks for your reply and the helpful information on the edits.