NBChub / bgcflow

Snakemake workflow for the analysis of biosynthetic gene clusters across large collections of genomes (pangenomes)
https://github.com/NBChub/bgcflow/wiki
MIT License
35 stars 9 forks source link

Prokka: Rnammer does not found rRNA #44

Closed matinnuhamunada closed 2 years ago

matinnuhamunada commented 2 years ago

Hi @OmkarSaMo, take a look at this two runs:

Rnammer called:

[18:27:12] This is prokka 1.14.6
[18:27:12] Written by Torsten Seemann <torsten.seemann@gmail.com>
[18:27:12] Homepage is https://github.com/tseemann/prokka
[18:27:12] Local time is Wed Dec 15 18:27:12 2021
[18:27:12] You are matin
[18:27:12] Operating system is linux
[18:27:12] You have BioPerl 1.007002
[18:27:12] System has 16 cores.
[18:27:12] Will use maximum of 8 cores.
[18:27:12] Annotating as >>> Bacteria <<<
[18:27:12] Generating locus_tag from 'data/interim/fasta/P8-2B-3.1.fna' contents.
[18:27:12] Setting --locustag NDIKIJJE from MD5 7d24233e949b09efcc4f3510c54e9cf0
[18:27:12] Re-using existing --outdir data/interim/prokka/P8-2B-3.1
[18:27:12] Using filename prefix: P8-2B-3.1.XXX
[18:27:12] Setting HMMER_NCPU=1
[18:27:12] Writing log to: data/interim/prokka/P8-2B-3.1/P8-2B-3.1.log
[18:27:12] Command: /data/a/matinnu/projects/bgcflow/.snakemake/conda/32b7024e71617b25f8fa2ac04b92044e/bin/prokka --outdir data/interim/prokka/P8-2B-3.1 --force --proteins resources/prokka_db/reference.gbff --prefix P8-2B-3.1 --genus Streptomyces --species sp. --strain P8-2B-3 --cdsrnaolap --cpus 8 --rnammer --increment 10 --evalue 1e-05 data/interim/fasta/P8-2B-3.1.fna
[18:27:12] Appending to PATH: /data/a/matinnu/projects/bgcflow/.snakemake/conda/32b7024e71617b25f8fa2ac04b92044e/bin
[18:27:12] Looking for 'aragorn' - found /data/a/matinnu/projects/bgcflow/.snakemake/conda/32b7024e71617b25f8fa2ac04b92044e/bin/aragorn
[18:27:12] Determined aragorn version is 001002 from 'ARAGORN v1.2.38 Dean Laslett'
[18:27:12] Looking for 'barrnap' - found /data/a/matinnu/projects/bgcflow/.snakemake/conda/32b7024e71617b25f8fa2ac04b92044e/bin/barrnap
[18:27:12] Determined barrnap version is 000009 from 'barrnap 0.9'
[18:27:12] Looking for 'blastp' - found /data/a/matinnu/projects/bgcflow/.snakemake/conda/32b7024e71617b25f8fa2ac04b92044e/bin/blastp
[18:27:12] Determined blastp version is 002012 from 'blastp: 2.12.0+'
[18:27:12] Looking for 'cmpress' - found /data/a/matinnu/projects/bgcflow/.snakemake/conda/32b7024e71617b25f8fa2ac04b92044e/bin/cmpress
[18:27:12] Determined cmpress version is 001001 from '# INFERNAL 1.1.4 (Dec 2020)'
[18:27:12] Looking for 'cmscan' - found /data/a/matinnu/projects/bgcflow/.snakemake/conda/32b7024e71617b25f8fa2ac04b92044e/bin/cmscan
[18:27:12] Determined cmscan version is 001001 from '# INFERNAL 1.1.4 (Dec 2020)'
[18:27:12] Looking for 'egrep' - found /bin/egrep
[18:27:12] Looking for 'find' - found /usr/bin/find
[18:27:12] Looking for 'grep' - found /bin/grep
[18:27:12] Looking for 'hmmpress' - found /data/a/matinnu/projects/bgcflow/.snakemake/conda/32b7024e71617b25f8fa2ac04b92044e/bin/hmmpress
[18:27:12] Determined hmmpress version is 003003 from '# HMMER 3.3.2 (Nov 2020); http://hmmer.org/'
[18:27:12] Looking for 'hmmscan' - found /data/a/matinnu/projects/bgcflow/.snakemake/conda/32b7024e71617b25f8fa2ac04b92044e/bin/hmmscan
[18:27:12] Determined hmmscan version is 003003 from '# HMMER 3.3.2 (Nov 2020); http://hmmer.org/'
[18:27:12] Looking for 'java' - found /data/a/matinnu/projects/bgcflow/.snakemake/conda/32b7024e71617b25f8fa2ac04b92044e/bin/java
[18:27:12] Looking for 'makeblastdb' - found /data/a/matinnu/projects/bgcflow/.snakemake/conda/32b7024e71617b25f8fa2ac04b92044e/bin/makeblastdb
[18:27:13] Determined makeblastdb version is 002012 from 'makeblastdb: 2.12.0+'
[18:27:13] Looking for 'minced' - found /data/a/matinnu/projects/bgcflow/.snakemake/conda/32b7024e71617b25f8fa2ac04b92044e/bin/minced
[18:27:13] Determined minced version is 004002 from 'minced 0.4.2'
[18:27:13] Looking for 'parallel' - found /data/a/matinnu/projects/bgcflow/.snakemake/conda/32b7024e71617b25f8fa2ac04b92044e/bin/parallel
[18:27:13] Determined parallel version is 20211022 from 'GNU parallel 20211022'
[18:27:13] Looking for 'prodigal' - found /data/a/matinnu/projects/bgcflow/.snakemake/conda/32b7024e71617b25f8fa2ac04b92044e/bin/prodigal
[18:27:13] Determined prodigal version is 002006 from 'Prodigal V2.6.3: February, 2016'
[18:27:13] Looking for 'prokka-genbank_to_fasta_db' - found /data/a/matinnu/projects/bgcflow/.snakemake/conda/32b7024e71617b25f8fa2ac04b92044e/bin/prokka-genbank_to_fasta_db
[18:27:13] Looking for 'rnammer' - found /usr/bin/rnammer
[18:27:13] Determined rnammer version is 001002 from 'This rnammer 1.2, running from '/home/omkar/Projects/packages/RNAmmer''
[18:27:13] Looking for 'sed' - found /bin/sed
[18:27:13] Looking for 'tbl2asn' - found /data/a/matinnu/projects/bgcflow/.snakemake/conda/32b7024e71617b25f8fa2ac04b92044e/bin/tbl2asn
[18:27:13] Determined tbl2asn version is 025007 from 'tbl2asn 25.7   arguments:'
[18:27:13] Will use RNAmmer instead of Barrnap for rRNA prediction
[18:27:13] Using genetic code table 11.
[18:27:13] Loading and checking input file: data/interim/fasta/P8-2B-3.1.fna
[18:27:14] Wrote 1 contigs totalling 7099240 bp.
--------------------------------------------------------------
[18:27:22] Predicting Ribosomal RNAs
[18:27:22] Running RNAmmer
[18:27:22] Running: rnammer -S bac -multi -xml data\/interim\/prokka\/P8\-2B\-3\.1\/rnammer\.xml data\/interim\/prokka\/P8\-2B\-3\.1\/P8\-2B\-3\.1\.fna
[18:27:25] Deleting unwanted file: data/interim/prokka/P8-2B-3.1/rnammer.xml
[18:27:25] Found 0 rRNAs

RNAmmer missing, Barrnap takeover:

[21:59:24] This is prokka 1.14.6
[21:59:24] Written by Torsten Seemann <torsten.seemann@gmail.com>
[21:59:24] Homepage is https://github.com/tseemann/prokka
[21:59:24] Local time is Sun Jul  4 21:59:24 2021
[21:59:24] You are matinnu
[21:59:24] Operating system is linux
[21:59:24] You have BioPerl 1.007002
[21:59:24] System has 16 cores.
[21:59:24] Will use maximum of 12 cores.
[21:59:24] Annotating as >>> Bacteria <<<
[21:59:24] Generating locus_tag from '/data/matinnu/zhiyan/genomes/P8-2B-3/prokkainput.fna' contents.
[21:59:25] Setting --locustag LHPKCPNC from MD5 5194c97c4a1d302e80c6591eef2c3862
[21:59:25] Creating new output folder: /data/matinnu/zhiyan/genomes/P8-2B-3/P8-2B-3_prokka_actinoannotPFAM
[21:59:25] Running: mkdir -p \/data\/matinnu\/zhiyan\/genomes\/P8\-2B\-3\/P8\-2B\-3_prokka_actinoannotPFAM
[21:59:25] Using filename prefix: P8-2B-3.XXX
[21:59:25] Setting HMMER_NCPU=1
[21:59:25] Writing log to: /data/matinnu/zhiyan/genomes/P8-2B-3/P8-2B-3_prokka_actinoannotPFAM/P8-2B-3.log
[21:59:25] Command: /home/WIN.DTU.DK/matinnu/phd/projects/snakemake-prototype/workflow/notebooks/.snakemake/conda/0e315c616d79d8e6df0cee47bc2d4a09/bin/prokka --outdir /data/matinnu/zhiyan/genomes/P8-2B-3/P8-2B-3_prokka_actinoannotPFAM --proteins ../resources/Actinos_6species.gbff --prefix P8-2B-3 --genus Streptomyces --species sp._NRRL_B-3253_0.9921 --strain P8-2B-3 --cdsrnaolap --cpus 12 --rnammer --increment 10 --evalue 1e-05 /data/matinnu/zhiyan/genomes/P8-2B-3/prokkainput.fna
[21:59:25] Appending to PATH: /home/WIN.DTU.DK/matinnu/phd/projects/snakemake-prototype/workflow/notebooks/.snakemake/conda/0e315c616d79d8e6df0cee47bc2d4a09/bin
[21:59:25] Looking for 'aragorn' - found /home/WIN.DTU.DK/matinnu/phd/projects/snakemake-prototype/workflow/notebooks/.snakemake/conda/0e315c616d79d8e6df0cee47bc2d4a09/bin/aragorn
[21:59:25] Determined aragorn version is 001002 from 'ARAGORN v1.2.38 Dean Laslett'
[21:59:25] Looking for 'barrnap' - found /home/WIN.DTU.DK/matinnu/phd/projects/snakemake-prototype/workflow/notebooks/.snakemake/conda/0e315c616d79d8e6df0cee47bc2d4a09/bin/barrnap
[21:59:25] Determined barrnap version is 000009 from 'barrnap 0.9'
[21:59:25] Looking for 'blastp' - found /home/WIN.DTU.DK/matinnu/phd/projects/snakemake-prototype/workflow/notebooks/.snakemake/conda/0e315c616d79d8e6df0cee47bc2d4a09/bin/blastp
[21:59:27] Determined blastp version is 002011 from 'blastp: 2.11.0+'
[21:59:27] Looking for 'cmpress' - found /home/WIN.DTU.DK/matinnu/phd/projects/snakemake-prototype/workflow/notebooks/.snakemake/conda/0e315c616d79d8e6df0cee47bc2d4a09/bin/cmpress
[21:59:27] Determined cmpress version is 001001 from '# INFERNAL 1.1.4 (Dec 2020)'
[21:59:27] Looking for 'cmscan' - found /home/WIN.DTU.DK/matinnu/phd/projects/snakemake-prototype/workflow/notebooks/.snakemake/conda/0e315c616d79d8e6df0cee47bc2d4a09/bin/cmscan
[21:59:27] Determined cmscan version is 001001 from '# INFERNAL 1.1.4 (Dec 2020)'
[21:59:27] Looking for 'egrep' - found /bin/egrep
[21:59:27] Looking for 'find' - found /usr/bin/find
[21:59:27] Looking for 'grep' - found /bin/grep
[21:59:27] Looking for 'hmmpress' - found /home/WIN.DTU.DK/matinnu/phd/projects/snakemake-prototype/workflow/notebooks/.snakemake/conda/0e315c616d79d8e6df0cee47bc2d4a09/bin/hmmpress
[21:59:27] Determined hmmpress version is 003003 from '# HMMER 3.3.2 (Nov 2020); http://hmmer.org/'
[21:59:27] Looking for 'hmmscan' - found /home/WIN.DTU.DK/matinnu/phd/projects/snakemake-prototype/workflow/notebooks/.snakemake/conda/0e315c616d79d8e6df0cee47bc2d4a09/bin/hmmscan
[21:59:27] Determined hmmscan version is 003003 from '# HMMER 3.3.2 (Nov 2020); http://hmmer.org/'
[21:59:27] Looking for 'java' - found /home/WIN.DTU.DK/matinnu/phd/projects/snakemake-prototype/workflow/notebooks/.snakemake/conda/0e315c616d79d8e6df0cee47bc2d4a09/bin/java
[21:59:27] Looking for 'makeblastdb' - found /home/WIN.DTU.DK/matinnu/phd/projects/snakemake-prototype/workflow/notebooks/.snakemake/conda/0e315c616d79d8e6df0cee47bc2d4a09/bin/makeblastdb
[21:59:28] Determined makeblastdb version is 002011 from 'makeblastdb: 2.11.0+'
[21:59:28] Looking for 'minced' - found /home/WIN.DTU.DK/matinnu/phd/projects/snakemake-prototype/workflow/notebooks/.snakemake/conda/0e315c616d79d8e6df0cee47bc2d4a09/bin/minced
[21:59:28] Determined minced version is 004002 from 'minced 0.4.2'
[21:59:28] Looking for 'parallel' - found /home/WIN.DTU.DK/matinnu/phd/projects/snakemake-prototype/workflow/notebooks/.snakemake/conda/0e315c616d79d8e6df0cee47bc2d4a09/bin/parallel
[21:59:28] Determined parallel version is 20210422 from 'GNU parallel 20210422'
[21:59:28] Looking for 'prodigal' - found /home/WIN.DTU.DK/matinnu/phd/projects/snakemake-prototype/workflow/notebooks/.snakemake/conda/0e315c616d79d8e6df0cee47bc2d4a09/bin/prodigal
[21:59:28] Determined prodigal version is 002006 from 'Prodigal V2.6.3: February, 2016'
[21:59:28] Looking for 'prokka-genbank_to_fasta_db' - found /home/WIN.DTU.DK/matinnu/phd/projects/snakemake-prototype/workflow/notebooks/.snakemake/conda/0e315c616d79d8e6df0cee47bc2d4a09/bin/prokka-genbank_to_fasta_db
[21:59:28] Looking for 'sed' - found /bin/sed
[21:59:28] Looking for 'tbl2asn' - found /home/WIN.DTU.DK/matinnu/phd/projects/snakemake-prototype/workflow/notebooks/.snakemake/conda/0e315c616d79d8e6df0cee47bc2d4a09/bin/tbl2asn
[21:59:28] Determined tbl2asn version is 025007 from 'tbl2asn 25.7   arguments:'
[21:59:28] Using genetic code table 11.
[21:59:28] Loading and checking input file: /data/matinnu/zhiyan/genomes/P8-2B-3/prokkainput.fna
[21:59:28] Wrote 1 contigs totalling 7099240 bp.
------------------------------------------------------------------
[21:59:35] Predicting Ribosomal RNAs
[21:59:35] Running Barrnap with 12 threads
[21:59:36] 1 P8-2B-3_2 1043221 5S ribosomal RNA
[21:59:36] 2 P8-2B-3_2 1043432 23S ribosomal RNA
[21:59:36] 3 P8-2B-3_2 1046873 16S ribosomal RNA
[21:59:36] 4 P8-2B-3_2 1522087 5S ribosomal RNA
[21:59:36] 5 P8-2B-3_2 1522296 23S ribosomal RNA
[21:59:36] 6 P8-2B-3_2 1525727 16S ribosomal RNA
[21:59:36] 7 P8-2B-3_2 2462338 5S ribosomal RNA
[21:59:36] 8 P8-2B-3_2 2462549 23S ribosomal RNA
[21:59:36] 9 P8-2B-3_2 2465979 16S ribosomal RNA
[21:59:36] 10 P8-2B-3_2 3393049 5S ribosomal RNA
[21:59:36] 11 P8-2B-3_2 3393260 23S ribosomal RNA
[21:59:36] 12 P8-2B-3_2 3396699 16S ribosomal RNA
[21:59:36] 13 P8-2B-3_2 4264823 16S ribosomal RNA
[21:59:36] 14 P8-2B-3_2 4266672 23S ribosomal RNA
[21:59:36] 15 P8-2B-3_2 4269889 5S ribosomal RNA
[21:59:36] 16 P8-2B-3_2 5942064 16S ribosomal RNA
[21:59:36] 17 P8-2B-3_2 5943902 23S ribosomal RNA
[21:59:36] 18 P8-2B-3_2 5947158 5S ribosomal RNA
[21:59:36] 19 P8-2B-3_2 6395943 16S ribosomal RNA
[21:59:36] 20 P8-2B-3_2 6397783 23S ribosomal RNA
[21:59:36] 21 P8-2B-3_2 6400999 5S ribosomal RNA
[21:59:36] Found 21 rRNAs

Should we just use barrnap?

These are the two differences params used: latest with rnammer installed --force --proteins resources/prokka_db/reference.gbff --prefix P8-2B-3.1 --genus Streptomyces --species sp. --strain P8-2B-3 --cdsrnaolap --cpus 8 --rnammer --increment 10 --evalue 1e-05 data/interim/fasta/P8-2B-3.1.fna

old run missing rnammer dependencies --proteins ../resources/Actinos_6species.gbff --prefix P8-2B-3 --genus Streptomyces --species sp._NRRL_B-3253_0.9921 --strain P8-2B-3 --cdsrnaolap --cpus 12 --rnammer --increment 10 --evalue 1e-05 /data/matinnu/zhiyan/genomes/P8-2B-3/prokkainput.fna

matinnuhamunada commented 2 years ago

Found the issue here: https://github.com/tseemann/prokka/issues/360

I think you need to install rnammer yourself and add it to your PATH. This is probably because, although rnammer is free, the developers want you to agree with their license term before allowing you to use it. Therefore it cannot simply be bundled with prokka.

You can get rnammer here

But beware, there are a few nuisances when installing (rnammer has not been updated in a while and some of the dependencies changed). You have to make sure to have hmmer2 installed and the path it's binaries set in the rnammer script. It will NOT work with hmmer3 (it will just yield empty results with the new versions).

Also you likely have to delete all instances of "--cpu=1" from the core-rnammer script. Details on these problems and on how to solve them are given here (rnammer is still my favorite rRNA-gene finder though. Really hope they'll release an updated version of it sometime)

matinnuhamunada commented 2 years ago

Prokka will use barnapp in 918ef2f

matinnuhamunada commented 2 years ago

As RNAmmer have academic license, there is no way to fully automate the installation in Snakemake. Instead, I'll add it to the resources, and create a symlink to the prokka environment bin Path.

Write myself a note to make RNAmmer work here: https://matinnuhamunada.github.io/posts/2022/01/rnammer

matinnuhamunada commented 2 years ago

As RNAmmer have academic license, there is no way to fully automate the installation in Snakemake. Instead, I'll add it to the resources, and create a symlink to the prokka environment bin Path.

Write myself a note to make RNAmmer work here: https://matinnuhamunada.github.io/posts/2022/01/rnammer

To achieve this, first add a parameter for rnammerin the config.yaml:

#### RULE CONFIGURATION ####
# rules: set value to TRUE if you want to run the analysis or FALSE if you don't
rules:
  bigscape: TRUE
  mlst: TRUE
  refseq_masher: TRUE
  seqfu: TRUE
  eggnog: FALSE
  rnammer: TRUE

To make sure the DAG works, set an output of rule rnammer_setup that will be generated in the common.smk:

##### Customizable Analysis #####
def get_final_output():
    """
    Generate final output for rule all given a TRUE value in config["rules"]
    """
    # dictionary of rules and its output files
    rule_dict = {"mlst" : expand("data/interim/mlst/{strains}_ST.csv", strains = STRAINS),
                "eggnog" : expand("data/interim/eggnog/{strains}/", strains = STRAINS),
                "refseq_masher" : expand("data/interim/refseq_masher/{strains}_masher.csv", strains = STRAINS),
                "automlst_wrapper" : "data/interim/automlst_wrapper/raxmlpart.txt.treefile",
                "roary" : expand("data/interim/roary/{name}", name=PROJECT_IDS),
                "bigscape" : expand("data/interim/bigscape/{name}_antismash_{version}/index.html", version=dependency_version["antismash"], name=PROJECT_IDS),
                "seqfu" : "data/processed/tables/df_seqfu_stats.csv",
                "rnammer": "resources/rnammer_test.txt" 
                }

    # get keys from config
    opt_rules = config["rules"].keys()

    # if values are TRUE add output files to rule all
    final_output = [rule_dict[r] for r in opt_rules if config["rules"][r]]

    return final_output

The path to the pre-installed RNAmmer should be added in the config.yaml:

#### RESOURCES CONFIGURATION ####
# resources : the location of the resources to run the rule. The default location is at "resources/{resource_name}".
resources_path: 
  antismash_db: /data/a/matinnu/data/bgcflow_resources/antismash_db
  eggnog_db: /data/a/matinnu/data/bgcflow_resources/eggnog_db
  BiG-SCAPE: /data/a/matinnu/data/bgcflow_resources/BiG-SCAPE
  RNAmmer: /data/a/matinnu/data/bgcflow_resources/rnammer-1.2

Then, make a conditional in prokka.smk:

if config["rules"]["rnammer"] == True:
    prokka_params_rna = "--rnammer"
    rule rnammer_setup:
        output: 
            "resources/rnammer_test.txt" 
        priority: 50
        conda:
            "../envs/prokka.yaml"
        log: "workflow/report/logs/rnammer_setup.log"
        shell:
            """
            ln -s $PWD/resources/RNAmmer/rnammer $CONDA_PREFIX/bin/rnammer 2>> {log}
            rnammer -S bac -m lsu,ssu,tsu -gff - example/ecoli.fsa >> {output}
            """
else:
    prokka_params_rna = ""
    pass

The priority was set to make sure it runs first. The conditional also generated a variable for prokka parameters: --rnammer or an empty string, which then passed to rule prokka params - rna_detection:

rule prokka:
    input: 
        fna = "data/interim/fasta/{strains}.fna",
        org_info = "data/interim/prokka/{strains}/organism_info.txt",
        refgbff = expand("resources/prokka_db/reference_{name}.gbff", name=PROJECT_IDS)
    output:
        gff = "data/interim/prokka/{strains}/{strains}.gff",
        faa = "data/interim/prokka/{strains}/{strains}.faa",
        gbk = "data/interim/prokka/{strains}/{strains}.gbk",
    conda: 
        "../envs/prokka.yaml"
    log: "workflow/report/logs/{strains}/prokka_run.log"
    params:
        increment = 10, 
        evalue = "1e-05",
        rna_detection = prokka_params_rna,
        refgbff = lambda wildcards: get_prokka_refdb(wildcards, DF_SAMPLES)
    threads: 8
    shell:
        """
        prokka --outdir data/interim/prokka/{wildcards.strains} --force {params.refgbff} --prefix {wildcards.strains} --genus "`cut -d "," -f 1 {input.org_info}`" --species "`cut -d "," -f 2 {input.org_info}`" --strain "`cut -d "," -f 3 {input.org_info}`" --cdsrnaolap --cpus {threads} {params.rna_detection} --increment {params.increment} --evalue {params.evalue} {input.fna}
        cat data/interim/prokka/{wildcards.strains}/{wildcards.strains}.log > {log}
        """
matinnuhamunada commented 2 years ago

Little fix on rnammer: Can't open example/ecoli.fsa: No such file or directory at /home/matinnu/bgcflow/.snakemake/conda/53988b3ff79af022ea1ba61b2461b84f/bin/rnammer line 104.

OmkarSaMo commented 2 years ago

Error in rnammer setup

Here's the contents of error log file for rnammer_setup.log

ln: failed to create symbolic link '/home/bgcflow/anaconda3/envs/snakemake/bin/rnammer': File exists

OmkarSaMo commented 2 years ago

Error in rnammer setup

Here's the contents of error log file for rnammer_setup.log

ln: failed to create symbolic link '/home/bgcflow/anaconda3/envs/snakemake/bin/rnammer': File exists

Please ignore above error. I did not read the documentation properly to understand that this feature needs to be manually installed and has licensing issues. As described here

Write myself a note to make RNAmmer work here: https://matinnuhamunada.github.io/posts/2022/01/rnammer