ctmrbio / BACTpipe

BACTpipe: An assembly and annotation pipeline for bacterial genomics
https://bactpipe.readthedocs.org
MIT License

Annotation using custom curated references by prokka #35

Closed b16joski closed 6 years ago

b16joski commented 6 years ago

Allow the user to optionally specify reference sequences for prokka to use during annotation.

boulund commented 6 years ago

I think this is a useful feature, and one that existed in the previous (pre-nextflow) BACTpipe versions, right?

Can you please describe the overall process for how this should work in a bit more detail? No code or nextflow constructs needed, just a run-through of how the logic is supposed to work with all the different options. We need to present the entire logic before we can discuss how to best implement it into our current Nextflow workflow.

boulund commented 6 years ago

As far as I can tell there are at least two good ways of conditionally modifying the called command in a nextflow process:

  1. Use the approach in the official docs regarding conditional scripts. I see one definite drawback of this approach: it requires two almost identical copies of the bash code that calls prokka (one just like we already have, and one with the extra command line argument for the reference annotation). This makes maintenance of the code more difficult, and should probably be avoided. But it is a straightforward way to implement it.
  2. Use the capability of Nextflow to execute some Groovy commands prior to launching the prokka process, using a construct something like this:

    process prokka {
    input:
    // excluded for brevity
    
    output:
    // excluded for brevity
    
    script:
    prokka_reference = ""
    if (params.prokka_reference) {
        prokka_reference = "--proteins ${params.prokka_reference}"
    }
    """
    prokka \
        --force \
        --addgenes \
        (etc...) \
        ${prokka_reference} \
        $renamed_contigs
    """   
    }

    If we include a configuration parameter prokka_reference in the configuration file, and set that parameter to a default value of false, we will have an automatic way of including the --proteins ${params.prokka_reference} line in the prokka call if the user specifies it when running BACTpipe, e.g.:

    $ nextflow run ctmrbio/BACTpipe -profile ctmrnas --reads 'path/to/my/reads/*_R{1,2}.fastq.gz' --prokka_reference path/to/my_reference_proteins.fasta

Note that I haven't tested any of the code here; these are just some thoughts I wanted to share. Now that I think of it, maybe we should call the parameter something like prokka_proteins instead of prokka_reference, as that is probably more familiar to people used to running prokka on their own.
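The idea behind option 2 can also be tried outside Nextflow. The following is a minimal shell sketch, not BACTpipe code: build_prokka_cmd is a hypothetical helper that only prints a command line, and it adds --proteins only when a reference path is given, which mirrors what the Groovy if block above is meant to achieve.

```shell
# Hypothetical helper: print the prokka command line that would be run.
# $1 = path to a reference protein FASTA, or "" to use prokka's defaults
# $2 = path to the assembled contigs
build_prokka_cmd() {
    reference="$1"
    contigs="$2"
    # Start from the arguments that are always present.
    set -- prokka --force --addgenes
    # Append the optional argument only when a reference was supplied.
    if [ -n "$reference" ]; then
        set -- "$@" --proteins "$reference"
    fi
    set -- "$@" "$contigs"
    echo "$*"
}

build_prokka_cmd "" contigs.fasta
build_prokka_cmd my_reference_proteins.fasta contigs.fasta
```

Because the optional part is built once and spliced into a single command, there is only one copy of the prokka call to maintain, which is the advantage over the conditional scripts approach.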

thorellk commented 6 years ago

Nice @boulund, we just discussed this in the Skype meeting :)

b16joski commented 6 years ago

I did some testing on ctmrnas of using a custom reference protein file for annotation by prokka.

Introduction

First, I needed some test genomes in .fastq format and reference protein files in .faa format. I used H. pylori and copied the files from Uppmax to my test directory on ctmrnas:

scp josephk@milou.uppmax.uu.se:/home/josephk/joseph/test_nextflow/*.fastq .

Method

In the main pipeline code, I made some changes to the prokka process in the bactpipe.nf executable, as per Fredrik's suggestion.

process prokka {
    tag {sample_id}
    publishDir "${params.output_dir}/prokka", mode: 'copy'

    input:
    set sample_id, file(renamed_contigs) from prokka_channel

    output:
    set sample_id, file("${sample_id}_prokka") into prokka_out

    script:
    // Build the optional argument: it stays empty unless the user
    // has set params.prokka_reference.
    prokka_reference = ""
    if (params.prokka_reference) {
        prokka_reference = "--proteins ${params.prokka_reference}"
    }

    """
    prokka \
        --force \
        --evalue 1e-9 \
        --kingdom Bacteria \
        --locustag ${sample_id} \
        --outdir ${sample_id}_prokka \
        --prefix ${sample_id} \
        --strain ${sample_id} \
        ${prokka_reference} \
        $renamed_contigs
    """
}

In the configuration file, I set the prokka_reference parameter to a default value of false:

prokka_reference = false

Running the pipeline

I ran the pipeline as follows, specifying a reference file for prokka to use during annotation.

nextflow run /home/joseph.kirangwa/BACTpipev2.1/BACTpipe/bactpipe.nf -profile ctmrnas --reads "./*_R{1,2}.fastq" --prokka_reference ./*.faa
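One thing to be aware of with the command above: the unquoted ./*.faa is expanded by the shell before Nextflow sees it. If more than one .faa file is present, --prokka_reference most likely receives only the first match, and the remaining file names are passed along as stray arguments. A minimal sketch for checking this beforehand, where count_faa is a hypothetical helper and not part of BACTpipe:

```shell
# Hypothetical helper: count the .faa files in a directory.
count_faa() {
    dir="$1"
    set -- "$dir"/*.faa
    # When nothing matches, the pattern is left unexpanded,
    # so confirm that the first entry really exists.
    if [ -e "$1" ]; then
        echo "$#"
    else
        echo 0
    fi
}

# Example usage (not run here):
#   if [ "$(count_faa .)" -eq 1 ]; then
#       nextflow run bactpipe.nf --prokka_reference ./*.faa ...
#   fi
```

Passing the reference as an explicit file name avoids the question altogether.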

The pipeline executed well without errors at this point.

N E X T F L O W  ~  version 0.26.4
Launching `/home/joseph.kirangwa/BACTpipev2.1/BACTpipe/bactpipe.nf` [thirsty_euclid] - revision: 4507422ae2
============================================================
                          BACTpipe                          
                      Version 2.1b-dev                      
          Bacterial whole genome analysis pipeline          
              https://bactpipe.readthedocs.io               
============================================================
[warm up] executor > local
[61/bcafea] Submitted process > screen_for_contaminants (2_HP_HPAG1_7-8)
[73/269356] Submitted process > screen_for_contaminants (1_HP_26695)
[6f/d73709] Submitted process > bbduk (2_HP_HPAG1_7-8)
[8c/081b9d] Submitted process > bbduk (1_HP_26695)
[e6/96b0d6] Submitted process > shovill (2_HP_HPAG1_7-8)
[f6/09c737] Submitted process > fastqc (2_HP_HPAG1_7-8)
[b1/a86a76] Submitted process > fastqc (1_HP_26695)
[58/2c3f53] Submitted process > shovill (1_HP_26695)
[99/ed7f6f] Submitted process > prokka (2_HP_HPAG1_7-8)
[5d/8416e4] Submitted process > prokka (1_HP_26695)
[09/6149a5] Submitted process > multiqc
============================================================
         BACTpipe workflow completed without errors         
               Check output files in folder:                
                   BACTpipe_results_test                    
============================================================

The results were also stored in the specified output folder.

(base) [joseph.kirangwa@ctmr-nas BACTpipe_results_test]$ ls
bbduk  fastqc  mash.screen  multiqc  prokka  shovill

Remark

- What remains is to extract the Gram stain result from the respective column of the assess_mash.py output, then pass it to prokka during annotation via its --gram option, which the prokka help describes as: --gram [X] Gram: -/neg +/pos (default '')

- I should also mention that I changed the call to assess_mash_screen.py so that it is given the gram_stain.txt file, as follows: --gram "$baseDir/resources/gram_stain.txt"
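
A minimal sketch of that remaining step, under stated assumptions: the lookup file is assumed to hold tab-separated genus and Gram stain columns (the real layout of gram_stain.txt and of the assess_mash_screen.py output may differ), and lookup_gram and gram_option are hypothetical helpers. The only prokka detail relied on is the --gram option with the values neg or pos, as quoted above.

```shell
# Hypothetical helper: look up the Gram stain for a genus.
# Assumed file layout: one "genus<TAB>stain" pair per line.
lookup_gram() {
    genus="$1"
    table="$2"
    awk -F '\t' -v genus="$genus" '$1 == genus { print $2; exit }' "$table"
}

# Hypothetical helper: convert a stain label into a prokka argument.
# Prints nothing for unknown labels, so the option is simply left out.
gram_option() {
    case "$1" in
        [Pp]os*|+) printf '%s\n' "--gram pos" ;;
        [Nn]eg*|-) printf '%s\n' "--gram neg" ;;
        *)         : ;;
    esac
}

# Example usage (not run here):
#   stain=$(lookup_gram Helicobacter gram_stain.txt)
#   prokka $(gram_option "$stain") --outdir sample_prokka contigs.fasta
```

Leaving the option out for unknown genera keeps prokka's default behaviour (an empty --gram value) instead of guessing.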