fmalmeida / bacannot

Generic but comprehensive pipeline for prokaryotic genome annotation and interrogation with interactive reports and shiny app.
https://bacannot.readthedocs.io/en/latest/
GNU General Public License v3.0
96 stars 9 forks source link

Refactor channel identifiers and process names #45

Closed abhi18av closed 2 years ago

abhi18av commented 2 years ago

This PR builds upon the discussion done in #38 .

I'm not sure how to test locally, so I've initiated this PR to track all changes.

Unless I have ignored something crucial, as of now the PR implements the refactor of identifiers used in the modules/workflows.

Happy to accomodate any change request :)

fmalmeida commented 2 years ago

Hi! Thanks for the the efforts.

I have checked the changes, and it seems that everything will run ok because it just changes the cases and nothing else, however a quick test is required to check if no typo is present.

When skimming through it, I found two typos:

  1. In line 15, the file "prokka.nf" is uppercase, but the filename has not changed
  2. In line 57 there is a typo in the module name

The only way I can think of testing it is running an example with a small bacterial genome and with a nanopore subset with fast5 to try the methylation calling process.

As the second one using raw reads and fast5 can be a little demanding I can make these tests myself directly using the branch from your fork after the typos have been fixed and before merging it.

😄

fmalmeida commented 2 years ago

Hi @abhi18av, thank you for accomodating the changes and all the help.

I will perform a rapid test to check if nextflow doesn't find another typo, but everything should run smoothly. As soon as the test finishes, I'll merge it in the develop branch.

Talking about the develop. Which I believe it is comming very near to trigger a new release. We could later discuss about when (which issues to be in it) to create a new one.

fmalmeida commented 2 years ago

When running the test it complained about a flye_ch channel:

$ nextflow run \
       abhi18av/bacannot -r abhinav/refactor-channel-identifiers \
       -profile test --threads 10 --output ABHINAV

Launching `abhi18av/bacannot` [astonishing_albattani] - revision: 0f58558832 [abhinav/refactor-channel-identifiers]
=================================================================
 Container-based, fmalmeida/bacannot, Genome Annotation Pipeline 
=================================================================
Input genomes  : https://github.com/fmalmeida/bacannot/raw/master/test_data/samplesheet.yaml
Output dir     : ABHINAV
Threads        : 10
Blast % ID - Virulence Genes: 90
Blast query coverage - Virulence Genes: 80
Blast % ID - AMR Genes: 90
Blast query coverage - AMR Genes: 80
Blast % ID - ICEs or Phages: 65
Blast query coverage - ICEs or Phages: 65
Blast % ID - Plasmids: 90
Blast query coverage - Plasmids: 60
Pipeline Release: abhinav/refactor-channel-identifiers
Current home   : /home/falmeida
Current user   : falmeida
Current path   : /work1/falmeida/bacannot_tests
Configuration file: /home/falmeida/.nextflow/assets/abhi18av/bacannot/nextflow.config
==============================================================
No such variable: flye_ch

 -- Check script '/home/falmeida/.nextflow/assets/abhi18av/bacannot/./workflows/bacannot.nf' at line: 132 or see '.nextflow.log' file for more details

The problem stated is in line 132 where you've changed it to flye_ch and unicycler_ch:

PROKKA(parsed_inputs.annotation_ch.mix(flye_ch.out[1], unicycler_ch.out[1]))

However, in this step, prokka module expects the outputs from assemblers modules, so it should be something like:

PROKKA(parsed_inputs.annotation_ch.mix(FLYE.out[1], UNICYCLER.out[1]))
abhi18av commented 2 years ago

Thanks for the catch again @fmalmeida https://github.com/fmalmeida/bacannot/pull/45#issuecomment-998998287 , I've addressed it in https://github.com/fmalmeida/bacannot/pull/45/commits/03cde4724a3be38e357dd0010302927d1382df94

fmalmeida commented 2 years ago

Hi @abhi18av, thanks for the change. I've tried it again and it now complained the following:

Missing process or function with name 'iceberg'

 -- Check script '/home/falmeida/.nextflow/assets/abhi18av/bacannot/main.nf' at line: 135 or see '.nextflow.log' file for more details

The problem seems in line 209:

// ICEs search
      if (params.skip_iceberg_search == false) {
        // ICEberg db
        iceberg(PROKKA.out[4], PROKKA.out[3])
        iceberg_output_ch = ICEBERG.out[1]
        iceberg_output_2_ch = ICEBERG.out[2]
      } else {
        iceberg_output_ch = Channel.empty()
        iceberg_output_2_ch = Channel.empty()
      }
abhi18av commented 2 years ago

Ah, okay. I've now fixed it with https://github.com/fmalmeida/bacannot/pull/45/commits/6b77030f608c3577be086eb18a8bda4539eba771

(Feliz Natal! 🎄🎅 )

fmalmeida commented 2 years ago

Feliz Natal! 🌲

fmalmeida commented 2 years ago

It almost worked, but it know complained about this line:

      /*
          Eighth step -- Merge all annotations with the same Prefix value in a single Channel
      */
      annotations_files_ch = PROKKA.out[3].join(PROKKA.out[1])
                                       .join(MLST.out[0])
                                       .join(BARRNAP.out[0])
                                       .join(COMPUTE_GC.out[0])
                                       .join(kofamscan_output_ch, remainder: true)
                                       .join(vfdb_output_ch,      remainder: true)
                                       .join(victors_output_ch,   remainder: true)
                                       .join(amrfinder_output_ch, remainder: true)
                                       .join(resfinder_gff_ch,    remainder: true)
                                       .join(rgi_output_ch,       remainder: true)
                                       .join(iceberg_output_ch,   remainder: true)
                                       .join(phast_output_ch,     remainder: true)
                                       .join(phigaro_output_2_ch, remainder: true)
                                       .join(FIND_GIS.out[0],  remainder: true)

      // Contatenation of annotations in a single GFF file
      MERGE_ANNOTATIONS(ANNOTATIONS_FILES.join(DIGIS.out[1],     remainder: true))

The annotation files channel is not ANNOTATIONS_FILES, but annotations_files_ch. I believe it could even be changed to annotation_files_ch.

After that I believe it would work.

abhi18av commented 2 years ago

Hey @fmalmeida , fixed that one now and keeping my 🤞 😄

As an aside, to speed up such code reviews Github has a couple features

  1. The first is called Suggestions - which helps in adding the exact code change which you feel is the cause for these problems.

  2. The other one is gh the Github CLI, which allows you to contribute directly to this branch (even if it's from a fork, I think) with command like gh pr checkout 45

Apart from this, happy to address any other issues.

abhi18av commented 2 years ago

Ah, there's still another one you found. Apologies for this Felipe! 😅

I'd really appreciate if there's a light-weight testing data I could use on my end - hopefully this would avoid these issues in future

fmalmeida commented 2 years ago

Hi @abhi18av,

No worries. I still not checked if it finished properly. I will check it again after 06/Jan.

Then, I can also think about a small dataset to share with you. The dataset just can't be too small in order to properly test the majority of processes.

For now, there are two 30X bacterial seq reads available with -profile test. I know it is not too small so it may take ~40 min to finish. But at least it enables the check-up of the NF code itself.

After coming back from vacations I will put more efforts on find a more appropriate testing dataset.

Furthermore,

Happy new year, and happy celebrations. All the best and thank you for helping with the code 🥳😁

fmalmeida commented 2 years ago

The tests here have finished successfully, so I will now merge this PR.

I have created a new issue #46 to handle this "problem" of a smaller dataset for quicker tests.

😄