fmalmeida / bacannot

Generic but comprehensive pipeline for prokaryotic genome annotation and interrogation with interactive reports and shiny app.
https://bacannot.readthedocs.io/en/latest/
GNU General Public License v3.0

ISLANDPATH failed when there is no CDS in Bakta annotated gbk file #62

Closed rujinlong closed 1 year ago

rujinlong commented 1 year ago

https://github.com/fmalmeida/bacannot/blob/7fb675b1688d3d2af85f7cf206ce0fc6a1e82858/modules/MGEs/islandpath.nf#L22

Line 22 of islandpath.nf checks whether the gbk file contains a CDS. However, when the annotation is produced with Bakta, the gbk file always has a line containing "CDS" in the COMMENT section, as shown in line 21 below:

  1 # test
  2 LOCUS       contig_6                5500 bp    DNA     linear   UNK 12-SEP-2022
  3 DEFINITION  test contig_6, whole genome shotgun sequence.
  4 ACCESSION   contig_6
  5 VERSION     contig_6
  6 KEYWORDS    .
  7 SOURCE      test
  8   ORGANISM  test
  9             .
 10 COMMENT     Annotated with Bakta
 11             Software: v1.5.0
 12             Database: v4.0
 13             DOI: 10.1099/mgen.0.000685
 14             URL: github.com/oschwengers/bakta
 15
 16             ##Genome Annotation Summary:##
 17             Annotation Date                :: 09/12/2022, 12:30:32
 18             Annotation Pipeline            :: Bakta
 19             Annotation Software version    ::  v1.5.0
 20             Annotation Database version    ::  v4.0
 21             CDSs                           ::     0
 22             tRNAs                          ::     2
 23             tmRNAs                         ::     0
 24             rRNAs                          ::     3
 25             ncRNAs                         ::     0
 26             regulatory ncRNAs              ::     0
 27             CRISPR Arrays                  ::     0
 28             oriCs/oriVs                    ::     0
 29             oriTs                          ::     0
 30             gaps                           ::     0
 31             pseudogenes                    ::     0
 32 FEATURES             Location/Qualifiers
 33      source          1..5500
 34                      /mol_type="genomic DNA"
 ...

Because that line always matches, the check passes and ISLANDPATH fails when there is no true CDS in the sequence.
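The false positive can be reproduced with a small sketch. The file below is a hypothetical, trimmed-down version of the listing above, and the plain `grep -q "CDS"` test is only an assumption about what the pipeline's check resembles, not a copy of islandpath.nf.

```shell
#!/usr/bin/env bash
# Hypothetical minimal gbk: Bakta-style COMMENT summary, no CDS features.
tmp_gbk="$(mktemp)"
cat > "$tmp_gbk" <<'EOF'
LOCUS       contig_6                5500 bp    DNA     linear   UNK 12-SEP-2022
COMMENT     Annotated with Bakta
            CDSs                           ::     0
            tRNAs                          ::     2
FEATURES             Location/Qualifiers
     source          1..5500
                     /mol_type="genomic DNA"
EOF

# Plain substring check (assumed to resemble the pipeline's test).
# It matches the "CDSs :: 0" summary line even though no CDS feature exists.
if grep -q "CDS" "$tmp_gbk"; then
  naive_result="CDS found"
else
  naive_result="no CDS"
fi
echo "$naive_result"
```

On this input the script prints `CDS found`, which is the situation described in the report.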

fmalmeida commented 1 year ago

Dear @rujinlong, thanks for using the pipeline and for reporting this issue in such an informative manner 😄 Nice catch!

Before I start coding, I think it is worth brainstorming the best solution first. Do you have any ideas for a nice and clean approach in such cases?

Maybe I can just change the way I check for gbk files with CDS sequences, for example with grep -q "CDS" plus a check for this Bakta comment line?

My concern with this one is that it assumes the amount of whitespace will never vary.

Another possibility is to let it run, but add something in the comment that triggers islandpath to ignore its error.

My concern with this one is that it would also ignore TRUE errors and miss relevant logs.

fmalmeida commented 1 year ago

I came up with something like this:

( sed '/CDS.*::.*0/d' test.gbk | grep -q CDS ) && echo yes || echo no

I will commit it to the branch and invite you to test it.
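As a hedged sketch of how such a check behaves, the script below wraps the one-liner above in a function and compares it with one alternative. The function names, the two sample files, and the alternative itself are assumptions for illustration; they are not taken from the pipeline. The alternative relies on the GenBank flat-file layout, in which feature keys start in column 6 (five leading spaces), so it does not depend on the wording or spacing of the Bakta summary line.

```shell
#!/usr/bin/env bash
# has_cds_sed: the one-liner from the thread, wrapped in a function.
# Drops summary lines such as "CDSs :: 0", then looks for any remaining "CDS".
has_cds_sed() {
  sed '/CDS.*::.*0/d' "$1" | grep -q CDS
}

# has_cds_feature (hypothetical alternative): match only feature-table keys,
# which the GenBank flat-file format places after five leading spaces.
has_cds_feature() {
  grep -qE '^ {5}CDS[[:space:]]' "$1"
}

# Hypothetical sample 1: Bakta-style summary, no CDS feature.
empty_gbk="$(mktemp)"
cat > "$empty_gbk" <<'EOF'
COMMENT     Annotated with Bakta
            CDSs                           ::     0
FEATURES             Location/Qualifiers
     source          1..5500
EOF

# Hypothetical sample 2: one real CDS feature.
cds_gbk="$(mktemp)"
cat > "$cds_gbk" <<'EOF'
COMMENT     Annotated with Bakta
            CDSs                           ::     1
FEATURES             Location/Qualifiers
     source          1..5500
     CDS             10..400
                     /product="hypothetical protein"
EOF

for f in "$empty_gbk" "$cds_gbk"; do
  if has_cds_sed "$f"; then echo "sed check: yes"; else echo "sed check: no"; fi
  if has_cds_feature "$f"; then echo "feature check: yes"; else echo "feature check: no"; fi
done
```

Both checks report "no" for the first file and "yes" for the second. The sed pattern also deletes summary lines whose count merely contains a zero (for example 10), which is harmless here because the real CDS feature lines remain for grep to find.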

fmalmeida commented 1 year ago

I've just committed this hotfix to a new branch. Can you give it a try by appending:

-r 62-islandpath-failed-when-there-is-no-cds-in-bakta-annotated-gbk-file -latest

To your command line?

rujinlong commented 1 year ago

Great. This works 👍

fmalmeida commented 1 year ago

Okidokie. I’ll make a hotfix release out of it 😄