fmalmeida / bacannot

Generic but comprehensive pipeline for prokaryotic genome annotation and interrogation with interactive reports and shiny app.
https://bacannot.readthedocs.io/en/latest/
GNU General Public License v3.0

Remote origin did not advertise Ref for branch refs/heads/116-integron_finder_2gff-terminated-with-an-error #118

Closed JavariaAshraf closed 4 months ago

JavariaAshraf commented 4 months ago

Hi Fmalmeida, I am getting the following error when I try to resume my pipeline analysis:

Pulling fmalmeida/bacannot ...
Remote origin did not advertise Ref for branch refs/heads/116-integron_finder_2gff-terminated-with-an-error. This Ref may not exist in the remote or may be hidden by permission settings.

The previous error was in circos:

` CIRCOS ERROR

    cwd:
    conf

    command: /opt/conda/bin/circos

You have asked to draw [213] ideograms, but the maximum is currently set at
[200]. To increase this number change max_ideograms in etc/housekeeping.conf.
Keep in mind that drawing that many ideograms may create an image that is too
busy and uninterpretable.`

I tried to fix it by changing 200 to 213 in the housekeeping.conf file, but the error above is not letting the program resume. Kindly help.

fmalmeida commented 4 months ago

Hi @JavariaAshraf, the error occurs because this branch no longer exists. I merged its code into a patch release last week, when we finished the other issue.

Try pointing the run to the new version instead of the missing branch, like:

nextflow run fmalmeida/bacannot -r v3.3.2 …
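
A revision passed with -r has to exist on the remote as a branch or tag. One way to check this before launching is `git ls-remote`. The sketch below is only an illustration: the helper name `ref_exists` is made up here and is not part of bacannot or Nextflow.

```shell
# ref_exists REPO REF: succeed if REPO advertises a branch or tag named REF.
# REPO may be a URL or a local path. Hypothetical helper, for illustration only.
ref_exists() {
    local repo="$1" ref="$2"
    # --exit-code makes git return a non-zero status when no matching ref is found.
    git ls-remote --exit-code "$repo" "refs/heads/$ref" "refs/tags/$ref" >/dev/null 2>&1
}

# Example (needs network access):
# ref_exists https://github.com/fmalmeida/bacannot v3.3.2 && echo "revision found"
```

If the check fails for a branch name, the branch was most likely deleted or merged, which matches the "did not advertise Ref" message above.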

Cheers.

fmalmeida commented 4 months ago

One more thing, @JavariaAshraf

However, even after solving the problem of the non-existent branch, I believe it will still fail, because you cannot modify the file and run it again. When resuming, nextflow will create a new working directory for the job, and your modifications will be ignored.

Thus, because this is the very last module, and a circos plot with so many points is not meaningful anyway, I suggest you create the following file, called circos.config, to make the pipeline ignore the error in this module, and run the pipeline with it.

Contents of the circos.config file:

process {
    withName: 'CIRCOS' {
        errorStrategy = 'ignore'
    }
}

And run like this:

nextflow run fmalmeida/bacannot -r v3.3.2 -c circos.config <rest of your params> -resume
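
As a side check, it can help to know in advance how many sequences an assembly holds. The sketch below assumes that the [213] in the Circos message corresponds to the number of contigs in the genome, which the thread does not state explicitly. `count_contigs` and `genome.fasta` are hypothetical names.

```shell
# count_contigs FASTA: print the number of sequences (header lines) in FASTA.
# Hypothetical helper, for illustration only.
count_contigs() {
    grep -c '^>' "$1"
}

# Example: warn when an assembly exceeds the limit quoted in the error message.
# (Assumes one ideogram per contig; genome.fasta is a placeholder name.)
# max_ideograms=200
# n=$(count_contigs genome.fasta)
# [ "$n" -gt "$max_ideograms" ] && echo "genome.fasta has $n contigs (> $max_ideograms)"
```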

Finally, because this CIRCOS module is the last one, and it is not meaningful, I will add two parameters over the weekend to manage it: one to skip it, and another to easily ignore the errors it produces (as the config I shared with you does).

The difference is:

In both cases, at least, it should avoid breaking the pipeline.

Can you give it a try, using this custom config and the correct revision as suggested, and see if it helps?

Depending on the feedback I will know what to set up as an action plan.

Cheers 😄

JavariaAshraf commented 4 months ago

Hi @fmalmeida, I started the run as suggested above and got this error. Kindly review:

Caused by:
  Process `BACANNOT:MERGE_ANNOTATIONS (vibrio31)` terminated with an error exit status (1)

Command executed:

  # Rename gff and remove sequence entries
  # bakta has region entries
  awk '$3 == "CDS"' prokka_gff | grep "ID=" > vibrio31.gff ;

  ## Increment GFF with custom annotations
  ### VFDB
  if [ ! $(cat vibrio31_vfdb_blastn_onGenes.txt | wc -l) -le 1 ]
  then
    addBlast2Gff.R -i vibrio31_vfdb_blastn_onGenes.txt -g vibrio31.gff -o vibrio31.gff -d VFDB -t Virulence ;
    grep "VFDB" vibrio31.gff > virulence_vfdb.gff ;
  fi

  ### Victors
  if [ ! $(cat vibrio31_victors_blastp_onGenes.txt | wc -l) -le 1 ]
  then 
    addBlast2Gff.R -i vibrio31_victors_blastp_onGenes.txt -g vibrio31.gff -o vibrio31.gff -d Victors -t Virulence ;
    grep "Victors" vibrio31.gff > virulence_victors.gff ;
  fi

  ### KEGG Orthology
  ## Reformat KOfamscan Output
  if [ ! $(cat vibrio31_ko_forKEGGMapper.txt | wc -l) -eq 0 ]
  then
    awk \
      -F'\t' \
      -v OFS='\t' \
      '{x=$1;$1="";a[x]=a[x]$0}END{for(x in a)print x,a[x]}' \
      vibrio31_ko_forKEGGMapper.txt  | \
    sed \
      -e 's/\t/,/g' \
      -e 's/,,/\t/g' | \
    awk  '$2!=""' > formated.txt ;
    addKO2Gff.R -i formated.txt -g vibrio31.gff -o vibrio31.gff -d KEGG ;
  fi

  ### ICEs
  if [ ! $(cat vibrio31_iceberg_blastp_onGenes.txt | wc -l) -le 1 ]
  then
    addBlast2Gff.R -i vibrio31_iceberg_blastp_onGenes.txt -g vibrio31.gff -o vibrio31.gff -d ICEberg -t ICE ;
    grep "ICEberg" vibrio31.gff > ices_iceberg.gff ;
  fi

  ### Prophages
  if [ ! $(cat vibrio31_phast_blastp_onGenes.txt | wc -l) -le 1 ]
  then
    addBlast2Gff.R -i vibrio31_phast_blastp_onGenes.txt -g vibrio31.gff -o vibrio31.gff -d PHAST -t Prophage ;
    grep "PHAST" vibrio31.gff > prophages_phast.gff ;
  fi

  ### Resistance
  #### RGI
  if [ ! $(cat RGI_vibrio31.txt | wc -l) -le 1 ]
  then
    addRGI2gff.R -g vibrio31.gff -i RGI_vibrio31.txt -o vibrio31.gff ;
    grep "CARD" vibrio31.gff > resistance_card.gff ;
  fi

  #### AMRFinderPlus
  if [ ! $(cat AMRFinder_resistance-only.tsv | wc -l) -le 1 ]
  then 
    addNCBIamr2Gff.R -g vibrio31.gff -i AMRFinder_resistance-only.tsv -o vibrio31.gff -t Resistance -d AMRFinderPlus ;
    grep "AMRFinderPlus" vibrio31.gff > resistance_amrfinderplus.gff ;
  fi

  #### Resfinder
  if [ ! $(cat results_tab.gff | wc -l) -eq 0 ]
  then
    bedtools intersect -a results_tab.gff -b vibrio31.gff -wo | sort -k19,19 -r | awk -F '\t' '!seen[$9]++' > resfinder_intersected.txt ;
    addBedtoolsIntersect.R -g vibrio31.gff -t resfinder_intersected.txt --type Resistance --source Resfinder -o vibrio31.gff ;
    grep "Resfinder" vibrio31.gff > resistance_resfinder.gff ;
    rm -f resfinder_intersected.txt ;
  fi

  #### Custom Blast databases
  for file in input.11 ;
  do
    if [ ! $(cat $file | wc -l) -eq 0 ]
    then
      db=${file%%_custom_db.gff} ;
      bedtools intersect -a ${file} -b vibrio31.gff -wo | sort -k19,19 -r | awk -F '\t' '!seen[$9]++' > bedtools_intersected.txt ;
      addBedtoolsIntersect.R -g vibrio31.gff -t bedtools_intersected.txt --type "CDS" --source "${db}" -o vibrio31.gff ;
      grep "${db}" vibrio31.gff > custom_database_${db}.gff ;
      rm -f bedtools_intersected.txt ;
    fi
  done

  ### digIS transposable elements
  touch transposable_elements_digis.gff
  if [ -s digis_gff ]
  then
    ( cat digis_gff | sed 's/id=/ID=/g' > transposable_elements_digis.gff && rm digis_gff ) ;
    cat vibrio31.gff transposable_elements_digis.gff | bedtools sort > tmp.out.gff ;
    ( cat tmp.out.gff > vibrio31.gff && rm tmp.out.gff );
  fi

  ### integron_finder results
  ### integrons are unique / complete elements and should not be intersected
  cat vibrio31.gff vibrio31_integrons.gff | bedtools sort > tmp.gff ;
  cat tmp.gff > vibrio31.gff
  rm tmp.gff

Command exit status:
  1

Command output:
  (empty)

Command error:
  Error: malformed GFF entry at line 3545. Coordinate detected that is < 1. Exiting.

Work dir:
  /home/cdc-bioinfo/Vibrio-Feb2024/work/a9/4e823fd5dee91b46737e4606324995

Please help.

fmalmeida commented 4 months ago

Hi @JavariaAshraf ,

Once again, it seems you have a 0-based annotation, because something was found at the very first base.

However, this time it is not clear which one it is. Can you send me this working directory (/home/cdc-bioinfo/Vibrio-Feb2024/work/a9/4e823fd5dee91b46737e4606324995) with all the files that are available inside it?
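
To narrow down which input carries the offending record, the GFF files in the work directory can be scanned for start or end coordinates below 1, which is what the bedtools error complains about. This is a minimal sketch, not the pipeline's own check; the function name `find_bad_gff_coords` is made up for illustration.

```shell
# find_bad_gff_coords FILE...: print GFF records whose start or end is below 1.
# Output format: file:line: original record. Hypothetical helper.
find_bad_gff_coords() {
    awk -F'\t' '
        /^#/ { next }    # skip comments and directives
        # columns 4 and 5 hold start and end; only test lines where both are integers
        NF >= 5 && $4 ~ /^-?[0-9]+$/ && $5 ~ /^-?[0-9]+$/ && ($4 < 1 || $5 < 1) {
            printf "%s:%d: %s\n", FILENAME, FNR, $0
        }
    ' "$@"
}

# Example, run inside the failing work directory:
# find_bad_gff_coords *.gff
```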

I can take a look during the week. In the meantime I would recommend removing the “problematic” genome from the run.

Cheers.

JavariaAshraf commented 4 months ago

Please see the attachment.

4e823fd5dee91b46737e4606324995.zip

Do I have to re-run the pipeline from the start? It takes a lot of time. Is there any way to resume from the last step? The resume option doesn't work; it starts the pipeline from the first step. Please guide. Thanks

fmalmeida commented 4 months ago

So it is better to wait for a fix. I believe it is not resuming because the samplesheet is different (after removing the genome).

fmalmeida commented 4 months ago

I think it may still be the integron finder file. Can you send me these files, which were not copied into the directory (only the links came through):

/home/cdc-bioinfo/Vibrio-Feb2024/work/e8/6a17b21b8ad6b40d2f34e3a95aebb5/vibrio31_integrons.gff
/home/cdc-bioinfo/Vibrio-Feb2024/work/ce/4fede0a759e0aef9aac4cd449b28a1/vibrio31_phast_blastp_onGenes.txt
/home/cdc-bioinfo/Vibrio-Feb2024/work/83/f663818f65e05cab793bd795019c3e/vibrio31_vfdb_blastn_onGenes.txt
/home/cdc-bioinfo/Vibrio-Feb2024/work/8e/2ae390aa3b9292f043f643e51f2b5a/vibrio31_victors_blastp_onGenes.txt
/home/cdc-bioinfo/Vibrio-Feb2024/work/79/a434755b9f49795debc139b78e9f58/KOfamscan/vibrio31_ko_forKEGGMapper.txt
/home/cdc-bioinfo/Vibrio-Feb2024/work/c3/b20059a379c86919f976bbbf494728/vibrio31_iceberg_blastp_onGenes.txt
/home/cdc-bioinfo/Vibrio-Feb2024/work/09/33e52bffb0c86717fd85cb2be8b7e8/RGI_vibrio31.txt
/home/cdc-bioinfo/Vibrio-Feb2024/work/c7/034a3019b2ba76d63ca5e9889c7710/resfinder/results_tab.gff
/home/cdc-bioinfo/Vibrio-Feb2024/work/37/e0acf6a9f5f210b6d007903bec86fa/AMRFinder_resistance-only.tsv

JavariaAshraf commented 4 months ago

Please find the files. They were write-protected and I was unable to copy them. Now they are attached. 4e823fd5dee91b46737e4606324995copy.zip

fmalmeida commented 4 months ago

They are still not copied. Only the links are coming, not the real files. See below:

Screenshot from 2024-02-19 10-03-47
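
By default, Nextflow stages a task's input files into its work directory as symbolic links, which is why an archive of that directory can end up holding only links. A minimal sketch for making a copy that contains the real files, using `cp -L` to follow the links; the function name and the `workdir_copy` destination are illustrative.

```shell
# copy_dereferenced SRC DEST: copy SRC to DEST, replacing symbolic links
# with the files they point to. Hypothetical helper.
copy_dereferenced() {
    local src="$1" dest="$2"
    # -R recurses into directories; -L follows every symbolic link.
    cp -RL "$src" "$dest"
}

# Example with the work directory from this thread:
# copy_dereferenced /home/cdc-bioinfo/Vibrio-Feb2024/work/a9/4e823fd5dee91b46737e4606324995 workdir_copy
# zip -r workdir_copy.zip workdir_copy
```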

JavariaAshraf commented 4 months ago

4e823fd5dee91b46737e4606324995copy.zip Please see... I have renamed them. I hope they are accessible now.

fmalmeida commented 4 months ago

Hi @JavariaAshraf ,

For some reason, some of the integron finder results have negative coordinates:

13      Integron_Finder integron        69515   74987   .       +       1       ID=integron_01;integron_type=complete
24      Integron_Finder integron        25      12675   .       +       1       ID=integron_01;integron_type=CALIN
25      Integron_Finder integron        19      9958    .       +       1       ID=integron_01;integron_type=CALIN
27      Integron_Finder integron        6936    9536    .       +       1       ID=integron_01;integron_type=complete
31      Integron_Finder integron        478     4564    .       +       1       ID=integron_01;integron_type=CALIN
32      Integron_Finder integron        66      4604    .       +       1       ID=integron_01;integron_type=CALIN
33      Integron_Finder integron        117     4047    .       +       1       ID=integron_01;integron_type=CALIN
37      Integron_Finder integron        -2      3108    .       +       1       ID=integron_01;integron_type=CALIN
38      Integron_Finder integron        2       2804    .       +       1       ID=integron_01;integron_type=CALIN
44      Integron_Finder integron        70      1709    .       +       1       ID=integron_01;integron_type=CALIN
46      Integron_Finder integron        -17     1603    .       +       1       ID=integron_01;integron_type=CALIN

I would also need you to send me the integron finder results for this genome so I can recheck the module that converts them to GFF. It seems that the issue described in #116 is not yet fully resolved.

So, I would need the files (for this specific genome) so I can first check the tool's results and make sure they are correct, and then assess whether I can use other scripts to convert them to GFF and avoid this issue.

In the meantime, I have the following alternatives:

JavariaAshraf commented 4 months ago

Thank you for your quick replies. I am attaching the folder for the specific genome. integron_finder_v31.zip I will also run the older version, as I need the results. Thank you

fmalmeida commented 4 months ago

Okay, let me know how it goes. In the meantime, I will work on the issue in the current version.

Remember to use the circos configuration to avoid the circos error when running the earlier version. Hopefully it works with that version; if not, we can investigate.

fmalmeida commented 4 months ago

Hi @JavariaAshraf, it seems that the problem is in the integron finder tool itself. I will have to open an issue in the tool's GitHub repository.

Can you send me the sequences of contigs 37 and 46, which are the problematic ones from this genome, so that I can investigate the issue with the tool's developers?
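
Pulling individual contigs out of an assembly can be done with a short awk filter. The sketch below assumes the FASTA header IDs (the first word after ">") match the first column of the GFF, e.g. 37 and 46; `extract_contigs` and `vibrio31.fasta` are hypothetical names.

```shell
# extract_contigs FASTA ID...: print the FASTA records whose header ID
# (first word after ">") is one of the given IDs. Hypothetical helper.
extract_contigs() {
    local fasta="$1"; shift
    awk -v ids="$*" '
        BEGIN { n = split(ids, a, " "); for (i = 1; i <= n; i++) want[a[i]] = 1 }
        /^>/  { id = substr($1, 2); keep = (id in want) }   # decide per record
        keep  { print }                                     # header and sequence lines
    ' "$fasta"
}

# Example (file name is a placeholder):
# extract_contigs vibrio31.fasta 37 46 > contigs_37_46.fasta
```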

JavariaAshraf commented 4 months ago

Hi @fmalmeida You are right. I have also installed the tool separately and it is giving the following error. integron_Finder_Error.txt Thanks.

fmalmeida commented 4 months ago

Just for reference, I have opened the issue in their repository: https://github.com/gem-pasteur/Integron_Finder/issues/114. Once it is fixed, I can bring the new version to the pipeline.

The only remedy I can offer for now is a new patch release this week that allows one to skip the integron finder tool, with a param --skip_integron_finder, so that if this happens, one can run the rest.

JavariaAshraf commented 4 months ago

This will be very helpful. Thank you.

fmalmeida commented 4 months ago

Hi @JavariaAshraf, while I wait for the real fix in the integron finder tool, I have added options for skipping the INTEGRON_FINDER and/or the CIRCOS module.

Before I make a release, could you give it a try?

I would ask you to try first using only the genome that causes the current problem, vibrio31.

You could check whether the pipeline runs successfully for this genome when these modules are skipped. If so, I can then wrap it up as a patch release.

Suggested command line

nextflow run fmalmeida/bacannot \
    -r 118-add-skip-integron-finder-param \
    -latest \
    -resume \
    --skip_integron_finder \
    --skip_circos \
    # ... the rest of your normal input params

Depending on the result, I can merge it (or not).

JavariaAshraf commented 4 months ago

Hi @fmalmeida, I have run three troubled sequences with the parameters you suggested above and the run completed smoothly. Here's the screenshot. Screenshot from 2024-02-21 11-18-11

fmalmeida commented 4 months ago

Hi @JavariaAshraf ,

Thanks for the feedback. I am closing this issue then. I have merged the code into the dev branch (in PR #119), so if you need to run the pipeline with these parameters you must use the dev branch, with nextflow run fmalmeida/bacannot -r dev -latest.

Finally, I opened a new ticket, #120, so I remember to update the docker image with the new version of the integron finder tool once the devs release the fix.

Cheers, and thanks for reporting and using it.