galaxyproject / usegalaxy-playbook

Ansible Playbook for usegalaxy.org
Academic Free License v3.0

Roary 3.13.0 fails at usegalaxy.org -- likely installation issue #293

Closed jennaj closed 3 years ago

jennaj commented 4 years ago

Tool: Roary the pangenome pipeline - Quickly generate a core gene alignment from gff3 files (Galaxy Version 3.13.0)

Workaround for end-users: Until the tool is corrected at usegalaxy.org and this ticket closes out, it can be used instead at usegalaxy.eu.

Troubleshooting: seems to have three problems

  1. working directory path is maybe incorrect?
  2. missing a datatype? ("dot")?
  3. error help reports that "duplicated inputs were used" -- they weren't

Test histories: use some of the tutorial data from here: https://training.galaxyproject.org/training-material/topics/assembly/

Error for the usegalaxy.org test. It is the same error as reported at Galaxy Help here: https://help.galaxyproject.org/t/roary-fatal-error-exit-code-2/3164

Dataset Error

An error occurred while running the tool toolshed.g2.bx.psu.edu/repos/iuc/roary/roary/3.13.0.

Error Details

Execution resulted in the following messages:

Fatal error: Exit code 2 ()

Tool generated the following standard error:

Use of uninitialized value in require at /cvmfs/main.galaxyproject.org/deps/_conda/envs/__roary@3.13.0/lib/site_perl/5.26.2/x86_64-linux-thread-multi/Encode.pm line 61.
Usage: extract_proteome_from_gff [options] *.gff
Take in GFF files and create FASTA files of the protein sequences

Options: -o STR output suffix [proteome.faa]
         -t INT translation table [11]
         -f filter sequences with missing data
         -v verbose output to STDOUT
         -d STR output directory
         -w print version and exit
         -h this help message

For further info see: http://sanger-pathogens.github.io/Roary/
Usage: extract_proteome_from_gff [options] *.gff
Take in GFF files and create FASTA files of the protein sequences

Options: -o STR output suffix [proteome.faa]
         -t INT translation table [11]
         -f filter sequences with missing data
         -v verbose output to STDOUT
         -d STR output directory
         -w print version and exit
         -h this help message

For further info see: http://sanger-pathogens.github.io/Roary/
Cant open file: /galaxy-repl/main/jobdir/027/303/27303938/working/out/Sx2fKLLwy8/Prokka on data 11: gff.gff.proteome.faa

Galaxy job runner generated the following standard error:

WARNING:galaxy.model:Datatype class not found for extension 'dot'
WARNING:galaxy.model:Datatype class not found for extension 'dot'
WARNING:galaxy.model:Datatype class not found for extension 'dot'
WARNING:galaxy.model:Datatype class not found for extension 'dot'
WARNING:galaxy.model:Datatype class not found for extension 'dot'
WARNING:galaxy.model:Datatype class not found for extension 'dot'
WARNING:galaxy.model:Datatype class not found for extension 'dot'
WARNING:galaxy.model:Datatype class not found for extension 'dot'
WARNING:galaxy.model:Datatype class not found for extension 'dot'
WARNING:galaxy.model:Datatype class not found for extension 'dot'

Detected Common Potential Problems

The tool was executed with one or more duplicate input datasets. This frequently results in tool errors due to problematic input choices.

ping @davebx @mvdbeek @natefoo

natefoo commented 4 years ago

It's unclear to me what's broken here but apparently the Perl error is not the problem.

natefoo commented 4 years ago

One thing that stands out: why does the tool wrapper copy its inputs? Are symlinks not sufficient?

cp '/galaxy-repl/main/files/038/407/dataset_38407948.dat' 'Prokka on data 5: gff.gff' &&
cp '/galaxy-repl/main/files/038/411/dataset_38411630.dat' 'Prokka on data 11: gff.gff' &&
roary -f out -p ${GALAXY_SLOTS:-1} -e -n -i '95' -cd '99.0' -g '50000' -t '11' -iv '1.5' 'Prokka on data 5: gff.gff' 'Prokka on data 11: gff.gff'

@takadonet any thoughts? Looks like you originally implemented the input handling.

Takadonet commented 4 years ago

The reason is that Roary follows the symlink and uses the target's file name instead of the symlink's name, so all names would be dataset_###.
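
To illustrate the effect described here: once a path is resolved through a symlink, the link's friendly name is lost and only the target's name remains. The following is a minimal sketch in Python, not Roary's actual code. The link name `sample_A.gff` and the helper names are hypothetical, chosen only for the demonstration.

```python
import os
import tempfile


def resolved_name(link_path):
    """Return the base name a program sees after resolving symlinks."""
    return os.path.basename(os.path.realpath(link_path))


def demo():
    """Create a dataset-style file plus a friendly-named symlink and compare names."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-ins for a Galaxy dataset file and its symlink.
        target = os.path.join(tmp, "dataset_38407948.dat")
        link = os.path.join(tmp, "sample_A.gff")
        with open(target, "w") as handle:
            handle.write("##gff-version 3\n")
        os.symlink(target, link)
        return os.path.basename(link), resolved_name(link)


if __name__ == "__main__":
    print(demo())
```

Copying the file under the desired name, as the wrapper does, avoids this because the copy has no target to resolve to.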

natefoo commented 4 years ago

Ahhh gotcha. Blech, ok, thanks.

natefoo commented 4 years ago

Ah, it's not handling spaces in the input filenames:

Error: Cant access file /galaxy-repl/main/jobdir/027/303/27303938/working/Prokka
Error: Cant access file /galaxy-repl/main/jobdir/027/303/27303938/working/Prokka

Takadonet commented 4 years ago

Probably. That is my mistake: I assumed the file names would be command-line friendly.

natefoo commented 4 years ago

They're quoted, though, so I think roary is not reading those params correctly?

Takadonet commented 4 years ago

Roary cannot handle them.
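
One plausible way this happens, even though the wrapper quotes the names, is that a program re-embeds a file name in a new shell command string without quoting it. The thread does not confirm that this is what Roary does internally, so treat the sketch below as an assumption that merely matches the observed "Cant access file .../working/Prokka" message. The helper names are hypothetical.

```python
import shlex


def naive_command(tool, filename):
    """Build a shell command string without quoting the file name (the bug pattern)."""
    return f"{tool} {filename}"


def quoted_command(tool, filename):
    """Build the same command with the file name shell-quoted."""
    return f"{tool} {shlex.quote(filename)}"


name = "Prokka on data 11: gff.gff"

# shlex.split mimics how a POSIX shell splits a command string into arguments.
naive_args = shlex.split(naive_command("extract_proteome_from_gff", name))
quoted_args = shlex.split(quoted_command("extract_proteome_from_gff", name))

if __name__ == "__main__":
    print(naive_args)   # the file name is broken apart at each space
    print(quoted_args)  # the file name survives as a single argument
```

In the unquoted case the first file argument the tool receives is just `Prokka`, which is consistent with the truncated path in the error above.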

natefoo commented 4 years ago

@bgruening did you fix this manually somehow on usegalaxy.eu?

bgruening commented 4 years ago

I don't think so.

(venv) galaxy@sn04:~/shed_tools/toolshed.g2.bx.psu.edu/repos/iuc/roary/e02e9af2743f/roary$ hg diff
(venv) galaxy@sn04:~/shed_tools/toolshed.g2.bx.psu.edu/repos/iuc/roary/e02e9af2743f/roary$

jennaj commented 4 years ago

Summary:

  1. Roary is picky about the input dataset name format
  2. Spaces should be avoided
  3. All inputs should have a distinct name
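
The three constraints above could be enforced automatically before the inputs reach Roary. The following is a minimal sketch in Python, assuming the dataset labels are available as plain strings. The function `safe_names` and its replacement rule are hypothetical and are not part of the existing wrapper.

```python
import re


def safe_names(labels, suffix=".gff"):
    """Map dataset labels to distinct file names without spaces or shell-unfriendly characters."""
    result = []
    seen = set()
    for label in labels:
        stem = label[: -len(suffix)] if label.endswith(suffix) else label
        # Replace every run of characters outside [A-Za-z0-9_.-] with one underscore.
        stem = re.sub(r"[^A-Za-z0-9_.-]+", "_", stem).strip("_") or "input"
        candidate = stem
        counter = 1
        # Append a counter until the name is unique within this job.
        while candidate in seen:
            counter += 1
            candidate = f"{stem}_{counter}"
        seen.add(candidate)
        result.append(candidate + suffix)
    return result


if __name__ == "__main__":
    print(safe_names(["Prokka on data 5: gff.gff", "Prokka on data 11: gff.gff"]))
```

The counter suffix handles the case where two different labels collapse to the same cleaned name, so every input still ends up with a distinct name.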

Workarounds for end-users working with individual datasets:

  • If executing tools from the History: Click on the pencil icon for an input gff dataset to reach the Edit Attributes forms. On the first tab, modify the file name, removing any spaces, then save. Do this for all gff inputs to avoid the naming problem. Rerun Roary using those renamed inputs.
  • If executing tools from a Workflow: The output gff dataset generated by an upstream tool (likely Prokka) can be renamed to remove spaces as a "post job action" within the Workflow itself. This will pass the renamed gff inputs to Roary and avoid the naming problem.

Notes

The upstream tool commonly used (Prokka), when executed in Galaxy on individual datasets, will always insert spaces into the result dataset names.

When Prokka is executed with a collection input, spaces in dataset names are avoided from the start. Collections and workflows are worth learning about. If interested, please see:

jennaj commented 2 years ago

Update:

If an intermediate parsing job fails, the tool outputs empty "green" results. This is confusing for users. It seems more likely to happen with a large number of gff inputs, but that isn't confirmed.

Can sub-job tasks that fail be trapped better in the wrapper? Could failed sub-jobs be ignored or rerun? If just a few sub-jobs fail, maybe allow the user to choose to ignore them and have what was skipped written to a job log shown in the history? At a minimum, if all outputs will be empty, red error dataset results would be better.
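
The "at a minimum" option could be approached with a post-run check that turns missing or empty outputs into a non-zero exit code. This is a minimal sketch in Python, not something the wrapper currently does. The function names are hypothetical, and it assumes the tool's error detection treats a non-zero exit code as a failure, as the "Fatal error: Exit code 2" report earlier in this thread suggests.

```python
import os
import sys


def empty_outputs(paths):
    """Return the paths that are missing or zero bytes long."""
    return [p for p in paths if not os.path.isfile(p) or os.path.getsize(p) == 0]


def exit_code_for(paths):
    """Return 0 when every expected output has content, 1 otherwise."""
    bad = empty_outputs(paths)
    for path in bad:
        # Report each problem on stderr so it shows up in the job's error details.
        print(f"Expected output is missing or empty: {path}", file=sys.stderr)
    return 1 if bad else 0


if __name__ == "__main__":
    # Usage: python check_outputs.py <output file> [<output file> ...]
    sys.exit(exit_code_for(sys.argv[1:]))
```

Run after Roary on whichever output files the wrapper expects, a check like this would make an all-empty result show up as a red error dataset instead of an empty green one. It does not address rerunning or skipping individual sub-jobs.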

Example discussion: https://help.galaxyproject.org/t/roary-core-genome-alignment-file-is-empty/4054/13

jennaj commented 1 year ago

Tool version is now at 3.13.0+galaxy2

KargKarg commented 1 year ago

Hello everyone,

I ran Roary for 219 genomes, but in the presence/absence matrix I only have 197 genomes.

Does anyone know the reason?