EBI-Metagenomics / pipeline-v5

This repository contains all CWL descriptions of the MGnify pipeline version 5.0.
https://www.ebi.ac.uk/metagenomics/
Apache License 2.0

A number of typos/bugs with suggestions for some #21

Closed amkibriya closed 4 years ago

amkibriya commented 4 years ago

Dear EBI developers,

First, thank you for the great work you're doing on this.

I am using the development branch and trying to run the wgs-single-reads pipeline. The sample input_example file mostly finishes successfully (it skips some steps), but with my own input file I ran into the following issues (my main stumbling block, however, is the last one below):

0) (a non-issue, really) The file input_examples/wgs-single-ERR1995312_small.fastq.gz is not actually a gzip file.
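For anyone checking their own inputs, `gzip -t` is a quick way to verify this; a minimal sketch (demo file names are made up):

```shell
# A plain-text file with a .gz extension fails gzip's integrity test:
printf '@read1\nACGT\n+\nIIII\n' > fake.fastq.gz   # not actually compressed
if gzip -t fake.fastq.gz 2>/dev/null; then
    echo "valid gzip"
else
    echo "not a gzip file"        # this branch fires for the broken file
fi
# A genuinely compressed file passes the same test:
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > real.fastq.gz
gzip -t real.fastq.gz && echo "valid gzip"
```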

1) The following scripts have a wrong shebang line (#!/usr/bin/env /hps/nobackup2/production/...) at the top, which prevents them from running:

docker/scripts_python3/count_lines.py
docker/scripts_python3/its-length-new.py
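A portable script would resolve the interpreter from PATH instead of hard-coding a cluster path. A minimal sketch (the function body below is illustrative, not the repo's actual code):

```python
#!/usr/bin/env python3
# Portable shebang: python3 is looked up on PATH instead of a hard-coded
# cluster location like /hps/nobackup2/production/... .
import sys


def count_lines(path):
    """Return the number of lines in a text file (illustrative sketch)."""
    with open(path) as handle:
        return sum(1 for _ in handle)


if __name__ == "__main__" and len(sys.argv) > 1:
    print(count_lines(sys.argv[1]))
```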

2) The file tools/chunks/dna_chunker/Dockerfile contains leftover git merge conflict markers, which prevents the docker image from building successfully.

3) The following seem to use the wrong baseCommand for the Alpine docker image. They refer to bash; changing it to sh works:

utils/count_lines/count_fastq_exp.cwl
utils/count_number_lines.cwl
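The fix amounts to a one-line change in each tool; a hypothetical excerpt (the real files may differ in detail):

```yaml
# Hypothetical excerpt of utils/count_number_lines.cwl.
# Alpine images ship BusyBox sh, not bash, so bash as baseCommand fails.
cwlVersion: v1.0
class: CommandLineTool
hints:
  DockerRequirement:
    dockerPull: alpine:3.12      # tag assumed for illustration
baseCommand: [ sh, -c ]          # was: [ bash, -c ]
```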

4) The rfam_models and (ssu/lsu)_(db/tax/otus) inputs are declared as strings in the YML and subsequent CWL files. Should they be of type File (as in the sample workflows/ymls/amplicon-wf--v.5-cond.yml file) instead? With type string, cwltool runs the Docker image passing the absolute path of the DB file as a plain string, but that path resides outside the docker image. After changing the type to File, the pipeline runs successfully for Infernal's cmsearch and the other steps that use these databases. I changed the following files:

modified:   tools/RNA_prediction/cmsearch-deoverlap/cmsearch-deoverlap-v0.02.cwl
modified:   tools/RNA_prediction/cmsearch/infernal-cmsearch-v1.1.2.cwl
          --> a further change is needed in this file:
          - glob: $(inputs.query_sequences.basename).$(inputs.covariance_model_database.basename).cmsearch_matches.tbl
          + glob: $(inputs.query_sequences.basename).$(inputs.covariance_model_database.split('/').slice(-1)[0]).cmsearch_matches.tbl

modified:   tools/RNA_prediction/mapseq/mapseq.cwl
modified:   tools/RNA_prediction/mapseq2biom/mapseq2biom.cwl
modified:   workflows/conditionals/raw-reads/raw-reads-2.cwl
modified:   workflows/raw-reads-wf--v.5-cond.cwl
modified:   workflows/subworkflows/classify-otu-visualise.cwl
modified:   workflows/subworkflows/cmsearch-multimodel-wf.cwl
modified:   workflows/subworkflows/rna_prediction-sub-wf.cwl
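For anyone tracking the same change, the type fix looks roughly like this (a sketch only; the input name is taken from the cmsearch tool, and the real files may differ):

```yaml
# Before: a string is just text to cwltool, so the database file is never
# staged into the container and the tool fails inside Docker.
#
#   covariance_model_database:
#     type: string
#
# After: declaring it as File makes cwltool mount the file into the image.
covariance_model_database:
  type: File
```

Once the input is a File, expressions like $(inputs.covariance_model_database.basename) also become available, which is what the glob change above relies on.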

5) My main issue is the input of type Directory at lines 252-260 of the workflows/conditionals/raw-reads/raw-reads-2.cwl file. The pipeline fails at this step with the following error message. At the moment, I don't really have much of a clue how to rectify this.

INFO [step return_tax_dir] start
ERROR Exception on step 'return_tax_dir'
ERROR [step return_tax_dir] Cannot make job: Invalid job input record:
the `dir_list` field is not valid because
  tried array of <Directory> but
    item is invalid because
      is not a dict

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/.../.local/lib/python3.7/site-packages/cwltool/workflow_job.py", line 852, in job
    for newjob in step.iterable:
  File "/.../.local/lib/python3.7/site-packages/cwltool/workflow_job.py", line 771, in try_make_job
    for j in jobs:
  File "/home/kibriyam/.local/lib/python3.7/site-packages/cwltool/workflow_job.py", line 78, in job
    for j in self.step.job(joborder, output_callback, runtimeContext):
  File "/home/kibriyam/.local/lib/python3.7/site-packages/cwltool/workflow.py", line 443, in job
    runtimeContext,
  File "/.../.local/lib/python3.7/site-packages/cwltool/command_line_tool.py", line 166, in job
    builder = self._init_job(job_order, runtimeContext)
  File "/.../.local/lib/python3.7/site-packages/cwltool/process.py", line 819, in _init_job
    raise WorkflowException("Invalid job input record:\n" + str(err)) from err
cwltool.errors.WorkflowException: Invalid job input record:
the `dir_list` field is not valid because
  tried array of <Directory> but
    item is invalid because
      is not a dict
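For context on the error: cwltool expects each item of a Directory[] input to be a CWL Directory object (a dict with class: Directory), not a bare path string. In a job YAML that looks like this (paths are illustrative):

```yaml
# Invalid: bare strings are "not a dict", triggering the error above.
# dir_list:
#   - /results/taxonomy-summary/SSU
#
# Valid: each array item is a mapping with class: Directory.
dir_list:
  - class: Directory
    path: /results/taxonomy-summary/SSU
  - class: Directory
    path: /results/taxonomy-summary/LSU
```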

Let me know if I can provide something more to help you fix the above.

Thanks, Ashraf

mberacochea commented 4 years ago

Dear Ashraf,

Thank you for reporting this. We have addressed most of those issues in this branch (https://github.com/EBI-Metagenomics/pipeline-v5/tree/refactor-docker-files) - PR #22.

The current blocker to run the pipeline with docker is https://github.com/EBI-Metagenomics/pipeline-v5/issues/23

I'll close this ticket once #23 is fixed.

Cheers

amkibriya commented 4 years ago

Hi Martin ( @mberacochea ),

In the refactor-docker-files branch the motus classification step is failing for me. I notice this branch uses a different docker image (microbiomeinformatics/pipeline-v5.motus:v2.5.1) than the develop branch, which uses mgnify/pipeline-v5.motus and where this step succeeds. Is this also a known issue?

It appears the database (-db /data/databases/pipeline-v5/motus-v2.5.1/) is not present in the microbiomeinformatics/pipeline-v5.motus:v2.5.1 docker image.

Thanks in advance, Ashraf

mberacochea commented 4 years ago

Hi Ashraf (@amkibriya)

We removed the DB from the docker image to make it smaller. You need to run https://github.com/EBI-Metagenomics/pipeline-v5/blob/refactor-docker-files/tools/Raw_reads/mOTUs/mOTUs_download_db.py to get the database; this step was missing from the documentation.

Please, run:

cd <dbs-path>/motus-v2.5.1/
python <repo-base>/tools/Raw_reads/mOTUs/mOTUs_download_db.py
amkibriya commented 4 years ago

Hi Ashraf (@amkibriya)

Please, run:

cd <dbs-path>/motus-v2.5.1/
python <repo-base>/tools/Raw_reads/mOTUs/mOTUs_download_db.py

Ok thanks for that.

I also ran into issues at the [step interproscan]

I guess the [step interproscan] is also on the TODO list?

Many thanks, Ashraf

mberacochea commented 4 years ago

Hi,

InterProScan has to be installed on the system and its databases downloaded; we will try to improve the docker container in the future.

We have created the conda envs (https://github.com/EBI-Metagenomics/pipeline-v5/blob/refactor-docker-files/environment/README.md) and it's fairly easy to install InterProScan. Docs: https://github.com/EBI-Metagenomics/pipeline-v5/tree/refactor-docker-files#docker

Cheers

amkibriya commented 4 years ago

Hi Ashraf (@amkibriya)

Please, run:

cd <dbs-path>/motus-v2.5.1/
python <repo-base>/tools/Raw_reads/mOTUs/mOTUs_download_db.py

Hi Martin (@mberacochea) ,

(1) The motus step still seems to have some issues. The database directory specified with the -db option doesn't get mounted in the image at the expected place. The script /mOTUs_v2-2.5.1/motus (lines 92-95) inside the motus docker image seems to expect the database to be mounted at /mOTUs_v2-2.5.1/db_mOTU instead of the /var/lib/cwl/.. path mounted by docker/cwltool.

(2) The hmmsearch step is failing for me. It seems to have the same issue as no. (4) in my first comment: the HMMSCAN database name is being passed as a string instead of as type File or Directory.

Many thanks, Ashraf

amkibriya commented 4 years ago

Hi,

So, I was able to use @KeteSakharova's "return dir" fix in the master branch for my main issue, number (5) above. With a few more fixes (hmmsearch, go-slim and a few others) I was able to run the (raw-reads) pipeline in the develop branch successfully to completion. So I think I'll close this issue, since it works for me now.

Thanks for the assistance @mberacochea @KeteSakharova .

Kind Regards, Ashraf