Use gxformat2 to convert .ga to .cwl?

simleo commented 3 years ago

This came up at the 2020 ELIXIR BioHackathon.

I experimented with this in https://github.com/ResearchObject/ro-crate-py/tree/gxformat2_cwl_conv. Here are the changes. I converted test/test-data/test_galaxy_wf.ga, and the output produced by gxformat2 is very different from the one obtained with galaxy2cwl. I'm not even sure the latter is a valid CWL workflow. Did I use the gxformat2 API incorrectly? If not, maybe this needs to be checked by a CWL expert.
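For reference, here is a minimal sketch of what the conversion could look like. It assumes that gxformat2.abstract.from_dict is the right entry point (the from_dict step mentioned further down in this thread) and that it returns a JSON-serializable dict. The function name ga_to_abstract_cwl and the converter parameter are made up for illustration; the parameter only exists so the function can be exercised without gxformat2 installed.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def ga_to_abstract_cwl(
    ga_path: str,
    cwl_path: str,
    converter: Optional[Callable[[dict], dict]] = None,
) -> dict:
    """Convert a native Galaxy workflow (.ga, JSON) to abstract CWL.

    `converter` is a hypothetical hook: any callable mapping the native
    workflow dict to a CWL dict.
    """
    if converter is None:
        # Assumption: gxformat2 exposes the conversion as abstract.from_dict.
        from gxformat2.abstract import from_dict as converter

    # A .ga file is plain JSON.
    native = json.loads(Path(ga_path).read_text())
    cwl = converter(native)

    # JSON is a subset of YAML, so writing JSON keeps this sketch
    # free of a YAML dependency.
    Path(cwl_path).write_text(json.dumps(cwl, indent=2))
    return cwl
```

Called as ga_to_abstract_cwl("test/test-data/test_galaxy_wf.ga", "test_galaxy_wf.cwl"), this would use gxformat2 if it is installed; the output still has to be validated separately.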

simleo commented 3 years ago

The file generated with gxformat2 does not validate. To reproduce, build the Docker container from https://github.com/ResearchObject/ro-crate-py/tree/9c2c74506226f4508985e86df7b1fa72f657f8b2 and run cwltool on the test output:

docker build --no-cache -t ro-crate-py .
docker run --rm -it --name ro-crate-py ro-crate-py bash
# pip install pytest cwltool
# pytest test/
# cwltool --validate --enable-dev /tmp/pytest-of-root/pytest-current/test_galaxy_wf_cratecurrent/ro_crate_out/test_galaxy_wf.cwl 
INFO /usr/local/bin/cwltool 3.0.20201109103151
INFO Resolved '/tmp/pytest-of-root/pytest-current/test_galaxy_wf_cratecurrent/ro_crate_out/test_galaxy_wf.cwl' to 'file:///tmp/pytest-of-root/pytest-current/test_galaxy_wf_cratecurrent/ro_crate_out/test_galaxy_wf.cwl'
ERROR Tool definition failed validation:
No cwlVersion found. Use the following syntax in your CWL document to declare the version: cwlVersion: <version>.
Note: if this is a CWL draft-2 (pre v1.0) document then it will need to be upgraded first.
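
The same validation could be scripted, for example from a test. This is only a sketch: the helper names are hypothetical, the flags are the ones used in the command above, and it assumes cwltool is on the PATH and exits with a non-zero status when validation fails.

```python
import subprocess
from typing import List


def cwltool_validate_cmd(cwl_path: str, enable_dev: bool = True) -> List[str]:
    """Build the cwltool command line used above to validate a CWL file."""
    cmd = ["cwltool", "--validate"]
    if enable_dev:
        # Needed for development versions of the spec (e.g. v1.2.0-dev2).
        cmd.append("--enable-dev")
    cmd.append(cwl_path)
    return cmd


def is_valid_cwl(cwl_path: str) -> bool:
    """Run cwltool and report whether validation succeeded.

    Assumption: a non-zero exit status means the document did not validate.
    """
    result = subprocess.run(
        cwltool_validate_cmd(cwl_path), capture_output=True, text=True
    )
    return result.returncode == 0
```

Keeping the command construction separate from the subprocess call makes it possible to check the arguments without cwltool installed.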

simleo commented 3 years ago

The code was missing the from_dict step, thanks @ieguinoa for adding it. However, the CWL file generated in the tests still does not validate (cwltool 3.0.20201109103151). The errors look like this:

../tmp/pytest-of-root/pytest-current/test_galaxy_wf_cratecurrent/ro_crate_out/test_galaxy_wf.cwl:82:7:
Workflow step output 'realigned' does not correspond to
../tmp/pytest-of-root/pytest-current/test_galaxy_wf_cratecurrent/ro_crate_out/test_galaxy_wf.cwl:87:7:
tool output (expected '')

Here is the generated CWL:

class: Workflow
cwlVersion: v1.2
inputs:
  'GenBank file ':
    id: 'GenBank file '
    type: File
  Paired Collection (fastqsanger):
    id: Paired Collection (fastqsanger)
    type: File[]
outputs:
  _anonymous_output_1:
    outputSource: 'GenBank file '
    type: File
  _anonymous_output_2:
    outputSource: Paired Collection (fastqsanger)
    type: File
  _anonymous_output_3:
    outputSource: 2/snpeff_output
    type: File
  _anonymous_output_4:
    outputSource: 2/output_fasta
    type: File
  _anonymous_output_5:
    outputSource: 3/output_paired_coll
    type: File
  _anonymous_output_6:
    outputSource: 3/report_html
    type: File
  _anonymous_output_7:
    outputSource: 4/bam_output
    type: File
  FASTP_report:
    outputSource: 5/html_report
    type: File
  _anonymous_output_8:
    outputSource: 6/output1
    type: File
  _anonymous_output_9:
    outputSource: '7'
    type: File
  _anonymous_output_10:
    outputSource: 8/metrics_file
    type: File
  _anonymous_output_11:
    outputSource: 8/outFile
    type: File
  mapping_report:
    outputSource: 9/html_report
    type: File
  _anonymous_output_12:
    outputSource: 10/realigned
    type: File
  DeDup_Report:
    outputSource: 11/html_report
    type: File
  _anonymous_output_13:
    outputSource: 12/variants
    type: File
  _anonymous_output_14:
    outputSource: 13/statsFile
    type: File
  _anonymous_output_15:
    outputSource: 13/snpeff_output
    type: File
  _anonymous_output_16:
    outputSource: '14'
    type: File
  SnpEff vcf.gz:
    outputSource: 15/output1
    type: File
  _anonymous_output_17:
    outputSource: '16'
    type: File
steps:
  '10':
    in:
      reads:
        source: 8/outFile
      reference_source|ref:
        source: 2/output_fasta
    out:
    - realigned
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '11':
    in:
      results_0|software_cond|output_0|input:
        source: 8/metrics_file
    out:
    - plots
    - stats
    - html_report
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '12':
    in:
      reads:
        source: 10/realigned
      reference_source|ref:
        source: 2/output_fasta
    out:
    - variants
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '13':
    in:
      input:
        source: 12/variants
      snpDb|snpeff_db:
        source: 2/snpeff_output
    out:
    - snpeff_output
    - statsFile
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '14':
    in:
      input:
        source: 13/snpeff_output
    out: []
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '15':
    in:
      input1:
        source: 13/snpeff_output
    out:
    - output1
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '16':
    in:
      input_list:
        source: '14'
    out: []
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '2':
    in:
      input_type|input_gbk:
        source: 'GenBank file '
    out:
    - output_fasta
    - snpeff_output
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '3':
    in:
      single_paired|paired_input:
        source: Paired Collection (fastqsanger)
    out:
    - report_json
    - report_html
    - output_paired_coll
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '4':
    in:
      fastq_input|fastq_input1:
        source: 3/output_paired_coll
      reference_source|ref_file:
        source: 2/output_fasta
    out:
    - bam_output
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '5':
    in:
      results_0|software_cond|input:
        source: 3/report_json
    out:
    - plots
    - stats
    - html_report
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '6':
    in:
      input1:
        source: 4/bam_output
    out:
    - output1
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '7':
    in:
      input:
        source: 6/output1
    out: []
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '8':
    in:
      inputFile:
        source: 6/output1
    out:
    - outFile
    - metrics_file
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '9':
    in:
      results_0|software_cond|output_0|type|input:
        source: '7'
    out:
    - plots
    - stats
    - html_report
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}

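Note that the inputs and outputs fields of every step's run section are empty, which would explain the error above: each step lists outputs under out that its Operation never declares. For comparison, the following is the CWL we are currently generating with galaxy2cwl: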

class: Workflow
cwlVersion: v1.2.0-dev2
doc: 'Abstract CWL Automatically generated from the Galaxy workflow file: COVID-19:
  PE Variation'
inputs:
  'GenBank file ':
    format: data
    type: File
  Paired Collection (fastqsanger):
    format: data
    type: File
outputs: {}
steps:
  10_Realign reads:
    in:
      reads: 8_MarkDuplicates/outFile
      reference_source|ref: 2_SnpEff build/output_fasta
    out:
    - realigned
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_iuc_lofreq_viterbi_lofreq_viterbi_2_1_3_1+galaxy1
      inputs:
        reads:
          format: Any
          type: File
        reference_source|ref:
          format: Any
          type: File
      outputs:
        realigned:
          doc: bam
          type: File
  11_MultiQC:
    in:
      results_0|software_cond|output_0|input: 8_MarkDuplicates/metrics_file
    out:
    - stats
    - plots
    - html_report
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_iuc_multiqc_multiqc_1_7_1
      inputs:
        results_0|software_cond|output_0|input:
          format: Any
          type: File
      outputs:
        html_report:
          doc: html
          type: File
        plots:
          doc: input
          type: File
        stats:
          doc: input
          type: File
  12_Call variants:
    in:
      reads: 10_Realign reads/realigned
      reference_source|ref: 2_SnpEff build/output_fasta
    out:
    - variants
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_iuc_lofreq_call_lofreq_call_2_1_3_1+galaxy0
      inputs:
        reads:
          format: Any
          type: File
        reference_source|ref:
          format: Any
          type: File
      outputs:
        variants:
          doc: vcf
          type: File
  13_SnpEff eff:
    in:
      input: 12_Call variants/variants
      snpDb|snpeff_db: 2_SnpEff build/snpeff_output
    out:
    - snpeff_output
    - statsFile
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_iuc_snpeff_snpEff_4_3+T_galaxy1
      inputs:
        input:
          format: Any
          type: File
        snpDb|snpeff_db:
          format: Any
          type: File
      outputs:
        snpeff_output:
          doc: vcf
          type: File
        statsFile:
          doc: html
          type: File
  14_SnpSift Extract Fields:
    in:
      input: 13_SnpEff eff/snpeff_output
    out:
    - output
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_iuc_snpsift_snpSift_extractFields_4_3+t_galaxy0
      inputs:
        input:
          format: Any
          type: File
      outputs:
        output:
          doc: tabular
          type: File
  15_Convert VCF to VCF_BGZIP:
    in:
      input1: 13_SnpEff eff/snpeff_output
    out:
    - output1
    run:
      class: Operation
      id: CONVERTER_vcf_to_vcf_bgzip_0
      inputs:
        input1:
          format: Any
          type: File
      outputs:
        output1:
          doc: vcf_bgzip
          type: File
  16_Collapse Collection:
    in:
      input_list: 14_SnpSift Extract Fields/output
    out:
    - output
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_nml_collapse_collections_collapse_dataset_4_1
      inputs:
        input_list:
          format: Any
          type: File
      outputs:
        output:
          doc: input
          type: File
  2_SnpEff build:
    in:
      input_type|input_gbk: 'GenBank file '
    out:
    - snpeff_output
    - output_fasta
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_iuc_snpeff_snpEff_build_gb_4_3+T_galaxy4
      inputs:
        input_type|input_gbk:
          format: Any
          type: File
      outputs:
        output_fasta:
          doc: fasta
          type: File
        snpeff_output:
          doc: snpeffdb
          type: File
  3_fastp:
    in:
      single_paired|paired_input: Paired Collection (fastqsanger)
    out:
    - output_paired_coll
    - report_html
    - report_json
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_iuc_fastp_fastp_0_19_5+galaxy1
      inputs:
        single_paired|paired_input:
          format: Any
          type: File
      outputs:
        output_paired_coll:
          doc: input
          type: File
        report_html:
          doc: html
          type: File
        report_json:
          doc: json
          type: File
  4_Map with BWA-MEM:
    in:
      fastq_input|fastq_input1: 3_fastp/output_paired_coll
      reference_source|ref_file: 2_SnpEff build/output_fasta
    out:
    - bam_output
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_devteam_bwa_bwa_mem_0_7_17_1
      inputs:
        fastq_input|fastq_input1:
          format: Any
          type: File
        reference_source|ref_file:
          format: Any
          type: File
      outputs:
        bam_output:
          doc: bam
          type: File
  5_MultiQC:
    in:
      results_0|software_cond|input: 3_fastp/report_json
    out:
    - stats
    - plots
    - html_report
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_iuc_multiqc_multiqc_1_7_1
      inputs:
        results_0|software_cond|input:
          format: Any
          type: File
      outputs:
        html_report:
          doc: html
          type: File
        plots:
          doc: input
          type: File
        stats:
          doc: input
          type: File
  6_Filter SAM or BAM, output SAM or BAM:
    in:
      input1: 4_Map with BWA-MEM/bam_output
    out:
    - output1
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_devteam_samtool_filter2_samtool_filter2_1_8+galaxy1
      inputs:
        input1:
          format: Any
          type: File
      outputs:
        output1:
          doc: sam
          type: File
  7_Samtools stats:
    in:
      input: 6_Filter SAM or BAM, output SAM or BAM/output1
    out:
    - output
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_devteam_samtools_stats_samtools_stats_2_0_2+galaxy2
      inputs:
        input:
          format: Any
          type: File
      outputs:
        output:
          doc: tabular
          type: File
  8_MarkDuplicates:
    in:
      inputFile: 6_Filter SAM or BAM, output SAM or BAM/output1
    out:
    - metrics_file
    - outFile
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_devteam_picard_picard_MarkDuplicates_2_18_2_2
      inputs:
        inputFile:
          format: Any
          type: File
      outputs:
        metrics_file:
          doc: txt
          type: File
        outFile:
          doc: bam
          type: File
  9_MultiQC:
    in:
      results_0|software_cond|output_0|type|input: 7_Samtools stats/output
    out:
    - stats
    - plots
    - html_report
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_iuc_multiqc_multiqc_1_7_1
      inputs:
        results_0|software_cond|output_0|type|input:
          format: Any
          type: File
      outputs:
        html_report:
          doc: html
          type: File
        plots:
          doc: input
          type: File
        stats:
          doc: input
          type: File
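
The difference between the two documents can be checked mechanically. The sketch below is not what cwltool does internally; it is a rough approximation of the failing check, written against the map-style layout shown in the two dumps above (steps as a mapping, out as a list of strings, run.outputs as a mapping). The function name is made up for illustration.

```python
from typing import Dict, List


def undeclared_step_outputs(workflow: dict) -> Dict[str, List[str]]:
    """Map each step id to the `out` entries missing from its run.outputs."""
    problems: Dict[str, List[str]] = {}
    for step_id, step in workflow.get("steps", {}).items():
        run = step.get("run")
        if not isinstance(run, dict):
            # `run` may also be a reference to an external file; skip those.
            continue
        declared = set(run.get("outputs", {}))
        missing = [name for name in step.get("out", []) if name not in declared]
        if missing:
            problems[step_id] = missing
    return problems


# Step 10 as generated via gxformat2 (empty Operation outputs).
gxformat2_wf = {
    "steps": {
        "10": {
            "out": ["realigned"],
            "run": {"class": "Operation", "inputs": {}, "outputs": {}},
        }
    }
}

# The same step as generated by galaxy2cwl (outputs declared).
galaxy2cwl_wf = {
    "steps": {
        "10_Realign reads": {
            "out": ["realigned"],
            "run": {
                "class": "Operation",
                "outputs": {"realigned": {"doc": "bam", "type": "File"}},
            },
        }
    }
}

print(undeclared_step_outputs(gxformat2_wf))   # {'10': ['realigned']}
print(undeclared_step_outputs(galaxy2cwl_wf))  # {}
```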

I've opened a draft PR from the branch to make it easier to track changes.

ResearchObject/ro-crate-py #33