Open simleo opened 3 years ago
The file generated with gxformat2 does not validate. Building the docker container from https://github.com/ResearchObject/ro-crate-py/tree/9c2c74506226f4508985e86df7b1fa72f657f8b2:
docker build --no-cache -t ro-crate-py .
docker run --rm -it --name ro-crate-py ro-crate-py bash
# pip install pytest cwltool
# pytest test/
# cwltool --validate --enable-dev /tmp/pytest-of-root/pytest-current/test_galaxy_wf_cratecurrent/ro_crate_out/test_galaxy_wf.cwl
INFO /usr/local/bin/cwltool 3.0.20201109103151
INFO Resolved '/tmp/pytest-of-root/pytest-current/test_galaxy_wf_cratecurrent/ro_crate_out/test_galaxy_wf.cwl' to 'file:///tmp/pytest-of-root/pytest-current/test_galaxy_wf_cratecurrent/ro_crate_out/test_galaxy_wf.cwl'
ERROR Tool definition failed validation:
No cwlVersion found. Use the following syntax in your CWL document to declare the version: cwlVersion: <version>.
Note: if this is a CWL draft-2 (pre v1.0) document then it will need to be upgraded first.
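A quick pre-check along these lines can catch this failure before cwltool is invoked. This is only a sketch, not part of ro-crate-py or gxformat2: the helper name `missing_top_level_fields` and both sample dicts are made up for illustration, and the first dict merely stands in for a document that was written out without being converted to CWL.

```python
# Keys that cwltool expects at the top level of a CWL document.
REQUIRED_TOP_LEVEL = ("cwlVersion", "class")


def missing_top_level_fields(doc: dict) -> list:
    """Return the required top-level keys that are absent from `doc`."""
    return [key for key in REQUIRED_TOP_LEVEL if key not in doc]


# Hypothetical stand-in for a workflow dict that was never converted to CWL.
native_like = {"a_galaxy_workflow": "true", "steps": {}}
# Hypothetical stand-in for a converted, abstract CWL workflow.
abstract_like = {"class": "Workflow", "cwlVersion": "v1.2", "steps": {}}

print(missing_top_level_fields(native_like))
print(missing_top_level_fields(abstract_like))
```

An empty list only means the two keys are present; it says nothing about whether the rest of the document validates.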
The code was missing the from_dict step; thanks @ieguinoa for adding it.
However, the CWL file generated in the tests still does not validate (cwltool 3.0.20201109103151). The errors look like this:
../tmp/pytest-of-root/pytest-current/test_galaxy_wf_cratecurrent/ro_crate_out/test_galaxy_wf.cwl:82:7: Workflow step output 'realigned' does not correspond to
../tmp/pytest-of-root/pytest-current/test_galaxy_wf_cratecurrent/ro_crate_out/test_galaxy_wf.cwl:87:7:   tool output (expected '')
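The mismatch behind this error can be reproduced without cwltool. The sketch below is an illustration only: `unmatched_step_outputs` is a hypothetical helper, and the dict is a trimmed, hand-copied version of step '10' from the generated workflow in this thread. It lists, for each step, the names in `out` that have no matching entry in `run.outputs`.

```python
def unmatched_step_outputs(workflow: dict) -> dict:
    """Map each step id to the `out` names missing from its `run.outputs`."""
    problems = {}
    for step_id, step in workflow.get("steps", {}).items():
        declared = set(step.get("run", {}).get("outputs", {}))
        missing = [name for name in step.get("out", []) if name not in declared]
        if missing:
            problems[step_id] = missing
    return problems


# Trimmed copy of step '10' from the generated workflow (assumed shape).
workflow = {
    "class": "Workflow",
    "cwlVersion": "v1.2",
    "steps": {
        "10": {
            "in": {"reads": {"source": "8/outFile"}},
            "out": ["realigned"],
            "run": {"class": "Operation", "doc": "", "inputs": {}, "outputs": {}},
        }
    },
}

print(unmatched_step_outputs(workflow))  # {'10': ['realigned']}
```

With `run.outputs` empty, every name listed under `out` is reported, which matches the "expected ''" part of the cwltool message.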
Here is the generated CWL:
class: Workflow
cwlVersion: v1.2
inputs:
  'GenBank file ':
    id: 'GenBank file '
    type: File
  Paired Collection (fastqsanger):
    id: Paired Collection (fastqsanger)
    type: File[]
outputs:
  _anonymous_output_1:
    outputSource: 'GenBank file '
    type: File
  _anonymous_output_2:
    outputSource: Paired Collection (fastqsanger)
    type: File
  _anonymous_output_3:
    outputSource: 2/snpeff_output
    type: File
  _anonymous_output_4:
    outputSource: 2/output_fasta
    type: File
  _anonymous_output_5:
    outputSource: 3/output_paired_coll
    type: File
  _anonymous_output_6:
    outputSource: 3/report_html
    type: File
  _anonymous_output_7:
    outputSource: 4/bam_output
    type: File
  FASTP_report:
    outputSource: 5/html_report
    type: File
  _anonymous_output_8:
    outputSource: 6/output1
    type: File
  _anonymous_output_9:
    outputSource: '7'
    type: File
  _anonymous_output_10:
    outputSource: 8/metrics_file
    type: File
  _anonymous_output_11:
    outputSource: 8/outFile
    type: File
  mapping_report:
    outputSource: 9/html_report
    type: File
  _anonymous_output_12:
    outputSource: 10/realigned
    type: File
  DeDup_Report:
    outputSource: 11/html_report
    type: File
  _anonymous_output_13:
    outputSource: 12/variants
    type: File
  _anonymous_output_14:
    outputSource: 13/statsFile
    type: File
  _anonymous_output_15:
    outputSource: 13/snpeff_output
    type: File
  _anonymous_output_16:
    outputSource: '14'
    type: File
  SnpEff vcf.gz:
    outputSource: 15/output1
    type: File
  _anonymous_output_17:
    outputSource: '16'
    type: File
steps:
  '10':
    in:
      reads:
        source: 8/outFile
      reference_source|ref:
        source: 2/output_fasta
    out:
    - realigned
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '11':
    in:
      results_0|software_cond|output_0|input:
        source: 8/metrics_file
    out:
    - plots
    - stats
    - html_report
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '12':
    in:
      reads:
        source: 10/realigned
      reference_source|ref:
        source: 2/output_fasta
    out:
    - variants
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '13':
    in:
      input:
        source: 12/variants
      snpDb|snpeff_db:
        source: 2/snpeff_output
    out:
    - snpeff_output
    - statsFile
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '14':
    in:
      input:
        source: 13/snpeff_output
    out: []
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '15':
    in:
      input1:
        source: 13/snpeff_output
    out:
    - output1
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '16':
    in:
      input_list:
        source: '14'
    out: []
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '2':
    in:
      input_type|input_gbk:
        source: 'GenBank file '
    out:
    - output_fasta
    - snpeff_output
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '3':
    in:
      single_paired|paired_input:
        source: Paired Collection (fastqsanger)
    out:
    - report_json
    - report_html
    - output_paired_coll
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '4':
    in:
      fastq_input|fastq_input1:
        source: 3/output_paired_coll
      reference_source|ref_file:
        source: 2/output_fasta
    out:
    - bam_output
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '5':
    in:
      results_0|software_cond|input:
        source: 3/report_json
    out:
    - plots
    - stats
    - html_report
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '6':
    in:
      input1:
        source: 4/bam_output
    out:
    - output1
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '7':
    in:
      input:
        source: 6/output1
    out: []
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '8':
    in:
      inputFile:
        source: 6/output1
    out:
    - outFile
    - metrics_file
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
  '9':
    in:
      results_0|software_cond|output_0|type|input:
        source: '7'
    out:
    - plots
    - stats
    - html_report
    run:
      class: Operation
      doc: ''
      inputs: {}
      outputs: {}
Note that the inputs and outputs fields are empty. For comparison, the following is the CWL we are currently generating with galaxy2cwl:
class: Workflow
cwlVersion: v1.2.0-dev2
doc: 'Abstract CWL Automatically generated from the Galaxy workflow file: COVID-19:
  PE Variation'
inputs:
  'GenBank file ':
    format: data
    type: File
  Paired Collection (fastqsanger):
    format: data
    type: File
outputs: {}
steps:
  10_Realign reads:
    in:
      reads: 8_MarkDuplicates/outFile
      reference_source|ref: 2_SnpEff build/output_fasta
    out:
    - realigned
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_iuc_lofreq_viterbi_lofreq_viterbi_2_1_3_1+galaxy1
      inputs:
        reads:
          format: Any
          type: File
        reference_source|ref:
          format: Any
          type: File
      outputs:
        realigned:
          doc: bam
          type: File
  11_MultiQC:
    in:
      results_0|software_cond|output_0|input: 8_MarkDuplicates/metrics_file
    out:
    - stats
    - plots
    - html_report
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_iuc_multiqc_multiqc_1_7_1
      inputs:
        results_0|software_cond|output_0|input:
          format: Any
          type: File
      outputs:
        html_report:
          doc: html
          type: File
        plots:
          doc: input
          type: File
        stats:
          doc: input
          type: File
  12_Call variants:
    in:
      reads: 10_Realign reads/realigned
      reference_source|ref: 2_SnpEff build/output_fasta
    out:
    - variants
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_iuc_lofreq_call_lofreq_call_2_1_3_1+galaxy0
      inputs:
        reads:
          format: Any
          type: File
        reference_source|ref:
          format: Any
          type: File
      outputs:
        variants:
          doc: vcf
          type: File
  13_SnpEff eff:
    in:
      input: 12_Call variants/variants
      snpDb|snpeff_db: 2_SnpEff build/snpeff_output
    out:
    - snpeff_output
    - statsFile
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_iuc_snpeff_snpEff_4_3+T_galaxy1
      inputs:
        input:
          format: Any
          type: File
        snpDb|snpeff_db:
          format: Any
          type: File
      outputs:
        snpeff_output:
          doc: vcf
          type: File
        statsFile:
          doc: html
          type: File
  14_SnpSift Extract Fields:
    in:
      input: 13_SnpEff eff/snpeff_output
    out:
    - output
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_iuc_snpsift_snpSift_extractFields_4_3+t_galaxy0
      inputs:
        input:
          format: Any
          type: File
      outputs:
        output:
          doc: tabular
          type: File
  15_Convert VCF to VCF_BGZIP:
    in:
      input1: 13_SnpEff eff/snpeff_output
    out:
    - output1
    run:
      class: Operation
      id: CONVERTER_vcf_to_vcf_bgzip_0
      inputs:
        input1:
          format: Any
          type: File
      outputs:
        output1:
          doc: vcf_bgzip
          type: File
  16_Collapse Collection:
    in:
      input_list: 14_SnpSift Extract Fields/output
    out:
    - output
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_nml_collapse_collections_collapse_dataset_4_1
      inputs:
        input_list:
          format: Any
          type: File
      outputs:
        output:
          doc: input
          type: File
  2_SnpEff build:
    in:
      input_type|input_gbk: 'GenBank file '
    out:
    - snpeff_output
    - output_fasta
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_iuc_snpeff_snpEff_build_gb_4_3+T_galaxy4
      inputs:
        input_type|input_gbk:
          format: Any
          type: File
      outputs:
        output_fasta:
          doc: fasta
          type: File
        snpeff_output:
          doc: snpeffdb
          type: File
  3_fastp:
    in:
      single_paired|paired_input: Paired Collection (fastqsanger)
    out:
    - output_paired_coll
    - report_html
    - report_json
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_iuc_fastp_fastp_0_19_5+galaxy1
      inputs:
        single_paired|paired_input:
          format: Any
          type: File
      outputs:
        output_paired_coll:
          doc: input
          type: File
        report_html:
          doc: html
          type: File
        report_json:
          doc: json
          type: File
  4_Map with BWA-MEM:
    in:
      fastq_input|fastq_input1: 3_fastp/output_paired_coll
      reference_source|ref_file: 2_SnpEff build/output_fasta
    out:
    - bam_output
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_devteam_bwa_bwa_mem_0_7_17_1
      inputs:
        fastq_input|fastq_input1:
          format: Any
          type: File
        reference_source|ref_file:
          format: Any
          type: File
      outputs:
        bam_output:
          doc: bam
          type: File
  5_MultiQC:
    in:
      results_0|software_cond|input: 3_fastp/report_json
    out:
    - stats
    - plots
    - html_report
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_iuc_multiqc_multiqc_1_7_1
      inputs:
        results_0|software_cond|input:
          format: Any
          type: File
      outputs:
        html_report:
          doc: html
          type: File
        plots:
          doc: input
          type: File
        stats:
          doc: input
          type: File
  6_Filter SAM or BAM, output SAM or BAM:
    in:
      input1: 4_Map with BWA-MEM/bam_output
    out:
    - output1
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_devteam_samtool_filter2_samtool_filter2_1_8+galaxy1
      inputs:
        input1:
          format: Any
          type: File
      outputs:
        output1:
          doc: sam
          type: File
  7_Samtools stats:
    in:
      input: 6_Filter SAM or BAM, output SAM or BAM/output1
    out:
    - output
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_devteam_samtools_stats_samtools_stats_2_0_2+galaxy2
      inputs:
        input:
          format: Any
          type: File
      outputs:
        output:
          doc: tabular
          type: File
  8_MarkDuplicates:
    in:
      inputFile: 6_Filter SAM or BAM, output SAM or BAM/output1
    out:
    - metrics_file
    - outFile
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_devteam_picard_picard_MarkDuplicates_2_18_2_2
      inputs:
        inputFile:
          format: Any
          type: File
      outputs:
        metrics_file:
          doc: txt
          type: File
        outFile:
          doc: bam
          type: File
  9_MultiQC:
    in:
      results_0|software_cond|output_0|type|input: 7_Samtools stats/output
    out:
    - stats
    - plots
    - html_report
    run:
      class: Operation
      id: toolshed_g2_bx_psu_edu_repos_iuc_multiqc_multiqc_1_7_1
      inputs:
        results_0|software_cond|output_0|type|input:
          format: Any
          type: File
      outputs:
        html_report:
          doc: html
          type: File
        plots:
          doc: input
          type: File
        stats:
          doc: input
          type: File
I've opened a draft PR from the branch to make it easier to track changes.
Came up at the 2020 Elixir biohackathon.
Experimented with this in https://github.com/ResearchObject/ro-crate-py/tree/gxformat2_cwl_conv. Here are the changes. I checked the output from converting test/test-data/test_galaxy_wf.ga, and the one output by gxformat2 is very different from the one obtained with galaxy2cwl. I'm not even sure the latter is a valid CWL workflow. Did I use the gxformat2 API in the wrong way? If not, maybe this needs to be checked by a CWL expert.