common-workflow-language / cwl-utils

Python utilities for CWL
https://cwl-utils.readthedocs.io/
Apache License 2.0

Importing tools that use `SchemaDefRequirement` requires schema to be present in $namespaces #74

Closed alexiswl closed 3 years ago

alexiswl commented 3 years ago

I have the following tool that takes a list of contig 'objects' and converts them into a regions BED file.

The tool

The tool custom-create-regions-bed-from-contigs-list__1.0.0.cwl is a valid CWL tool and is shown below.

```yaml
cwlVersion: v1.1
class: CommandLineTool

# Extensions
$namespaces:
  s: https://schema.org/
  ilmn-tes: http://platform.illumina.com/rdf/ica/
$schemas:
  - https://schema.org/version/latest/schemaorg-current-http.rdf

# Metadata
s:author:
  class: s:Person
  s:name: Alexis Lucattini
  s:email: Alexis.Lucattini@umccr.org
  s:identifier: https://orcid.org/0000-0001-9754-647X

# ID/Docs
id: custom-create-regions-bed-from-contigs-list--1.0.0
label: custom-create-regions-bed-from-contigs-list v(1.0.0)
doc: |
  create a bed file from a list of contigs objects

hints:
  ResourceRequirement:
    ilmn-tes:resources:
      tier: standard
      type: standard
      size: small
    coresMin: 1
    ramMin: 2000
  DockerRequirement:
    dockerPull: umccr/alpine-pandas:latest

requirements:
  SchemaDefRequirement:
    types:
      - $import: contig__1.0.0.yaml
  InlineJavascriptRequirement: {}
  InitialWorkDirRequirement:
    listing:
      - entryname: get_regions_bed.py
        entry: |
          #!/usr/bin/env python3

          """
          Import args
          Collect args and confirm
          Generate regions bed from args
          """

          # Imports
          import pandas as pd
          import argparse
          import json
          from itertools import chain
          import sys
          from pathlib import Path

          # Globals
          OUTPUT_COLUMNS = ["chromosome", "start", "end"]

          # Inputs
          def get_args():
              """
              Get arguments for the command
              """
              parser = argparse.ArgumentParser(description="Create regions bed from contigs object list")

              # Arguments
              parser.add_argument("--output-regions-bed", required=True,
                                  help="Path to output bed file")
              parser.add_argument("--contig", action="append", nargs='*', required=True,
                                  help="Each of the contig objects")

              return parser.parse_args()

          # Check args
          def set_args(args):
              """
              Check arguments
              """

              # Create directory for bed file
              parent_dir = Path(getattr(args, "output_regions_bed", None)).parent
              parent_dir.mkdir(parents=True, exist_ok=True)

              # Each --contig flag arrives as a one-item list holding a JSON string
              contigs_arg = getattr(args, "contig", [])

              contigs = []

              for contig in contigs_arg:
                  contigs.append(json.loads(contig[0]))

              setattr(args, "contigs_list", contigs)

              return args

          # Create DF from args dict
          def create_regions_bed_from_contigs(contigs):
              """
              Create a dataframe from the set args output
              """

              # Create dataframe from args dict
              regions_df = pd.DataFrame(contigs)

              # Return data frame
              return regions_df

          def finalise_output_df(regions_df):
              """
              Returns the regions bed with the right column order.
              They should already be in this order but just to make sure
              """

              regions_df = regions_df.reindex(columns=OUTPUT_COLUMNS)

              return regions_df

          def write_regions_obj_to_bed(regions_df, output_file):
              """
              Write the regions_df to the specified output file
              """
              regions_df.to_csv(output_file, sep="\t", header=False, index=False)

          def main():
              # Get args
              args = get_args()

              # Get args dict from args and check args
              args = set_args(args)

              # Create df from args dict
              regions_df = create_regions_bed_from_contigs(args.contigs_list)

              # Construct output dfs
              regions_df = finalise_output_df(regions_df)

              # Write out csv
              write_regions_obj_to_bed(regions_df, args.output_regions_bed)

          if __name__ == "__main__":
              main()

baseCommand: [ "python", "get_regions_bed.py" ]

inputs:
  contig_list:
    label: List of contigs
    doc: |
      Each contig has the following attributes:
        * chromosome
        * start
        * end
    type:
      - type: array
        items: contig__1.0.0.yaml#contig
        inputBinding:
          prefix: "--contig="
          separate: false
          valueFrom: |
            ${
              return JSON.stringify(self);
            }
    inputBinding:
      # Makes sure all items are together
      position: 1
  regions_bed:
    label: output file name for the regions bed file
    doc: |
      The output regions bed file name
    type: string?
    default: "regions.bed"
    inputBinding:
      prefix: "--output-regions-bed"

outputs:
  regions_bed_out:
    label: regions bed out
    doc: |
      This is the output of the regions bed file
    type: File
    outputBinding:
      glob: "$(inputs.regions_bed)"

successCodes:
  - 0
```
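To make the tool's calling convention concrete: the `valueFrom` expression serialises each contig with `JSON.stringify`, so the script receives one `--contig=<json>` argument per array item. The following is a minimal sketch of the receiving side using only the standard library. The contig values are made up for illustration, and the BED rows are built by hand rather than with pandas, so this is not the tool's actual script.

```python
import argparse
import json

OUTPUT_COLUMNS = ["chromosome", "start", "end"]


def parse_contigs(argv):
    """Parse repeated --contig=<json> arguments into a list of dicts."""
    parser = argparse.ArgumentParser()
    # Same option definition as in the tool's embedded script
    parser.add_argument("--contig", action="append", nargs="*", required=True)
    args = parser.parse_args(argv)
    # action="append" with nargs="*" yields a list of lists, one inner list per flag
    return [json.loads(values[0]) for values in args.contig]


def to_bed_lines(contigs):
    """Render contigs as tab-separated lines in OUTPUT_COLUMNS order."""
    return ["\t".join(str(contig[col]) for col in OUTPUT_COLUMNS) for contig in contigs]


# Hypothetical command line for a two-item contig_list (values are invented)
example_argv = [
    '--contig={"chromosome":"chr1","start":1,"end":1000}',
    '--contig={"chromosome":"chr2","start":500,"end":2500}',
]
contigs = parse_contigs(example_argv)
bed_lines = to_bed_lines(contigs)
print("\n".join(bed_lines))
```

Because each flag carries exactly one JSON string, taking `values[0]` mirrors the `contig[0]` lookup in the embedded script.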

The schema

The schema file contig__1.0.0.yaml has the following contents

```yaml
type: record
name: contig
fields:
  chromosome:
    label: chromosome
    doc: |
      The name of the chromosome
    type: string
  start:
    label: start position
    doc: |
      The start position of the chromosome of the region
    type: int?
  end:
    label: end position
    doc: |
      The end position of the chromosome of the region
    type: int?
```
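For readers less familiar with CWL record schemas, the record above is roughly analogous to the Python dataclass below: `chromosome` is required, while the `int?` type marks `start` and `end` as optional. This is only an illustrative analogy. The `Contig` class is hypothetical and is not part of cwl-utils.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Contig:
    """Hypothetical Python mirror of the `contig` record schema."""

    chromosome: str              # type: string (required)
    start: Optional[int] = None  # type: int? (optional)
    end: Optional[int] = None    # type: int? (optional)


# Only the required field is set; the optional ones default to None
whole_chromosome = Contig(chromosome="chr1")
region = Contig(chromosome="chr2", start=500, end=2500)

# Serialise one record, similar in spirit to JSON.stringify(self) in the tool
region_json = json.dumps(asdict(region))
print(region_json)
```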

The code

I use the following code to import the cwl file through the cwl-utils parser object

```python
# Imports
from cwl_utils import parser_v1_1 as parser
from pathlib import Path
from ruamel import yaml

# Vars
cwl_tool_file_path = Path("custom-create-regions-bed-from-contigs-list__1.0.0.cwl")
cwl_schema_file_path = Path("contig__1.0.0.yaml")

# Read in the cwl file from a yaml
with open(cwl_tool_file_path, "r") as cwl_h:
    cwl_tool_yaml_obj = yaml.main.round_trip_load(cwl_h, preserve_quotes=True)

# Load the document and get the following error
parser.load_document_by_yaml(cwl_tool_yaml_obj, cwl_tool_file_path.absolute().as_uri())
```

Traceback error

```
Traceback (most recent call last):
  File "", line 1, in
  File "/home/alexiswl/anaconda3/envs/cwl-ica/lib/python3.8/site-packages/cwl_utils/parser_v1_1.py", line 12164, in load_document_by_yaml
    return _document_load(union_of_CommandLineToolLoader_or_ExpressionToolLoader_or_WorkflowLoader_or_array_of_union_of_CommandLineToolLoader_or_ExpressionToolLoader_or_WorkflowLoader, yaml, uri, loadingOptions)
  File "/home/alexiswl/anaconda3/envs/cwl-ica/lib/python3.8/site-packages/cwl_utils/parser_v1_1.py", line 557, in _document_load
    return loader.load(doc, baseuri, loadingOptions, docRoot=baseuri)
  File "/home/alexiswl/anaconda3/envs/cwl-ica/lib/python3.8/site-packages/cwl_utils/parser_v1_1.py", line 394, in load
    raise ValidationException("", None, errors, "-")
schema_salad.exceptions.ValidationException:
- tried _RecordLoader but
  Trying 'CommandLineTool'
    the `inputs` field is not valid because:
      - tried _ArrayLoader but
        Expected a list
      - tried _RecordLoader but
        Trying 'CommandInputParameter'
          custom-create-regions-bed-from-contigs-list__1.0.0.cwl:162:5: the `type` field is not valid because:
            - tried _EnumLoader but
              Expected one of ('File', 'Directory')
            - tried _EnumLoader but
              Expected one of ('stdin',)
            - tried _RecordLoader but
              Expected a dict
            - tried _RecordLoader but
              Expected a dict
            - tried _RecordLoader but
              Expected a dict
            - tried _PrimitiveLoader but
              Expected a tuple but got list
            - tried _ArrayLoader but
              - tried _ArrayLoader but
                Expected a list
              - tried _UnionLoader but
                - tried _EnumLoader but
                  Expected one of ('File', 'Directory')
                - tried _RecordLoader but
                  Trying 'CommandInputRecordSchema'
                    custom-create-regions-bed-from-contigs-list__1.0.0.cwl:163:9: the `type` field is not valid because:
                      Expected one of ('record',)
                    custom-create-regions-bed-from-contigs-list__1.0.0.cwl:164:9: invalid field `items`, expected one of: `fields`, `type`, `label`, `doc`, `name`, `inputBinding`
                custom-create-regions-bed-from-contigs-list__1.0.0.cwl:162:5: - tried _RecordLoader but
                  Trying 'CommandInputEnumSchema'
                    custom-create-regions-bed-from-contigs-list__1.0.0.cwl:163:9: the `symbols` field is not valid because:
                      Expected a list
                    the `type` field is not valid because:
                      Expected one of ('enum',)
                    custom-create-regions-bed-from-contigs-list__1.0.0.cwl:164:9: invalid field `items`, expected one of: `symbols`, `type`, `label`, `doc`, `name`, `inputBinding`
                custom-create-regions-bed-from-contigs-list__1.0.0.cwl:162:5: - tried _RecordLoader but
                  Trying 'CommandInputArraySchema'
                    custom-create-regions-bed-from-contigs-list__1.0.0.cwl:164:9: the `items` field is not valid because:
                      Term 'contig__1.0.0.yaml#contig' not in vocabulary
                custom-create-regions-bed-from-contigs-list__1.0.0.cwl:162:5: - tried _PrimitiveLoader but
                  Expected a tuple but got CommentedMap
- tried _RecordLoader but
  Not a ExpressionTool
- tried _RecordLoader but
  Not a Workflow
- tried _ArrayLoader but
  Expected a list
```

My digging around

The exception is raised on line 222 in the expand_url function.

The url variable is set to contig__1.0.0.yaml#contig.

I tried importing the loading options from the schema object into the vocab / rvocab objects, so that expand_url would simply return the url on line 172 instead of reaching the urlsplit call on line 217. However, it seems that loading options are not propagated through inputs anyway: when CommandInputArraySchema is initialised with no loading options it inherits the defaults, without also looking at the SchemaDefRequirement imports.
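To see why the bare reference is rejected, it helps to look at how it splits as a URL. The sketch below is a simplified, hypothetical stand-in for the check described above, not the actual `expand_url` code. It assumes a term is accepted if it is already in the vocabulary and is otherwise rejected when it has no URL scheme.

```python
from urllib.parse import urlsplit


def check_term(term, vocab):
    """Simplified stand-in for the vocabulary check (not the real expand_url)."""
    if term in vocab:
        # Early return: known terms are accepted as-is
        return term
    if not urlsplit(term).scheme:
        # A relative reference has no scheme, so it is treated as an unknown term
        raise ValueError(f"Term '{term}' not in vocabulary")
    return term


reference = "contig__1.0.0.yaml#contig"
parts = urlsplit(reference)
print(repr(parts.scheme), parts.path, parts.fragment)

# Without a vocabulary entry the reference is rejected
try:
    check_term(reference, vocab={})
    accepted_without_entry = True
except ValueError:
    accepted_without_entry = False

# With the reference registered as a term it passes through unchanged
accepted_with_entry = check_term(reference, vocab={reference: reference}) == reference
```

Under these assumptions, registering the reference as a known term lets it take the early-return path, so the scheme check is never reached.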

A quick workaround

The LoadingOptions initialiser does check the namespaces, however, so simply adding the following entry to the tool's $namespaces:

$namespaces:
    ...
    contig__1.0.0.yaml#contig: contig__1.0.0.yaml#contig

resolves the issue and I can import the cwl object through the parser.
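The same edit can be applied to an already-parsed document instead of by hand. This is a minimal sketch, assuming the tool has been loaded into a plain dict. `add_self_namespace` is a hypothetical helper name, not a cwl-utils function.

```python
def add_self_namespace(cwl_doc, term):
    """Map `term` to itself under $namespaces, creating the section if needed."""
    namespaces = cwl_doc.setdefault("$namespaces", {})
    namespaces[term] = term
    return cwl_doc


# Trimmed-down stand-in for the parsed tool document
tool_doc = {
    "cwlVersion": "v1.1",
    "class": "CommandLineTool",
    "$namespaces": {"s": "https://schema.org/"},
}
add_self_namespace(tool_doc, "contig__1.0.0.yaml#contig")
print(tool_doc["$namespaces"])

# A document with no $namespaces section gets one created
bare_doc = add_self_namespace({"class": "CommandLineTool"}, "contig__1.0.0.yaml#contig")
```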

Update

This doesn't work when the tool is used in a workflow in which another tool outputs this schema type and that output is the input to this tool.

Instead, I now apply the workaround by adding the namespace entry to the YAML object programmatically before loading it through the parser.

```python
# Imports
from cwl_utils import parser_v1_1 as parser
from pathlib import Path
from ruamel import yaml

# Vars
cwl_tool_file_path = Path("custom-create-regions-bed-from-contigs-list__1.0.0.cwl")

# Read in the cwl file from a yaml
with open(cwl_tool_file_path, "r") as cwl_h:
    cwl_tool_yaml_obj = yaml.main.round_trip_load(cwl_h, preserve_quotes=True)

# Check for schemas (each lookup falls back to empty if the level is missing)
requirements = cwl_tool_yaml_obj.get("requirements") or {}
schema_def_requirement = requirements.get("SchemaDefRequirement") or {}
schema_imports = schema_def_requirement.get("types") or []

for schema_import in schema_imports:
    # We need the relative import path and the schema path
    schema_relative_imports_path = schema_import.get("$import")
    schema_import_path = (cwl_tool_file_path.parent / Path(schema_relative_imports_path)).resolve()

    # Open the schema file
    with open(schema_import_path, "r") as schema_h:
        cwl_schema_yaml_obj = yaml.main.round_trip_load(schema_h, preserve_quotes=True)

    # Get the record name straight from the schema mapping
    schema_name = cwl_schema_yaml_obj.get("name")

    # Get schema string like 'contig__1.0.0.yaml#contig'
    schema_namespace_str = "#".join(map(str, [schema_relative_imports_path, schema_name]))

    # Add to namespace, creating the section if the tool has none
    if cwl_tool_yaml_obj.get("$namespaces") is None:
        cwl_tool_yaml_obj["$namespaces"] = {}
    cwl_tool_yaml_obj["$namespaces"][schema_namespace_str] = schema_namespace_str

# Load the document through the parser
parser.load_document_by_yaml(cwl_tool_yaml_obj, cwl_tool_file_path.absolute().as_uri())
```

mr-c commented 3 years ago

This appears to be fixed as of https://github.com/common-workflow-language/cwl-utils/commit/5dfb3d6e8157212c36a5ec516bc42af1072c6dea or earlier!