common-workflow-language / cwl-utils

Python utilities for CWL
https://cwl-utils.readthedocs.io/
Apache License 2.0
Documentation for importing packed files #71

Open alexiswl opened 3 years ago

alexiswl commented 3 years ago

Hello,

Been playing around with how to import a packed cwl json file as a CWL parser object.

Here are my steps so far

Setup

# Imports
from pathlib import Path
import json
import sys

# Set path
cwl_file_path = Path("/path/to/cwl.packed.json")

# Load file as dict
# Read in the cwl file from a json
with open(cwl_file_path, "r") as cwl_h:
    cwl_file_dict = json.load(cwl_h)

# Conditional import based on cwl version
if 'cwlVersion' not in cwl_file_dict:
    print("Error - could not get the cwlVersion")
    sys.exit(1)
# Import parser based on CWL Version
if cwl_file_dict['cwlVersion'] == 'v1.0':
    from cwl_utils import parser_v1_0 as parser
elif cwl_file_dict['cwlVersion'] == 'v1.1':
    from cwl_utils import parser_v1_1 as parser
elif cwl_file_dict['cwlVersion'] == 'v1.2':
    from cwl_utils import parser_v1_2 as parser
else:
    print("Version error. Did not recognise {} as a CWL version".format(cwl_file_dict['cwlVersion']))
    sys.exit(1)
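The version check above could also be written as a lookup table, which keeps the version-to-module mapping in one place. This is only a minimal sketch: the module names are the ones imported above, while `PARSER_MODULES`, `parser_module_name`, and `load_parser` are hypothetical names chosen for illustration and are not part of cwl-utils.

```python
import importlib
from types import ModuleType
from typing import Any, Mapping

# Module names taken from the imports above; the table itself is illustrative.
PARSER_MODULES = {
    "v1.0": "cwl_utils.parser_v1_0",
    "v1.1": "cwl_utils.parser_v1_1",
    "v1.2": "cwl_utils.parser_v1_2",
}


def parser_module_name(cwl_doc: Mapping[str, Any]) -> str:
    """Return the parser module name matching the document's cwlVersion."""
    try:
        version = cwl_doc["cwlVersion"]
    except KeyError:
        raise ValueError("Could not get the cwlVersion") from None
    try:
        return PARSER_MODULES[version]
    except KeyError:
        raise ValueError(f"Did not recognise {version!r} as a CWL version") from None


def load_parser(cwl_doc: Mapping[str, Any]) -> ModuleType:
    """Import and return the matching parser module (requires cwl_utils to be installed)."""
    return importlib.import_module(parser_module_name(cwl_doc))
```

With this, `parser = load_parser(cwl_file_dict)` would stand in for the if/elif chain, raising `ValueError` instead of calling `sys.exit`.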

First attempt:

Use the load document feature


parser.load_document(cwl_file_dict, cwl_file_path.absolute().as_uri()) 

SchemaSaladException: Cannot load $import without fileuri


Second attempt

Convert to string then load

parser.load_document_by_string(json.dumps(cwl_file_dict), cwl_file_path.absolute().as_uri())

ValidationException: - tried _RecordLoader but
  Expected a dict
- tried _RecordLoader but
  Expected a dict
...

Third attempt

Convert to yaml then load


# We need to import the ruamel yaml class
from ruamel import yaml
# Dump our dict to a yaml string
cwl_yaml_dump = yaml.round_trip_dump(cwl_file_dict, Dumper=yaml.RoundTripDumper)
# Load yaml
cwl_yaml_load = yaml.round_trip_load(cwl_yaml_dump, preserve_quotes=True)
# Import 
parser.load_document_by_yaml(cwl_yaml_load, cwl_file_path.absolute().as_uri())

ValidationException: - tried _RecordLoader but Expected a dict

Fourth attempt

Convert just graph to yaml then load


# We need to import the ruamel yaml class
from ruamel import yaml
# Dump our dict to a yaml string
cwl_yaml_dump = yaml.round_trip_dump(cwl_file_dict['$graph'], Dumper=yaml.RoundTripDumper)
# Load yaml
cwl_yaml_load = yaml.round_trip_load(cwl_yaml_dump, preserve_quotes=True)
# Import 
parser.load_document_by_yaml(cwl_yaml_load, cwl_file_path.absolute().as_uri())

ValidationException: - tried _RecordLoader but Expected a dict

Is this due to my workflow being a little bit too complicated for the parser and using record schemas?

mr-c commented 3 years ago

Hey @alexiswl ; can you put an example packed workflow that exhibits this issue on https://gist.github.com/ or similar and drop the link here?

alexiswl commented 3 years ago

https://gist.github.com/alexiswl/5dd2bf9639f1b539a8c2dd4170f96ea7

mr-c commented 3 years ago

@alexiswl FYI, that file has ids in its custom types, which is not formally part of the CWL standard: https://www.commonwl.org/v1.2/CommandLineTool.html#CommandInputRecordSchema

alexiswl commented 3 years ago

Hi @mr-c, do you know why this might be? The raw yaml is now publicly accessible at https://github.com/umccr/cwl-ica/blob/main/workflows/bcl-conversion/3.7.5/bcl-conversion__3.7.5.cwl

None of the schemas present have the id attribute in them either:

At the moment, in order to import these workflows that contain schemas through the CWL parser, I have to first import the schema object and then manually append the schema object to the namespace.

See:
https://github.com/umccr/cwl-ica/blob/main/src/classes/cwl.py#L135-L154

For packed cwl files this would be a little more difficult, because I first need to find the SchemaDefRequirement inside the graph and add its types to the $namespaces attribute of the graph.

I guess something like so would be a possible way to grab the schemas required for the workflow.

$ cwltool --pack bcl-conversion__3.7.5.cwl | \
jq --raw-output '.["$graph"][-1].requirements[] | select(.class=="SchemaDefRequirement") | .types[] | .["$import"]'

#settings-by-samples__1.0.0.yaml
#fastq-list-row__1.0.0.yaml

Where the jq component of this would be done in python.
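A rough Python equivalent of that jq filter might look like the sketch below. It assumes the packed document has already been loaded into a dict (for example with `json.load`) and that `requirements` is in list form, as in the jq expression; `schema_imports` and the `example` document are hypothetical and only for illustration. Unlike the jq filter, it skips types that have no `$import` key rather than emitting null for them.

```python
from typing import Any, Dict, List


def schema_imports(packed: Dict[str, Any]) -> List[str]:
    """Collect "$import" values from SchemaDefRequirement in the last $graph entry."""
    last_process = packed["$graph"][-1]
    imports: List[str] = []
    for requirement in last_process.get("requirements", []):
        if requirement.get("class") != "SchemaDefRequirement":
            continue
        for schema_type in requirement.get("types", []):
            # Inline type definitions carry no "$import" key, so skip them.
            if isinstance(schema_type, dict) and "$import" in schema_type:
                imports.append(schema_type["$import"])
    return imports


# Made-up packed document with the same shape the jq filter expects.
example = {
    "$graph": [
        {"class": "CommandLineTool", "id": "#tool.cwl"},
        {
            "class": "Workflow",
            "id": "#main",
            "requirements": [
                {"class": "InlineJavascriptRequirement"},
                {
                    "class": "SchemaDefRequirement",
                    "types": [
                        {"$import": "#settings-by-samples__1.0.0.yaml"},
                        {"$import": "#fastq-list-row__1.0.0.yaml"},
                    ],
                },
            ],
        },
    ]
}

print(schema_imports(example))
```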

Still, it nonetheless seems quite hacky that this is a requirement.

mr-c commented 3 years ago

@alexiswl As you can see, your helpful example has launched many fixes to cwltool --pack, the code in schema_salad that produces the parsers, and the schema of the CWL standards themselves (!).

Ultimately (when all is done, merged, and released) the answer to your question will be "Load the packed document like any other." :-)

FYI, here is my variation on your testing script

"""
Import a cwl file as a parser object
"""

import sys
from pathlib import Path

from schema_salad.utils import yaml_no_ts 
# ^^ requires schema_salad >= 8.2
# does preserve_quotes=True and more

# Set path
cwl_file_path = Path(sys.argv[1])

# Load file as yaml dict
# Read in the cwl file from a json/yaml
with open(cwl_file_path, "r") as cwl_h:
    cwl_file_yaml = yaml_no_ts().load(cwl_h)

# Conditional import based on cwl version
if 'cwlVersion' not in cwl_file_yaml:
    print("Error - could not get the cwlVersion")
    sys.exit(1)
# Import parser based on CWL Version
if cwl_file_yaml['cwlVersion'] == 'v1.0':
    from cwl_utils import parser_v1_0 as parser
elif cwl_file_yaml['cwlVersion'] == 'v1.1':
    from cwl_utils import parser_v1_1 as parser
elif cwl_file_yaml['cwlVersion'] == 'v1.2':
    from cwl_utils import parser_v1_2 as parser
else:
    print("Version error. Did not recognise {} as a CWL version".format(cwl_file_yaml['cwlVersion']))
    sys.exit(1)

doc = parser.load_document_by_yaml(cwl_file_yaml, cwl_file_path.absolute().as_uri())

alexiswl commented 3 years ago

Thanks for this @mr-c! I appreciate the feedback and very happy to know that this has fixed multiple parts!

Do you recommend the yaml_no_ts from https://github.com/common-workflow-language/schema_salad/blob/main/schema_salad/utils.py#L133 over ruamel's 'round-trip-load' from https://sourceforge.net/p/ruamel-yaml/code/ci/default/tree/main.py#l1132 ?

Is the only difference the loading of timestamps?

mr-c commented 3 years ago

Is the only difference the loading of timestamps?

Correct. Probably not needed in your case

mr-c commented 3 years ago

@alexiswl Can you try packing with https://github.com/rabix/sbpack ?

alexiswl commented 3 years ago

Thanks for the suggestion @mr-c, looks like this would handle most of the workarounds we're currently doing. Is there a 'local-only' functionality of this tool / a way to import a local packed file? We don't use the Seven Bridges endpoint.

mr-c commented 3 years ago

Oh, I should have been more specific! It includes a local-only tool named cwlpack.