ankostis opened 5 years ago
Hello @ankostis and thank you for your proposal.
If there is already a REST/HTTP API then there is no change needed to schema-salad
as it already supports HTTP(S) URLs.
Hmm... yes, you're right. I forgot to mention that I wanted this to work locally, from a file:// URL.
Would this be for just the initial input data to a CWL workflow, or did you envision this access pattern being used between steps or at the end?
The first case is possible today, with local REST endpoint serving the input data.
I have just a single workflow step with complex input data, so I figured I could use schema-salad alone, like jsonschema on steroids. Does that make sense?
In any case, the salad will be parsed internally in my process, and I'm looking for the optimal way to patch this library so that it supports extracting data from different binary file types, for which fetch_text() does not make sense.
And I'd like your opinion on whether this is a totally doomed direction.
@ankostis The quick solution is to inject a custom Fetcher using the fetcher_constructor argument of Loader. You would then implement your custom fetch_text(), which might involve reading the binary file, serializing it to JSON, and returning the string to schema-salad to be re-parsed, processed, and validated.
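A minimal sketch of that pattern. Note the class below is only a stand-in for schema-salad's Fetcher base class (the real one lives in schema_salad.ref_resolver, and Loader's fetcher_constructor passes it extra arguments such as a cache and an HTTP session); the toy binary layout is likewise an assumption for illustration:

```python
import json
import struct

class BinaryAwareFetcher:
    """Stand-in for a schema-salad Fetcher subclass; only fetch_text()
    matters for this sketch. A real implementation would subclass
    schema_salad.ref_resolver.DefaultFetcher and be handed to Loader
    via its fetcher_constructor argument."""

    def fetch_text(self, url):
        path = url[len("file://"):] if url.startswith("file://") else url
        if path.endswith(".bin"):
            # Toy binary layout (an assumption): a little-endian uint32
            # count followed by that many int32 values.
            with open(path, "rb") as f:
                (count,) = struct.unpack("<I", f.read(4))
                values = list(struct.unpack(f"<{count}i", f.read(4 * count)))
            # Serialize to JSON text; schema-salad will re-parse,
            # process and validate it as if it had been a text document.
            return json.dumps({"values": values})
        with open(path, "r") as f:  # ordinary text documents pass through
            return f.read()
```

Wiring it in would then look roughly like Loader(ctx, fetcher_constructor=BinaryAwareFetcher), although the exact constructor signature expected depends on the schema-salad version.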
A more complete solution might be to optionally move the parsing to the other side of the Fetcher interface, adding something like a fetch_structured() method. However, in order to support line numbers in error reporting, schema-salad assumes ruamel.yaml types (CommentedMap and CommentedSeq), so you can't just return plain Python dicts and lists.
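To make that concrete, a fetcher returning parsed data would have to wrap it in ruamel.yaml's container types first. The recursive wrapper below is only a sketch (the import fallback is there so it runs without ruamel.yaml installed; note also that data parsed from a binary file has no source lines, so any line-number information attached to these containers would be synthetic):

```python
try:
    from ruamel.yaml.comments import CommentedMap, CommentedSeq
except ImportError:
    # Fallback so the sketch stays runnable without ruamel.yaml; real
    # schema-salad requires the Commented* types for error reporting.
    CommentedMap, CommentedSeq = dict, list

def to_commented(obj):
    """Recursively wrap plain dicts/lists in the container types that
    schema-salad expects from already-parsed documents."""
    if isinstance(obj, dict):
        out = CommentedMap()
        for key, value in obj.items():
            out[key] = to_commented(value)
        return out
    if isinstance(obj, list):
        out = CommentedSeq()
        for value in obj:
            out.append(to_commented(value))
        return out
    return obj  # scalars pass through unchanged
```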
Great answer, thanks. For the first case, wouldn't that mean it slightly wastes CPU cycles, since the JSON would be parsed twice, once on each side of the fetcher interface?
Yes, it is somewhat inefficient, but worth trying as a proof of concept and then, once it works, looking at optimization.
Interesting discussion. I ran into this while considering how to integrate salad-based metadata into Zarr datasets. Very briefly, Zarr provides a data structure very similar to HDF5, but does so via multiple files. Relevant for this discussion: the metadata is stored as separate JSON files, which would be loadable via file:// URLs without the need for a service.
Would it be possible with slight changes in the resolving logic to support fetchers that produce YAML directly?
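For the Zarr case specifically, a fetcher that produces parsed data directly could be as small as the sketch below: it resolves a file:// URL to a local path and loads the JSON metadata (say, a .zattrs file) without any intermediate text round-trip. The function name mirrors the hypothetical fetch_structured() discussed earlier; it is not part of schema-salad today:

```python
import json
from urllib.parse import urlparse
from urllib.request import url2pathname

def fetch_structured(url):
    """Resolve a file:// URL to a local path and return the parsed JSON
    document (e.g. Zarr's .zattrs metadata) as plain Python data."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise ValueError("only file:// URLs are handled in this sketch: " + url)
    with open(url2pathname(parsed.path), "r") as f:
        return json.load(f)
```

As discussed above, the returned dicts/lists would still need to be converted to ruamel.yaml container types before schema-salad could validate them with useful error positions.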
What I have specifically in mind is to read HDF5 files, either with pandas or using the client library of the newly published HDFServer+HDFJson standards. For both, I would need to:
What would be the API changes needed to accommodate this?