RFE: fetch directly JSON from HDF5 files?

common-workflow-language / schema_salad

Semantic Annotations for Linked Avro Data

https://www.commonwl.org/v1.2/SchemaSalad.html

Apache License 2.0


RFE: fetch directly JSON from HDF5 files? #250

Open ankostis opened 5 years ago

ankostis commented 5 years ago

Would it be possible with slight changes in the resolving logic to support fetchers that produce YAML directly?

What I have specifically in mind is to read HDF5 files, either with pandas or using the client library of the newly published HDFServer+HDFJson standards. For both, I would need to:

- craft "deep-linking" URLs into the HDF5 files,
- modify the fetch procedure to use one of the two access methods.
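As an illustration of the first point, here is a minimal sketch of what a "deep-linking" URL could look like and how it might be taken apart. The `node` query parameter, the `DeepLink` type, and `parse_deep_link()` are hypothetical names invented for this example; a query parameter is used instead of a fragment because fragments already carry identifier meaning in Schema Salad documents.

```python
from typing import NamedTuple
from urllib.parse import parse_qs, urlsplit
from urllib.request import url2pathname


class DeepLink(NamedTuple):
    file_path: str      # local path of the container file
    internal_path: str  # path of the node inside the container


def parse_deep_link(url: str) -> DeepLink:
    """Split a file:// URL into the local file and the node inside it.

    Hypothetical convention: the "node" query parameter names the
    internal path; it defaults to the root node "/".
    """
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError(f"expected a file:// URL, got {url!r}")
    query = parse_qs(parts.query)
    node = query.get("node", ["/"])[0]
    return DeepLink(url2pathname(parts.path), node)


link = parse_deep_link("file:///data/run.h5?node=/experiment/inputs")
root = parse_deep_link("file:///data/run.h5")
```

A custom fetcher could then open `link.file_path` with whichever HDF5 access method is chosen and read only the node at `link.internal_path`.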

What API changes would be needed to accommodate this?

mr-c commented 5 years ago

Hello @ankostis and thank you for your proposal.

If there is already a REST/HTTP API then there is no change needed to schema-salad as it already supports HTTP(S) URLs.

ankostis commented 5 years ago

Hmm... yes, you're right. I forgot to mention that I wanted this to work locally, from a file:// URL.

mr-c commented 5 years ago

Would this be for just the initial input data to a CWL workflow, or did you envision this access pattern being used between steps or at the end?

The first case is possible today, with a local REST endpoint serving the input data.

ankostis commented 5 years ago

I have just a single workflow step with complex input data, so I figured that I can use schema-salad alone, like jsonschema on steroids. Does that make sense?

In any case, the salad will be parsed internally in my process, and I'm looking for the best way to patch this library so that it can extract data from different binary file types, for which fetch_text() does not make sense. I would also like your opinion on whether this is a totally doomed direction.

tetron commented 5 years ago

@ankostis The quick solution is to inject a custom Fetcher using fetcher_constructor of Loader. You would then implement your custom fetch_text() which might involve reading the binary file, serializing to JSON, and returning the string to schema salad to be re-parsed, processed and validated.
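The quick solution could look roughly like the sketch below. It is deliberately standalone and does not import schema_salad: `BinaryFileFetcher` only mirrors the shape of a fetcher, and apart from `fetch_text()` the method names (`check_exists()`, `urljoin()`) and the `(cache, session)` constructor arguments are assumptions that should be checked against the installed schema_salad version. `read_binary_as_dict()` and its toy binary format are hypothetical stand-ins for a real HDF5 reader such as h5py or pandas.

```python
import json
import os
import struct
import tempfile
from pathlib import Path
from urllib.parse import urljoin, urlsplit
from urllib.request import url2pathname


def read_binary_as_dict(path):
    """Hypothetical stand-in for a real HDF5 reader.

    Toy format: a little-endian uint32 count followed by that many
    float64 values. It is purely illustrative.
    """
    data = Path(path).read_bytes()
    (count,) = struct.unpack_from("<I", data, 0)
    values = struct.unpack_from(f"<{count}d", data, 4)
    return {"values": list(values)}


class BinaryFileFetcher:
    """Mirrors the assumed fetch_text/check_exists/urljoin interface."""

    def __init__(self, cache=None, session=None):
        # Assumed constructor arguments; unused in this sketch.
        self.cache = cache
        self.session = session

    def _local_path(self, url):
        parts = urlsplit(url)
        if parts.scheme != "file":
            raise ValueError(f"unsupported URL scheme in {url!r}")
        return url2pathname(parts.path)

    def fetch_text(self, url, content_types=None):
        # Read the binary file, then serialize it to JSON so the
        # caller can re-parse, process and validate the text.
        return json.dumps(read_binary_as_dict(self._local_path(url)))

    def check_exists(self, url):
        return os.path.exists(self._local_path(url))

    def urljoin(self, base_url, url):
        return urljoin(base_url, url)


# Usage: write a toy binary file and fetch it back as JSON text.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "inputs.bin"
    sample.write_bytes(struct.pack("<I3d", 3, 1.0, 2.5, 4.0))
    fetcher = BinaryFileFetcher()
    text = fetcher.fetch_text(sample.as_uri())
    exists = fetcher.check_exists(sample.as_uri())
```

In a real integration the class would presumably subclass the library's Fetcher and be passed as `fetcher_constructor` when building the Loader, as described above.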

A more complete solution might be to optionally move the parsing over to the other side of the Fetcher interface, adding something like a fetch_structured() method. However, in order to support line numbers in error reporting, schema salad assumes ruamel.yaml types (CommentedMap and CommentedSeq), so you can't just return plain Python dicts and lists.

ankostis commented 5 years ago

Great answer. Thanks. For the first case, wouldn't that slightly waste CPU cycles, since the JSON would be parsed twice, once on each side of the fetcher interface?

tetron commented 5 years ago

Yes, it is somewhat inefficient, but it is worth trying as a proof of concept; once it works, we can look at optimization.

joshmoore commented 3 years ago

Interesting discussion. I ran into this while considering how to integrate salad-based metadata into Zarr datasets. Very briefly, Zarr provides a data structure very similar to HDF5's but does so via multiple files. Relevant for this discussion: the metadata is stored as separate JSON files, which would be loadable via file:/// without the need for a service.
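To make the last point concrete, here is a small sketch that reads Zarr-style metadata through a plain file:/// URL using only the standard library. It assumes the Zarr v2 layout, where a group's attributes live in a `.zattrs` JSON file; the directory name and attribute values are made up for the example.

```python
import json
import tempfile
from pathlib import Path
from urllib.request import urlopen

with tempfile.TemporaryDirectory() as tmp:
    group = Path(tmp) / "example.zarr"
    group.mkdir()
    # Hypothetical attribute content; a real dataset would already
    # contain this file, written by the Zarr library.
    (group / ".zattrs").write_text(
        json.dumps({"title": "demo", "axes": ["y", "x"]})
    )

    attrs_url = (group / ".zattrs").as_uri()  # a file:/// URL
    # No service is involved: the JSON is read straight from disk.
    with urlopen(attrs_url) as response:
        attrs = json.loads(response.read().decode("utf-8"))
```

Because the metadata is already JSON text, a URL like `attrs_url` could be handed to a file-capable fetcher without any binary-to-JSON conversion step.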