asdf-format / asdf-standard

Standards document describing ASDF, Advanced Scientific Data Format
http://asdf-standard.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Is this standard still maintained/current? #438

Closed acovaci closed 2 months ago

acovaci commented 2 months ago

Hi,

I was looking for a cross-filetype way to define a schema, and I came across this standard, which seems to solve some issues that filetype-specific standards have. It looks promising.

But looking through the docs, I'm running across a number of missing links/issues.

Most importantly, within the "Designing a new tag and schema" page for the YAML Schema, the following key-value pair is mandated:

$schema: http://stsci.edu/schemas/yaml-schema/draft-01

Following the link, there's no document available under that URI.

P.S. Maybe you should switch to https for it too? :)

braingram commented 2 months ago

Thanks for opening the issue and for looking into ASDF. The standard is actively maintained and please let me know if you find any missing links.

The http (and https) URIs (note: not URLs) have been a long-standing source of confusion and were one of the main motivations for moving new schemas to asdf:// URIs (here's an example in a schema from the upcoming Roman mission). There is a note in the documentation about this, but the TL;DR is that these are identifiers and do not have to resolve when visited with a web browser. In case it's helpful, the schema you mentioned is available in this repo.
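To illustrate the "identifier, not locator" point, here is a minimal sketch of how a tool could look up a schema by its URI with no network access. The table `LOCAL_SCHEMAS`, the function `resolve_schema`, and the file path are all hypothetical names made up for this example; they are not the asdf library's API or this repo's actual layout.

```python
# Hypothetical lookup table: schema URI (an identifier) -> local file path.
# The path is illustrative only, not the real location in any repository.
LOCAL_SCHEMAS = {
    "http://stsci.edu/schemas/yaml-schema/draft-01": "schemas/yaml-schema/draft-01.yaml",
}


def resolve_schema(uri: str) -> str:
    """Return the local path registered for a schema URI, without touching the network."""
    try:
        return LOCAL_SCHEMAS[uri]
    except KeyError:
        # An unregistered identifier is an error; we never try to fetch it.
        raise LookupError(f"no local schema registered for {uri!r}") from None
```

Because a lookup like this matches the URI as an exact string, an https spelling of the same address would be a different identifier, which is one reason changing the scheme is not a purely cosmetic fix.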

I'm curious, for your use does ASDF have some benefits over jsonschema? Please feel free to go into detail and let me know if there's anything I can do to help.

braingram commented 2 months ago

I'm going to mark this as closed as the title of the issue is addressed. Please feel free to continue to respond to this issue with details or open another issue if there are broken links.

acovaci commented 2 months ago

Hi @braingram, thanks for the swift reply.

I have indeed come across #274 earlier, and left my input as well.

In short, there are two types of projects I am currently working on, where ASDF sounds much closer to the type of tool I would need:

  1. An internal data warehouse, with heavy data pipelines running all over the place :)
    • We considered the following tools: dbt schemas, Pydantic models, JSON schema, Protobuf.
      • We're currently relying exclusively on dbt schemas, but this has a number of limitations, like being tool-specific, and not having an easy way to reuse "sub-schemas", as well as not being able to easily share them.
      • JSON schema would be a second candidate for this, since in this specific use case we wouldn't be using binary data (at least for now). But there's no off-the-shelf integration with the framework we're using. I can go ahead and write one, but as written below, I'd do so for the most decoupled tool available :)
      • Protobufs are very obscure, at least in our department, and quite opinionated.
      • Pydantic models have similar limitations to dbt schemas, except they're more sharable.
    • In general, I try to decouple systems as much as possible, so having a filetype-independent standard to base our definitions on would be preferable to my style of development.
  2. A couple of command-line tools, highly configurable by the user. In order to provide maximal configuration options, I would want the user to have the option to use their preferred configuration format (e.g. JSON, YAML, VDF, TOML, etc.)
    • Specifically, for some of these, I need to include definitions of external files that are binary in nature - model weight dumps, audio files, etc.

braingram commented 2 months ago

Thanks for the details. ASDF is pretty tied to YAML and jsonschema (draft 4) at the moment. The format relies heavily on YAML tags. If I'm understanding your potential use (which I'm not sure I am), using the ASDF schemas with non-YAML data (like JSON, TOML) would require defining something like tags for those formats. I'm not familiar with VDF; what does it stand for? The tags are used both for mapping schemas to portions of the tree and for providing information about how those portions of the tree should be deserialized by the asdf implementation. I suspect that some formats might be easier to adapt than others (for JSON, reserving $tag might be sufficient). Please let me know if I can help, and I look forward to hearing more about your use case.
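As a rough sketch of the "reserve $tag in JSON" idea, the snippet below walks a parsed JSON tree and hands any mapping carrying a `$tag` key to a converter registered for that tag. Everything here is an assumption for illustration: the tag string, the `CONVERTERS` registry, and the `convert` function are hypothetical and are not part of ASDF or the asdf Python package.

```python
import json

# Hypothetical registry: tag string -> function that builds a Python object
# from the tagged mapping's remaining keys.
CONVERTERS = {
    "tag:example.org/complex-1.0.0": lambda node: complex(node["real"], node["imag"]),
}


def convert(node):
    """Recursively replace tagged mappings with the objects their converters build."""
    if isinstance(node, dict):
        tag = node.get("$tag")
        # Convert children first so nested tagged nodes are handled bottom-up.
        children = {key: convert(value) for key, value in node.items() if key != "$tag"}
        converter = CONVERTERS.get(tag)
        # Untagged mappings (and tags with no registered converter) stay plain dicts.
        return converter(children) if converter else children
    if isinstance(node, list):
        return [convert(item) for item in node]
    return node


document = json.loads(
    '{"z": {"$tag": "tag:example.org/complex-1.0.0", "real": 1, "imag": 2}, "n": [1, 2]}'
)
tree = convert(document)
```

The same tag string could also select which schema validates that portion of the tree, which is the double role described above; this sketch covers only the deserialization half.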

jdavies-st commented 2 months ago

@acovaci ASDF was developed by astronomers to be an open exchange format for astronomical data, but it is certainly a tool you might find useful for your work. Here's an example of a package that uses ASDF to exchange welding research data:

https://github.com/BAMWelDX/weldx

People have also used the format to store audio files and associated metadata.

Much like JSON/JSON Schema and XML/XML Schema, ASDF essentially has 2 parts: (1) a defined file format that is essentially YAML with some data blocks attached, and (2) a schema which describes the specific file layout, plus validation tools to make sure the data that is read and written complies with the schema. And of course there's software for reading and writing the files and doing validation. For standard JSON/YAML types (strings, numbers, etc.) there are built-in validators in the asdf Python package, but custom types and validators can easily be defined and implemented via a plug-in architecture, as done above in weldx.

ASDF schemas are essentially YAML, so they are architecture and programming language independent, but the reference implementation of asdf is in Python. So if you're coding in Python, you can get file reading and writing with standard data types for free, and you only need to implement your schemas and any custom data validation and (de-)serialization methods. There's been work done on a C/C++ implementation of the library, but it is not feature complete.