MrHedmad / data-myr

A way to locally manage data in a FAIR way
MIT License

Reusing existing standards in the specification #1

Closed cmdoret closed 1 year ago

cmdoret commented 1 year ago

Hi @MrHedmad,

I really like the idea of myr and it looks promising. My experience with ro-crate has been similar in the past. I had a few questions / suggestions regarding the specification.

The contents of "specification" in metadata.json look a lot like jsonschema. Is there any reason not to reuse [a subset of] jsonschema directly? This would allow reusing existing validators. From your example, I think the jsonschema may look something like this. Do you think it is too verbose / obscure for users?

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "definitions": {
    "person": {
      "type": "object",
      "properties": {
        "name": {
          "type": "string",
          "description": "The name of a real person."
        },
        "ORCID": {
          "type": "string",
          "pattern": "^\\d{4}-\\d{4}-\\d{4}-\\d{3}[\\dX]$",
          "description": "An ORCID id (the final check character may be X)."
        },
        "email": {
          "type": "string",
          "format": "email",
          "description": "An e-mail address."
        },
        "type": {
          "type": "string",
          "enum": ["person"]
        }
      },
      "required": ["name", "type"]
    },
    "file": {
      "type": "object",
      "properties": {
        "path": {
          "type": "string",
          "description": "The path to the file."
        },
        "MIME_type": {
          "type": "string",
          "description": "The MIME type of the file."
        },
        "author": {
          "$ref": "#/definitions/person",
          "description": "The author of the file."
        },
        "date": {
          "type": "string",
          "format": "date",
          "description": "The date the file was created."
        },
        "type": {
          "type": "string",
          "enum": ["file"]
        }
      },
      "required": ["path", "MIME_type", "type"]
    }
  }
}

The concept of remote keys being expanded in frozen copies is really nice. It seems this can also be done with jsonschema using pointers but I've never tried it.
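As a rough illustration of the pointer idea, here is a minimal Python sketch that inlines local `$ref` pointers such as `#/definitions/person`, producing a self-contained copy of a schema. It is not how myr or any jsonschema library does it: the function names `resolve_pointer` and `expand_refs` are hypothetical, only pointers within the same document are handled (no remote fetching), and layering sibling keys such as `description` over the referenced definition is a choice made for this sketch, not draft-07 behaviour.

```python
import copy


def resolve_pointer(document, pointer):
    """Follow a local JSON pointer such as '#/definitions/person'."""
    if not pointer.startswith("#/"):
        raise ValueError(f"only local pointers are handled in this sketch: {pointer}")
    node = document
    for token in pointer[2:].split("/"):
        # JSON Pointer escapes (RFC 6901): '~1' means '/', '~0' means '~'.
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[token]
    return node


def expand_refs(node, document):
    """Return a copy of `node` with every local $ref replaced by its target.

    Assumes referenced targets are JSON objects and that there are no
    circular references (a cycle would recurse forever).
    """
    if isinstance(node, dict):
        if "$ref" in node:
            target = copy.deepcopy(resolve_pointer(document, node["$ref"]))
            # Sketch-specific choice: keep sibling keys next to the inlined target.
            target.update({k: v for k, v in node.items() if k != "$ref"})
            return expand_refs(target, document)
        return {key: expand_refs(value, document) for key, value in node.items()}
    if isinstance(node, list):
        return [expand_refs(item, document) for item in node]
    return node


schema = {
    "definitions": {
        "person": {"type": "object", "properties": {"name": {"type": "string"}}},
        "file": {
            "type": "object",
            "properties": {
                "author": {
                    "$ref": "#/definitions/person",
                    "description": "The author of the file.",
                }
            },
        },
    }
}

frozen = expand_refs(schema, schema)
author = frozen["definitions"]["file"]["properties"]["author"]
```

In the frozen copy, `author` carries the full `person` definition instead of a pointer, while the original `schema` is left untouched.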

Alternatively, since you mention schema.org, what do you think of json-ld / rdf? In the README, you mentioned

> Then, once (and if) a global standard is defined, you can migrate your data to that standard (in some way).

RDF can help with this and there are even tools like SSOM that make this easier. RDF / json-ld is a bit hard to understand / write and so may not be a good solution for data entry but might be interesting as an export format for frozen bundles? Not sure if that would make sense.

Maybe all that is too complex / out of scope, just wanted to hear your thoughts on it :)

MrHedmad commented 1 year ago

Hello! Thank you so much for taking the time to open an issue.

I think that reusing standards is a must, as I mention in the README. I hadn't known about jsonschema until now (whoops!). It looks really promising, and as you point out, it would be wise to reuse it, and I broadly think it would be a good change.

However, I want a specification that could be read and written by anyone (with a touch of training, of course), especially people who do not work outside of Google Docs, Word and occasionally PRISM (like the wet-lab researchers in my group). This is why RO-Crate, json-ld and similar specs did not satisfy me. While jsonschema is a bit easier, I'm still a bit afraid that it is too complex. The pros I see would be:

But I also see some cons:

This said, it could be a nice format to adhere to, but we would then have to have an extra step in our metadata workflow:

The extra abstraction step would make the actual implementation (like the jsonschema) irrelevant for the researcher, so it allows more freedom machine-wise. But it would need quite a bit of extra programming to make.

I hope I was clear enough. Let me know what you think!

cmdoret commented 1 year ago

First, to answer your questions about jsonschema:

> I don't see a way of reusing keys in different definitions. Is there a way to do this?

I think you would do this by defining the shared key once under definitions ($defs in newer drafts) and pointing to it with $ref.

> The ability to add more keys to the bundle than the ones specified by the structure is a must for me. Would that be possible with json schema?

I believe additional properties are allowed by default (but can be disabled if needed, by setting additionalProperties to false).

> Would jsonschema be easily "summed"?

You can refer to objects in different schemas using $ref, but it is not as readable / explicit as the myr way.

That said, I agree with you that jsonschema is not a desirable input format. What you are describing in your metadata workflow, combined with your initial statement "I want a specification that could be read and written by anyone (with a touch of training, of course), especially people who do not work outside of Google Docs, Word and occasionally PRISM" strongly reminds me of linkml.

While linkml is doing a lot more than what is actually needed here (auto-generating docs, code, schemas in multiple languages), their input format seems to match your requirements. Below is an example based on one of their tutorials. Their data model is IMO expressive enough, yet it can be read by anyone. It is not too verbose, so pretty comfortable to write. The tooling is a bit overwhelming (it does many things), but it can actually generate jsonschema from this yaml file, so it is very similar to what you were describing.

One thing that may seem overkill / clunky from the outside is that they use the notion of prefixes. Basically they use the linked data paradigm where everything has to be a URI. This means that if your schema's default URI is https://example.org and it defines a property name, this property should be referred to globally as https://example.org/name (if you alias the URI to example, this would be shortened to example:name).

Using URIs adds a bit of noise, but it is kind of necessary to enforce the F1 principle of FAIR.
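To make the prefix mechanism concrete, here is a small Python sketch of expanding a shortened name like example:name into its full URI and back. This is not linkml's API: the `PREFIXES` table and the functions `expand_curie` and `contract_uri` are hypothetical names for illustration, and the prefix is assumed to end with a trailing slash so that simple concatenation works.

```python
# Hypothetical prefix table: alias -> base URI (with a trailing slash).
PREFIXES = {"example": "https://example.org/"}


def expand_curie(curie, prefixes):
    """Turn 'example:name' into 'https://example.org/name'."""
    prefix, sep, local_name = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"unknown or missing prefix in {curie!r}")
    return prefixes[prefix] + local_name


def contract_uri(uri, prefixes):
    """Turn a full URI back into its short form, if a known prefix matches."""
    for prefix, base in prefixes.items():
        if uri.startswith(base):
            return f"{prefix}:{uri[len(base):]}"
    return uri  # no known prefix: leave the URI as it is


full = expand_curie("example:name", PREFIXES)
short = contract_uri(full, PREFIXES)
```

Here `full` is the globally unique identifier and `short` is the compact form a person would actually type.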

What do you think?

id: https://w3id.org/linkml/linkml-tutorial
name: linkml-tutorial

prefixes:
  linkml_tutorial: https://w3id.org/linkml/linkml-tutorial/

default_prefix: linkml_tutorial
default_range: string

classes:
  Person:
    slots:
      - id
      - name
      - age_in_years
      - birth date
      - pets

  Animal:
    slots:
      - id
      - name
      - species
      - age_in_years
      - birth date

slots:
  id:
    required: true
    range: uriorcurie
    description: A unique identifier for a person
  name:
    description: A human-readable name for a person
  birth date:
    range: date
    description: Date on which a person is born
  age_in_years:
    range: integer
    description: Number of years since birth
  pets:
    description: a collection of animals that lives with and is taken care of by a person.
    multivalued: true
    range: Animal

  species:
    description: The species of an animal

MrHedmad commented 1 year ago

Hey there!

Thanks for clarifying the points on jsonschema. I now believe that, as a way to specify the frozen bundle, jsonschema is probably the way to go. Thanks for bringing up linkml too. It really looks very similar to what I had in mind for myr, and the specifications could be "summed".

It could be viable to have something like this:

This implies the following:

  1. It would be better for myr to be written in Python (and if this is to become an open standard, I think it's better regardless) so we can use linkml as a lib. This is a good thing.
  2. It's better for myr-metadata to be YAML and not JSON, so we align better with linkml, and it's easier to write. Plus, with the YAML easy-to-use tag/reference feature we get relative keys (the keys starting with >) basically for free! This is also a good thing.
  3. Using jsonschema we can leverage validators that already exist, freeing us from writing our own validator.
  4. We would need to import and work with a (rather bulky) library. As you said, linkml does a lot of things, most of which we don't care about. Wouldn't it be better to just write our own tiny parser?
  5. The bioinformatician can probably write a jsonschema by themselves, without the need for an extra step. If we found a way to merge jsonschemas together (which I guess is pretty simple, as each type is encapsulated), we could probably skip the extra step.
  6. If we were to implement a front-end where, given a myr-metadata.yaml specification, the user can just fill in the metadata, the back-end/configuration would be much more flexible, as we could make it more complex without impacting usability (much). RO-Crate has something like this (Describo), but using it is really confusing (as RO-Crate is so large).
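The merging idea from point 5 can be sketched in a few lines of Python, under the assumption that each schema keeps its types under a top-level "definitions" key, as in the draft-07 example earlier in this thread. The function name `merge_definitions` is hypothetical, and refusing to merge two different definitions with the same name is one possible policy, not something settled in this discussion.

```python
import copy


def merge_definitions(*schemas, key="definitions"):
    """Combine the type definitions of several schemas into one schema.

    Identical duplicates are accepted; two different definitions sharing
    a name raise an error instead of silently overwriting each other.
    """
    merged = {}
    for schema in schemas:
        for name, definition in schema.get(key, {}).items():
            if name in merged and merged[name] != definition:
                raise ValueError(f"conflicting definitions for type {name!r}")
            merged[name] = copy.deepcopy(definition)
    return {"$schema": "http://json-schema.org/draft-07/schema#", key: merged}


people = {"definitions": {"person": {"type": "object"}}}
files = {
    "definitions": {
        "file": {
            "type": "object",
            "properties": {"author": {"$ref": "#/definitions/person"}},
        }
    }
}

combined = merge_definitions(people, files)
```

Because each type is encapsulated under its own name, local pointers like `#/definitions/person` keep working in the combined schema without being rewritten.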

However, it's important to keep in mind that myr does not want to be a new way to write metadata forever: it's a band-aid / stepping stone for people who don't do FAIR to start using it without a lot of fuss, so we must be careful not to overshoot. This applies especially to points 5 (what if there is no bioinformatician on your team?) and 6, but also to point 3 (as we already pointed out, jsonschema is also bulky). We would need to strike a balance, from very flexible but unstructured (like myr is now), to something a bit more structured (like linkml), to something strict but that can be well validated (like jsonschema).

I think that we (I say we very loosely) can start implementing:

The workflow, per these changes, would be as follows:

Let me know what you think!

P.S.: One more thing: I think the F1 principle for every key does not apply to myr, since the description of the key is already next to the key. Keep in mind the "stepping stone" point from above. I guess that myr bundles would not be perfectly findable (at the level of the keys in the bundle), but... baby steps?

MrHedmad commented 1 year ago

Hello @cmdoret! I'm closing this issue because I've taken a look at JSON Schema, and while it would be a very nice addition, I feel like the "addition" of different specifications would be too hard using it as of now. I'm writing the actual implementation of the tool, so I'm sticking with the initial idea. If it turns out to be a valid change, we can always implement it later on.

I've also moved the specification and its discussion to https://github.com/mrhedmad/data-myr-spec instead of this repo, to keep the tool separate from the specification.

cmdoret commented 1 year ago

Hey @MrHedmad, sorry I forgot to answer before, but that makes sense! I'll keep an eye on myr! :)