Closed: cmdoret closed this 1 year ago
Hello! Thank you so much for taking the time to open an issue.
I think that reusing standards is a must, as I mention in the README.
I hadn't known about `jsonschema` until now (whoops!). It looks really promising, and as you point out, it would be wise to reuse it, so I broadly think it would be a good change.
However, I want a specification that can be read and written by anyone (with a touch of training, of course), especially people who do not work outside of Google Docs, Word and occasionally PRISM (like the wet-lab researchers in my group).
This is why RO-Crate, `json-ld` and similar specs did not satisfy me. While `jsonschema` is a bit easier, I'm a bit afraid that it is too complex. The pros I see would be:
But I also see some cons:

- The current spec only needs the `required` and `valid_values` keys, so extending it to `format`, `pattern`, etc. is too much.
- `{type: {"type": "string", "enum": ["person"]}}` is redundant. This is the same concern as with RO-Crate having the entry of the metadata file itself + the entry for the bundle as a whole.
- I don't see a way of reusing keys in different `definitions`. Is there a way to do this? It might help for homonym keys, but I would like to avoid those if possible.
- The ability to add more keys to the bundle than the ones specified by the structure is a must for me. Would that be possible with `jsonschema`?
- The "merging" (e.g. `{"@specification": ["url", "url2"]}`) feature of `myr` is really nice, IMO. This way, we can create and mix/match different sets of specifications for different use-cases. E.g. we could make a "base" spec and various experiment-type specific specifications, and write `{"@specification": ["base.json", "microscopy.json"]}`. Would `jsonschema` be easily "summed"? I'm not sure... :shrug:

This said, it could be a nice format to adhere to, but we would then have to have an extra step in our metadata workflow:

1. Someone writes the "specification", in the `jsonschema` format;
2. `myr` makes an empty `.txt` (TOML? YAML?) file with all the keys / values / requirements in a human-readable format (with emojis, maybe?);
3. `myr` takes this `.txt` file and parses it, validates it against the schema, and makes the working copy / frozen copy of the bundle.

The extra abstraction step would make the actual implementation (like the `jsonschema`) irrelevant for the researcher, so it allows more freedom machine-wise. But it would need quite a bit of extra programming to make.
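For what it's worth, a toy sketch of that "schema to human-readable template" step could be quite small. All names here are hypothetical illustrations, not `myr`'s actual API:

```python
# Sketch: render a (tiny subset of a) jsonschema as a human-readable
# fill-in template, as the proposed extra abstraction step would do.
# The schema below is a made-up example.

def make_template(schema: dict) -> str:
    """Render each property as a 'key = ' line with a hint comment."""
    lines = []
    required = set(schema.get("required", []))
    for key, spec in schema.get("properties", {}).items():
        hint = spec.get("type", "any")
        if "enum" in spec:
            hint += " (one of: " + ", ".join(spec["enum"]) + ")"
        if key in required:
            hint += " [required]"
        lines.append(f"{key} =   # {hint}")
    return "\n".join(lines)

schema = {
    "required": ["name"],
    "properties": {
        "name": {"type": "string"},
        "type": {"type": "string", "enum": ["person"]},
    },
}

template = make_template(schema)
print(template)
# name =   # string [required]
# type =   # string (one of: person)
```

A real implementation would also need to round-trip the filled-in template back into data for validation, but the idea is the same.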
I hope I was clear enough. Let me know what you think!
First, to answer your questions about jsonschema:

> I don't see a way of reusing keys in different definitions. Is there a way to do this?

I think you would do this with `$defs`.

> The ability to add more keys to the bundle than the ones specified by the structure is a must for me. Would that be possible with json schema?

I believe additional properties are allowed by default (but can be disabled if needed).

> Would jsonschema be easily "summed"?

You can refer to objects in different schemas using `$ref`, but it is not as readable / explicit as the `myr` way.
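To illustrate both points, a small schema could look like this (the `$id` URL and property names are invented for the example):

```json
{
  "$id": "https://example.org/bundle.schema.json",
  "type": "object",
  "additionalProperties": true,
  "properties": {
    "author": { "$ref": "#/$defs/person" },
    "curator": { "$ref": "#/$defs/person" }
  },
  "$defs": {
    "person": {
      "type": "object",
      "required": ["name"],
      "properties": {
        "name": { "type": "string" }
      }
    }
  }
}
```

Here `person` is defined once under `$defs` and reused by two properties; a `$ref` can also point at another file or URL, which is what "summing" schemas across documents would rely on. `"additionalProperties": true` is already the default and is shown only for emphasis; setting it to `false` would forbid extra keys.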
That said, I agree with you that jsonschema is not a desirable input format. What you are describing in your metadata workflow, combined with your initial statement "I want a specification that could be read and written by anyone (with a touch of training, of course), especially people who do not work outside of Google Docs, Word and occasionally PRISM", strongly reminds me of linkml.
While linkml does a lot more than what is actually needed here (auto-generating docs, code, and schemas in multiple languages), its input format seems to match your requirements. Below is an example based on one of their tutorials. Their data model is IMO expressive enough that it can be read by anyone. It is not too verbose, so it is pretty comfortable to write. The tooling is a bit overwhelming (it does many things), but it can actually generate jsonschema from this yaml file, so it is very similar to what you were describing.
One thing that may seem overkill / clunky from the outside is that they use the notion of prefixes. Basically, they use the linked-data paradigm where everything has to be a URI. This means that if your schema's default URI is `https://example.org` and it defines a property `name`, this property should be referred to globally as `https://example.org/name` (if you alias the URI to `example`, this would be shortened to `example:name`).
Using URIs adds a bit of noise, but it is kind of necessary to enforce the F1 principle of FAIR.
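The prefix mechanics are simple enough to sketch in a few lines. The prefix map below is made up for the example:

```python
# Sketch: expanding a prefixed name (a "CURIE") into a full URI, the way
# linked-data tooling does. The prefix map is a made-up example.

PREFIXES = {"example": "https://example.org/"}

def expand(curie: str) -> str:
    """Replace a known prefix with its full URI base."""
    prefix, _, local = curie.partition(":")
    return PREFIXES[prefix] + local

print(expand("example:name"))  # https://example.org/name
```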
What do you think?
```yaml
id: https://w3id.org/linkml/linkml-tutorial
name: linkml-tutorial
prefixes:
  linkml_tutorial: https://w3id.org/linkml/linkml-tutorial/
default_prefix: linkml_tutorial
default_range: string

classes:
  Person:
    slots:
      - id
      - name
      - age_in_years
      - birth date
      - pets
  Animal:
    slots:
      - id
      - name
      - species
      - age_in_years
      - birth date

slots:
  id:
    required: true
    range: uriorcurie
    description: A unique identifier for a person
  name:
    description: A human-readable name for a person
  birth date:
    range: date
    description: Date on which a person is born
  age_in_years:
    range: integer
    description: Number of years since birth
  pets:
    description: a collection of animals that lives with and is taken care of by a person.
    multivalued: true
    range: Animal
  species:
    description: The species of an animal
```
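For reference, a data record conforming to the schema above could look like this (all values invented for illustration):

```yaml
id: linkml_tutorial:person-001
name: Ada
age_in_years: 33
birth date: 1990-04-01
pets:
  - id: linkml_tutorial:animal-042
    name: Rex
    species: dog
```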
Hey there!
Thanks for clarifying the points on jsonschema. I now believe that, as a way to specify the frozen bundle, `jsonschema` is probably the way to go.
Thanks for bringing up `linkml` too. It really looks very similar to what I had in mind for `myr`, and the specifications could be "summed".
It could be viable to have something like this:

1. Someone writes the `linkml` specification(s), saving them remotely.
2. The user writes the `myr-metadata.json` file, and adds in the `linkml` models needed.
3. `myr` handles, at runtime, converting the `linkml` models to a single `jsonschema` (using the `linkml` functions).
4. The `jsonschema` is then used to validate the user input, when it comes in.

This implies the following:
1. `myr` would have to be written in Python (and if this is to become an open standard, I think it's better regardless) so we can use `linkml` as a lib. This is a good thing.
2. `myr-metadata` would have to be `YAML` and not `JSON`, so we align better with `linkml`, and it's easier to write. Plus, with the easy-to-use `YAML` tag/reference feature we get relative keys (the keys starting with `>`) basically for free! This is also a good thing.
3. With `jsonschema` we can leverage validators that already exist, freeing us from writing our own validator.
4. `linkml` does a lot of things, most of which we don't care about. Wouldn't it be better to just write our own tiny parser?
5. Specification authors could write `jsonschema` by themselves, without the need for an extra step. If we found a way to merge jsonschemas together (which I guess is pretty simple, as each type is encapsulated), we can probably skip the extra step.
6. With a graphical tool for the `myr-metadata.yaml` specification, the user could just fill in the metadata, and the back-end/configuration would be much more flexible, as we could make it more complex without impacting (a lot of) usability. RO-Crate has something like this (Describo), but using it is really confusing (as RO-Crate is so large).

However, it's important to keep in mind that `myr` does not want to be a new way to write metadata forever: it's a band-aid / stepping stone for people who don't do FAIR to start using it without a lot of fuss, so we must be careful not to overshoot it. Especially points 5 (what if there is no bioinformatician on your team?) and 6, but also 3 (as we already pointed out, `jsonschema` is also bulky).
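On merging jsonschemas: a naive "sum" could just combine the top-level maps. A toy sketch, purely hypothetical and not `myr`'s API (real schema composition would normally use `allOf` instead):

```python
# Sketch: naively "sum" two jsonschema documents by merging their
# properties, required lists and $defs. Hypothetical helper, not myr's
# API; conflicting keys would need a real resolution strategy.

def merge_schemas(a: dict, b: dict) -> dict:
    merged = {"type": "object"}
    merged["properties"] = {**a.get("properties", {}), **b.get("properties", {})}
    merged["required"] = sorted(set(a.get("required", [])) | set(b.get("required", [])))
    merged["$defs"] = {**a.get("$defs", {}), **b.get("$defs", {})}
    return merged

base = {"properties": {"name": {"type": "string"}}, "required": ["name"]}
microscopy = {"properties": {"magnification": {"type": "integer"}}}

combined = merge_schemas(base, microscopy)
# combined now has both "name" and "magnification" properties
```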
We would need to strike a balance, from something very flexible but unshaped (like `myr` is now), to something a bit more structured (like `linkml`), to something strict but that can be validated well (like `jsonschema`).
I think that we (I say "we" very loosely) can start implementing:

1. Using `yaml` and not `json` for the working copy (but keeping `json` for the frozen bundle).
2. Moving the specification to `jsonschema`, implementing a way to "sum up" the schemas.

The workflow, per these changes, would be as follows:

1. Write the `myr-metadata.yaml` file;
2. Run `myr` to generate a (yaml) template for the data entry;
3. Run `myr` to validate and freeze the bundle, which you can then upload.

Let me know what you think!
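As a side note on the YAML tag/reference feature mentioned in point 2 above: standard YAML anchors (`&`) and aliases (`*`) already let a value be defined once and reused, e.g. (generic YAML, not `myr` syntax; the merge key `<<` is a YAML 1.1 convention supported by common parsers like PyYAML):

```yaml
defaults: &common
  organism: "Mus musculus"   # defined once...

sample_1:
  <<: *common                # ...and merged in here
  replicate: 1
sample_2:
  <<: *common
  replicate: 2
```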
P.S.: One more thing: I think the F1 principle for every key does not apply to `myr`, since the description of the key is already next to the key. Keep in mind the "stepping stone" point from above. I guess that myr bundles would not be perfectly findable (at the level of the keys in the bundle), but... baby steps?
Hello @cmdoret! I'm closing this issue because I've taken a look at JSON Schema, and while it would be a very nice addition, I feel like the "addition" of different specifications would be too hard using it as of now. I'm writing the actual implementation of the tool, so I'm sticking with the initial idea. If it turns out to be a valid change, we can always implement it later on.
I've also moved the specification and its discussion to https://github.com/mrhedmad/data-myr-spec instead of this repo, to keep the tool separate from the specification.
Hey @MrHedmad, sorry I forgot to answer before, but that makes sense! I'll keep an eye on myr! :)
Hi @MrHedmad,
I really like the idea of myr and it looks promising. My experience with ro-crate has been similar in the past. I had a few questions / suggestions regarding the specification.
The contents of "specification" in metadata.json look a lot like jsonschema. Is there any reason not to reuse [a subset of] jsonschema directly? This would allow reusing existing validators. From your example, I think the jsonschema may look something like this, do you think it is too verbose / obscure for users?
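(The schema example from this comment was not preserved; as a reconstructed sketch, based on the `{"type": "string", "enum": ["person"]}` fragment quoted in the replies above, it may have looked something like:)

```json
{
  "type": "object",
  "required": ["type", "name"],
  "properties": {
    "type": { "type": "string", "enum": ["person"] },
    "name": { "type": "string" }
  }
}
```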
The concept of remote keys being expanded in frozen copies is really nice. It seems this can also be done with jsonschema using pointers but I've never tried it.
Alternatively, since you mention schema.org, what do you think of json-ld / rdf? In the README, you mentioned
RDF can help with this, and there are even tools like SSSOM that make this easier. RDF / json-ld is a bit hard to understand / write, and so may not be a good solution for data entry, but it might be interesting as an export format for frozen bundles? Not sure if that would make sense.
Maybe all that is too complex / out of scope, just wanted to hear your thoughts on it :)