RDA-DMP-Common / RDA-DMP-Common-Standard

Official outputs from the RDA DMP Common Standards WG
The Unlicense

Add specification for how to extend the schema #27

Open briri opened 4 years ago

briri commented 4 years ago

We are currently converting our API over to use this Common Standard metadata schema. We have a few scenarios where we also need to convey information that is required by our system but outside the scope of this schema.

It would be good if the schema provided guidance on how best to include this type of information, so that systems adopting the Common Standard schema follow similar patterns.

For example, the DMPTool API requires that a DMP template identifier be specified along with some other information specific to the caller's system (called 'abc' below) when creating a new DMP.

We will be using the following structure to accomplish this:

{
  "dmp": {
    "title": "My new DMP",
    ...
    // the rest of the common standard attributes
    ...
    "extended_attributes": [
      { "dmptool": { "template_id": "123" } },
      { "abc": { "reserve_id": { "type": "doi", "identifier": "https://dx.doi.org/10.9999/999xyz" } } }
    ]
  }
}
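A sketch of how a consuming system might read such a payload (the field names come from the example above; folding the list of single-key objects into one lookup is just one possible approach):

```python
import json

payload = json.loads("""
{
  "dmp": {
    "title": "My new DMP",
    "extended_attributes": [
      {"dmptool": {"template_id": "123"}},
      {"abc": {"reserve_id": {"type": "doi",
               "identifier": "https://dx.doi.org/10.9999/999xyz"}}}
    ]
  }
}
""")

# Fold the list of single-key objects into one {system: attributes} dict,
# so each caller's extension data can be looked up by system name.
extensions = {}
for entry in payload["dmp"]["extended_attributes"]:
    extensions.update(entry)

print(extensions["dmptool"]["template_id"])  # 123
```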

Apologies if this has already been discussed and I just missed it in the documentation somewhere.

hmpf commented 4 years ago

Would it be relevant to have extensions elsewhere than the top level as well? Like, extra stuff for host/distribution for instance.

briri commented 4 years ago

I can see value in allowing for extensions at the dataset, distribution and host levels (perhaps project as well). For us (so far) the use case for extensions has been the import (creation) of a DMP via our API.

We ended up using the following during the hackathon:

{
  "dmp": {
    "extension": [
      {
        "dmptool": {
          "template": {
            "id": 946,
            "title": "Environmental Resilience Institute Data Management Plan"
          }
        }
      }
    ]
  }
}

Related issue: https://github.com/RDA-DMP-Common/hackathon-2020/issues/3

TomMiksa commented 3 years ago

Do you have more examples of extensions needed? This could help us find the best strategy for including them.

What about doing it in a slightly different way: use a field within the dmp section to declare extensions. This would indicate up front which specific extensions are used, and hence which additional fields are to be expected. Each extension would be identified by a URL to a JSON schema. For example:

{
  "dmp": {
    ...
    "extensions": [
      "http://json-schema.org/dmptool",
      "http://json-schema.org/funderX"
    ],
    ...
    "dataset": [
      {
        "title": "My Dataset",
        "dmptool-specific-field": "generated by DMPTool",
        ...
      }
    ]
  }
}
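A minimal sketch of how a consumer might use such a declaration, assuming a hypothetical convention where extension-specific fields are prefixed with the extension's name (the schema URLs and field names are taken from the example above; the convention itself is an assumption, not part of the standard):

```python
import json

# Fields defined by the common standard itself (illustrative subset).
STANDARD_FIELDS = {"title", "extensions", "dataset"}

def undeclared_extension_fields(dmp: dict) -> list:
    """Return field names that look extension-specific but whose
    prefix does not match any declared extension."""
    # Assume each extension URL ends in the extension's name.
    declared = {url.rsplit("/", 1)[-1] for url in dmp.get("extensions", [])}
    problems = []

    def walk(obj):
        if isinstance(obj, dict):
            for key, value in obj.items():
                prefix = key.split("-", 1)[0]
                if key not in STANDARD_FIELDS and "-" in key and prefix not in declared:
                    problems.append(key)
                walk(value)
        elif isinstance(obj, list):
            for item in obj:
                walk(item)

    walk(dmp)
    return problems

doc = json.loads("""
{
  "dmp": {
    "extensions": ["http://json-schema.org/dmptool"],
    "dataset": [
      {"title": "My Dataset",
       "dmptool-specific-field": "generated by DMPTool",
       "funderX-grant-code": "G-42"}
    ]
  }
}
""")
print(undeclared_extension_fields(doc["dmp"]))  # -> ['funderX-grant-code']
```

A real implementation would presumably fetch each declared schema URL and validate properly; this only shows the declare-then-check idea.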

briri commented 3 years ago

I think that could be a useful approach.

We are currently working through an integration that is using the common standard as the method of communication. We are still in the early stages of the project, though, and have not finished defining what additional information we would like to pass along. Much of the information is at the project/dmp level.

cpina commented 3 years ago

I'm new here - sorry if I misinterpreted something in this issue. I was on the call earlier and thought I'd add some of my thoughts here.

{
  "dmp": {
    ...
    "extensions": [
      "http://json-schema.org/dmptool",
      "http://json-schema.org/funderX"
    ],
    ...
    "dataset": [
      {
        "title": "My Dataset",
        "dmptool-specific-field": "generated by DMPTool",
        ...
      }
    ]
  }
}

I like it. In the Frictionless Data community we had a similar discussion: https://github.com/frictionlessdata/specs/issues/663

In that case we were looking at adding specific fields. E.g. at the Swiss Polar Institute we are prefixing them with x_spi_: https://github.com/Swiss-Polar-Institute/frictionless-data-packages/blob/master/10.5281_zenodo.2616605/datapackage.json#L146 This makes clear that these fields are extensions from SPI (the approach in this issue also makes that clear).

Only one possible (hypothetical) problem with the current suggestion: what if two institutions come up with two extensions with the same name, and some of their fields also share names? I can think of two possible solutions:

froggypaule commented 3 years ago

hello ... also following this morning's call. Thanks to @cpina : this is the reservation I was trying to convey at the call:

  1. If fields coming from two different extensions share the same name and the same meaning, then all is well: they are simply mapped one onto the other.
  2. If fields coming from two different extensions share the same name BUT not the same meaning, then the solution proposed by @cpina would work.
  3. If fields coming from two different extensions do NOT share the same name but share the same meaning: again, a mapping would do the trick.
  4. If fields coming from two different extensions do NOT share the same name NOR the same meaning, then all is well also.

Sorry if I misunderstand the question.

cpina commented 3 years ago

> hello ... also following this morning's call. Thanks to @cpina : this is the reservation I was trying to convey at the call:
>
> 1. If fields coming from two different extensions share the same name and the same meaning, then all is well: they are simply mapped one onto the other.
>
> 2. If fields coming from two different extensions share the same name BUT not the same meaning, then the solution proposed by @cpina would work.
>
> 3. If fields coming from two different extensions do NOT share the same name but share the same meaning: again, a mapping would do the trick.
>
> 4. If fields coming from two different extensions do NOT share the same name NOR the same meaning, then all is well also.
>
> Sorry if I misunderstand the question.

This is a perfect summary, thanks!

My thoughts are: should we make case 2 work (two different extensions share the same name and a field name, but not the same meaning)? If this is a concern and should work: what's the best way to go (a "name" or a "prefix")?
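The case-2 concern can be sketched like this: two extensions both define a field called "template" with different meanings, and a per-extension prefix keeps them apart (all names here are hypothetical illustrations, not part of the standard):

```python
# Two hypothetical extensions both define a "template" field with
# different meanings; a per-extension prefix disambiguates them.
record = {
    "dmptool_template": {"id": 946},        # a tool's template reference
    "funderx_template": "proposal-form-B",  # a funder's document template
}

def extension_field(record: dict, extension: str, field: str):
    """Look up `field` as defined by a specific extension."""
    return record.get(f"{extension}_{field}")

print(extension_field(record, "dmptool", "template"))  # {'id': 946}
print(extension_field(record, "funderx", "template"))  # proposal-form-B
```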

briri commented 3 years ago

We are going to begin work on the schema extensions for DMPRoadmap in late March or early April.

We will plan to follow the pattern described by @cpina and @froggypaule above, using a tool/codebase-specific prefix like `dmproadmap-[x]`.

Any early suggestions or feedback (once we start work on it) would be welcome. :)

froggypaule commented 3 years ago

Hello! A quick one: why the name 'dmproadmap'? I am asking because DMPRoadmap is the code base common to DMPTool and DMPonline. Is the name intentional?

briri commented 3 years ago

Yes. Any changes we'd be making would benefit the entire codebase (DMPTool, DMPonline, DMPOPIDoR, DMPAssistant, etc.).

For example, the DMPRoadmap system is driven in part by specific templates (e.g. Horizon2020, NSF, USGS, etc.). We have an API endpoint that allows users to create a DMP by passing in this metadata standard. To help facilitate the use of specific templates, we would add a dmproadmap_template_id or something similar to convey that information to the system.

froggypaule commented 3 years ago

ok thanks.... I was just commenting :)

paulwalk commented 3 years ago

Hi - I've been reading this thread, and I'm concerned that the consensus seems to be to invent a mechanism for handling namespaces in JSON.

I would strongly recommend not doing this.

At the start of this work, we decided to limit our focus and ambition with the standard, so that it was developed and managed as an information exchange format. More formally, it could be described as a metadata application profile. However, the interest in this work has grown and, as such, we are now faced with a decision. Do we accept that there is demand for a more expansive standard - essentially an ontology within which new concepts can be added? Or do we continue to limit our scope, while recognising that there is demand to include extra information in, or alongside, the information exchange?

As I understand it, there are two viable options available to us:

Option 1: Widen our scope, and become an ontology

It could be argued that this is inevitable. In any case, there is already work underway to formally describe the standard as an OWL ontology, so there does appear to be demand for this. If this is the direction of travel for the DMP Common Standard, then I would recommend that we act sooner rather than later, and move from supporting plain JSON to supporting JSON-LD.

Pros:

Cons:

Option 2: Continue as before, with a new section for arbitrary extensions

We had certainly been considering how to handle extensions from the beginning of this work, and this was our original idea. With this approach, the scope of the DMP Common Standard is unchanged, but a place is added for third-parties to add arbitrary data. With this approach, the DMP Common Standard has nothing to say about how these extensions are encoded. If implementers add extensions which cause name collisions, then they will need to sort this out (typically by agreeing conventions).

Pros:

Cons:

My recommendation:

  1. Absolutely do not invent a new mechanism for name-spacing JSON properties as part of the DMP Common Standard
  2. Consider the implications of moving to JSON-LD. In many cases, it may simply involve adding a context to the JSON, and changing to a JSON-LD software library for parsing. However, there may be other issues for the software that has implemented the standard. It would be good to find out - how disruptive actually is this?
  3. If not moving to JSON-LD, then define the place for extensions (as already suggested above) and then say no more. Make it clear that all further definition is out of scope for this standard. However, we could consider providing a place for implementers to document "community conventions" for using these.

Of these two options, I think that the JSON-LD option is the more future-proof at this point.

froggypaule commented 3 years ago

Thanks @paulwalk for clarifying this: having come to the CS quite late, this helps a lot. And yes, I agree with you on JSON-LD and option 1 (not that I am particularly versed in these matters...)

cpina commented 3 years ago

Thanks @paulwalk. Sadly I'm not extremely familiar with JSON-LD and need to do some refreshing on it. I 100% agree we should avoid reinventing the wheel. If any of the ideas in my suggestion already exist in a standard, I would say go with the standard unless there is a very good reason specific to this use case.

fekaputra commented 3 years ago

Hi @paulwalk, in case it is decided that the community will go with the first option, we (mainly me, @JoaoMFCardoso, @ljgarcia and Marie-Christine) have been working on the ontology version of the DMP Common Standard (DMP Common Standard Ontology - DCSO), which is already committed as a part of this repository (https://github.com/RDA-DMP-Common/RDA-DMP-Common-Standard/tree/master/ontologies). This was a result of the DCS hackathon last year.

The goal of the ontology is to have a 1-to-1 mapping to the current DCS, to ensure the compatibility between the DCSO and the original DCS standard.

We will be very happy to discuss the ontology development (which you can later serialise as JSON-LD) to include the latest changes since the hackathon if you wish.

As a note, we are currently working on an (invited) journal paper to showcase the DCSO and its features. So in case that the community decided to go with the JSON-LD, we can also report this development in the paper as well.

MarekSuchanek commented 3 years ago

Hi, I would vote for JSON-LD way.

@paulwalk It should be possible to remain backwards compatible (when someone ignores @context, @type, etc., the structure can be done in the same way as is now), right? Question is if that is a good idea or it would be better to work directly on some redesign (again, sooner rather than later)...

One might also ask why JSON-LD and not directly RDF.

paulwalk commented 3 years ago

> @paulwalk It should be possible to remain backwards compatible (when someone ignores @context, @type, etc., the structure can be done in the same way as is now), right? Question is if that is a good idea or it would be better to work directly on some redesign (again, sooner rather than later)...

I think it would remain backwards-compatible for people parsing the document as JSON rather than JSON-LD. As far as I can see, the main thing that would be lost would be the namespace URI mapping - but the namespace prefixes would still be in the JSON.

> One might also ask why JSON-LD and not directly RDF.

This is really just about tooling. The DMP system APIs are already handling JSON. Developers mostly prefer it to RDF because they get native programming language support etc. JSON-LD seems to hit the "sweet-spot" for many.
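A small sketch of the backwards-compatibility point: a consumer that parses the document with a plain JSON library simply sees the JSON-LD keywords as ordinary keys (the context URL below is a made-up placeholder, not a real published context):

```python
import json

# A JSON-LD-flavoured DMP document; "@context" and "@type" are
# JSON-LD keywords, but to a plain JSON parser they are just keys.
doc = """
{
  "@context": "https://example.org/dcs-context.jsonld",
  "@type": "dmp",
  "dmp": {"title": "My new DMP"}
}
"""
parsed = json.loads(doc)

# A plain JSON consumer reads the same structure it always has
# and can simply ignore the JSON-LD keywords.
print(parsed["dmp"]["title"])  # My new DMP
print("@context" in parsed)    # True
```

What a plain-JSON consumer loses, as noted above, is the namespace URI mapping that a JSON-LD processor would derive from the context.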

nicolasfranck commented 1 year ago

I think the use of JSON-LD would only break usage if you decided to use a different way of expressing your attributes. JSON-LD allows for short attribute names or expanded names (name vs http://schema.org/name), compacted results or not, and allows you to express your values as regular strings, arrays of strings, arrays of hashes...

A little side note: IIIF v2 uses JSON-LD, but implementers rapidly started to realise that attribute values can be anything (a reference URL? a regular string? an array of reference URLs?). IIIF v3 therefore decided to be far more strict.

And that is probably what one should do to make other developers' lives easier. Let's not forget that most JSON parsers are just JSON parsers, and are not like XML parsers that can handle namespaces.