ejp-rd-vp / resource-metadata-schema

Metadata model and schemas for the EJP virtual platform
https://ejp-rd-vp.github.io/resource-metadata-schema/
Creative Commons Zero v1.0 Universal

Building on standards #8

Closed ronaldcornet closed 5 years ago

ronaldcornet commented 5 years ago

The README says that we build on standards. I would like to take this as far as possible, using URIs wherever we can. There is currently mention of "catalog", which I interpret as "ejp_rd:catalog". I would really like to see this changed to "dcat:catalog". In other words: adopt or explain. If dcat:catalog doesn't work, then make clear why, and where we adhere to it and where we diverge from it. In MarkW's strawman demo, which really helps to provide insight, a catalog has catalog_of_registries. In dcat, a catalog has datasets.

I strongly suggest that we adopt the dcat modeling, or show why it's broken. The same goes for other elements in our current model.

simonjupp commented 5 years ago

+1

The JSON-LD context provides the mappings to standard vocabularies. In here you will see I've already mapped our use of catalog to dcat:catalog. https://github.com/ejp-rd-vp/resource-metadata-schema/blob/master/ejp_vocabulary.jsonld.

Much of the model so far is based on dcat and builds on the representation for registries used by both MarkW and in the RD connect registry.
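To illustrate what a context mapping of this kind does, here is a minimal sketch in Python. The context entries are hypothetical and are not copied from ejp_vocabulary.jsonld; only the DCAT namespace IRI is the real one, and `expand_term` is an illustrative helper, not a full JSON-LD processor.

```python
# Hypothetical context; the real ejp_vocabulary.jsonld may differ.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    # Spelled as in this thread; the DCAT class itself is dcat:Catalog.
    "catalog": "dcat:catalog",
}


def expand_term(term, context):
    """Expand a short term or prefixed name to a full IRI.

    Handles only a flat context of prefix and term strings.
    """
    value = context.get(term, term)
    if ":" in value:
        prefix, local = value.split(":", 1)
        # Only expand known prefixes; leave absolute IRIs untouched.
        if prefix in context and not local.startswith("//"):
            return context[prefix] + local
    return value


print(expand_term("catalog", CONTEXT))
print(expand_term("dcat:Dataset", CONTEXT))
```

A client that only knows DCAT can then match on the expanded IRI without knowing the short term used in our schema.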

markwilkinson commented 5 years ago

+1

I'm writing code to generate the structures in this metadata schema, and I am simultaneously adding the additional facets that will make these compliant with DCAT.

For example, Catalog->about currently points to an array of CODE objects. In my "implementation", I am going to follow DCAT, and connect the Catalog to a SKOS ConceptScheme using the themeTaxonomy predicate. I will ALSO model the 'about' predicate pointing to each of the terms, so that the final output is compatible with both this proposed schema, as well as with DCAT.

As soon as it's ready for inspection, I'll send you a heads-up!
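The dual modelling described above could look roughly like the sketch below. This is not the actual generator code: the shape of a CODE object, the field names `code` and `scheme`, and all example.org identifiers are assumptions made for illustration.

```python
import json


def build_catalog(catalog_id, scheme_iri, codes):
    """Return a JSON-LD-like dict carrying both 'about' and dcat:themeTaxonomy."""
    return {
        "@id": catalog_id,
        "@type": "dcat:Catalog",
        # DCAT view: the catalog points at the concept scheme as a whole.
        "dcat:themeTaxonomy": {"@id": scheme_iri, "@type": "skos:ConceptScheme"},
        # Proposed-schema view: 'about' points at each individual term.
        # The CODE object layout here is hypothetical.
        "about": [{"code": code, "scheme": scheme_iri} for code in codes],
    }


catalog = build_catalog(
    "http://example.org/catalog/1",
    "http://example.org/scheme/diseases",
    ["EX:0001", "EX:0002"],
)
print(json.dumps(catalog, indent=2))
```

The point of emitting both is that a DCAT-only client and a client of the proposed schema can read the same record.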

simonjupp commented 5 years ago

We need a way to express, at the registry level, what disease a registry is about. One of the main use cases I've seen is to be able to query the VP for registries based on disease name or code, so we should standardise how this is expressed at the registry level. I could probably be more explicit about the intended use of about and rename it to is_about_disease? The range is a CODE block that is intended as a generic way to capture many different types of coding systems; it would allow SKOS- or OWL-based vocabularies as well as other types of coding systems.

I think adding themeTaxonomy as an additional way to tag the registry with themes is fine, but we should not conflate the intended use of the two predicates. What do you think?

rajaram5 commented 5 years ago

+1 It makes sense to add the is_about_disease predicate to the registry resource, as long as we also add the values of the is_about_disease predicate to the dcat:themeTaxonomy predicate. That way, clients that understand only dcat can also find registries based on diseases.

rajaram5 commented 5 years ago

BTW, why are registries of type ejp:Catalog/dcat:Catalog? Why not ejp:Dataset/dcat:Dataset?

markwilkinson commented 5 years ago

@rajaram5 in my interpretation of Simon's schema (and my code), Registries are dcat:Dataset and ejp:Dataset

Catalogue is the very top level (e.g. the Catalogue of Registries).

In DCAT, a Catalog is connected to a ConceptScheme (themeTaxonomy). A Dataset (Registry) is connected to individual concepts in that scheme.

SO.... I think perhaps our interpretation of "Catalog" in Simon's schema is not the same as his interpretation! :-)

markwilkinson commented 5 years ago

Oooohhhh!! I think I see what Simon intended now! He intended there to be two specialized types of catalog! One for registries and one for biobanks!

I had been interpreting Catalog as being like the CoR, and then the Registry/Biobank to be "a member of" that Catalogue; but I think his schema doesn't have a "thing" like the CoR in it, now that I interpret the diagram a different way.

rajaram5 commented 5 years ago

> Oooohhhh!! I think I see what Simon intended now! He intended there to be two specialized types of catalog! One for registries and one for biobanks!
>
> I had been interpreting Catalog as being like the CoR, and then the Registry/Biobank to be "a member of" that Catalogue; but I think his schema doesn't have a "thing" like the CoR in it, now that I interpret the diagram a different way.

@markwilkinson this is also my interpretation of @simonjupp's schema :-). So don't you agree it is a good idea to introduce the CoR, with Registry/Biobank as a member?

Orphanet commented 5 years ago

About disease: yes, the first use case of all is to retrieve datasets (registries or biobanks) from any "catalog" by disease. is_about_disease could reflect two different search situations. One concerns the disease links known for a resource in a specific catalog (the ones the resource has been annotated with); the other is that it will be useful to search resources at a different level in a disease classification scheme. For instance, if a resource is_about_disease: "Fabry", it is also related to the "lysosomal diseases" group. I think we need to keep is_about_disease for direct linking and, if needed, add another relation for "inferred diseases" (if it's relevant to keep this information at the data level), or just leave it under the hood of the search functionality. (We will show some examples on Monday.)

simonjupp commented 5 years ago

Ahh, so I had been thinking about a registry as a special type of catalog, not a dataset. Ultimately, in the domain model we will talk about a PatientRegistry or Biobank, as this is the terminology of the domain. I'll work on updating the model to be a bit more explicit and will add a CatalogOfRegistries too.

I'm happy for you to tell me how you think the model maps to DCAT, but the model itself should be based on our understanding of the domain and the requirements of the VP, not on how DCAT looks. Mark is building this translation layer through his implementation of the model.

simonjupp commented 5 years ago

> About disease: yes, the first use case of all is to retrieve datasets (registries or biobanks) from any "catalog" by disease. is_about_disease could reflect two different search situations. One concerns the disease links known for a resource in a specific catalog (the ones the resource has been annotated with); the other is that it will be useful to search resources at a different level in a disease classification scheme. For instance, if a resource is_about_disease: "Fabry", it is also related to the "lysosomal diseases" group. I think we need to keep is_about_disease for direct linking and, if needed, add another relation for "inferred diseases" (if it's relevant to keep this information at the data level), or just leave it under the hood of the search functionality. (We will show some examples on Monday.)

This kind of inference should absolutely happen in the search layer.
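A minimal sketch of what inference in the search layer could look like, assuming a toy child-to-parent disease hierarchy. The registry names and the `PARENT` and `REGISTRIES` tables are made up for illustration; only the Fabry / lysosomal diseases pair comes from the comment above. The stored is_about_disease annotations stay direct; the hierarchy is consulted only at query time.

```python
# Toy child -> parent hierarchy (hypothetical; a real one would come from a
# disease classification).
PARENT = {"Fabry": "lysosomal diseases"}

# Direct is_about_disease annotations per registry (hypothetical data).
REGISTRIES = {
    "registry-a": {"Fabry"},
    "registry-b": {"some other disease"},
}


def ancestors(term, parent):
    """Yield the term itself, then each of its ancestors up to the root."""
    seen = set()
    while term is not None and term not in seen:
        seen.add(term)
        yield term
        term = parent.get(term)


def search(query, registries, parent):
    """Return registries annotated with the query term or any descendant of it."""
    return sorted(
        name
        for name, diseases in registries.items()
        if any(query in ancestors(disease, parent) for disease in diseases)
    )


print(search("Fabry", REGISTRIES, PARENT))
print(search("lysosomal diseases", REGISTRIES, PARENT))
```

Both queries find registry-a, even though it is annotated only with "Fabry".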

ronaldcornet commented 5 years ago

Regarding is_about_disease:

I fail to understand the rationale for an is_about_disease predicate. What is the difference between:

What is wrong with / needed beyond dcat:themeTaxonomy ORDO:Fabry ?

Regarding registry to catalog/dataset mapping: I consider a registry to be an organization that holds one or more registers. So, say, the BOND registry can have a dataset on the incidence of rare bone diseases in all of Europe. Further, it can have a dataset on treatment modalities for bone diseases in a subset of EU countries. This one registry will then have more than one dataset. For that reason, I would consider the registry to have a catalog rather than be a catalog, but either way it would be mapped to catalog, with possibly multiple underlying datasets.
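The "has a catalog" reading can be sketched as a small data structure. The class and field names below are hypothetical and are not part of the schema; the two dataset titles paraphrase the BOND example above.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    title: str


@dataclass
class Catalog:
    datasets: list = field(default_factory=list)


@dataclass
class Registry:
    """A registry as an organization that HAS a catalog, rather than IS one."""
    name: str
    catalog: Catalog = field(default_factory=Catalog)


bond = Registry("BOND")
bond.catalog.datasets.append(Dataset("Incidence of rare bone diseases in Europe"))
bond.catalog.datasets.append(Dataset("Treatment modalities for bone diseases"))

print(len(bond.catalog.datasets))
```

Under the alternative "is a catalog" reading, `Registry` would carry `datasets` directly; the mapping to a DCAT catalog with multiple datasets is the same either way.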

simonjupp commented 5 years ago

The range of dcat:themeTaxonomy is a SKOS concept scheme that states that entries in the catalog may use concepts from a particular concept scheme. This doesn't satisfy our requirement to capture what a registry is about.

simonjupp commented 5 years ago

dcat:theme could work so long as we are willing to map a registry to dcat:dataset.

ronaldcornet commented 5 years ago

Thanks, Simon. I was just trying to get my head around this. So: dcat:themeTaxonomy of a catalog specifies the vocabulary (Knowledge Organization System) to be used for the dcat:theme of the constituent datasets in the catalog.

I can still imagine a registry having multiple datasets, and so consider it to be (mapped to) a catalog. Then the question is: do we need to "scope" the catalog? In my example above, for BOND we wouldn't specify the scope, but each dataset would have a dcat:theme.
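The relation between the two predicates can be expressed as a consistency check one might run over a catalog: every dcat:theme on a dataset should come from a scheme listed under the catalog's dcat:themeTaxonomy. This is a sketch of a convention, not a validation rule quoted from the DCAT specification; the dict layout, the `ex:` concepts, and the example.org IRI are assumptions.

```python
def themes_outside_taxonomy(catalog, concept_schemes):
    """Map dataset id -> themes that are not in any of the catalog's schemes."""
    allowed = set()
    for scheme_iri in catalog["themeTaxonomy"]:
        allowed |= concept_schemes.get(scheme_iri, set())

    report = {}
    for dataset in catalog["datasets"]:
        stray = sorted(set(dataset["theme"]) - allowed)
        if stray:
            report[dataset["id"]] = stray
    return report


# Hypothetical concept scheme and catalog.
SCHEMES = {
    "http://example.org/scheme/diseases": {"ex:Fabry", "ex:LysosomalDisease"},
}
CATALOG = {
    "themeTaxonomy": ["http://example.org/scheme/diseases"],
    "datasets": [
        {"id": "dataset-1", "theme": ["ex:Fabry"]},
        {"id": "dataset-2", "theme": ["ex:Unlisted"]},
    ],
}

print(themes_outside_taxonomy(CATALOG, SCHEMES))
```

A catalog left unscoped, as in the BOND example, would simply list no themeTaxonomy, in which case this particular check would flag every theme and would need to be skipped.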

simonjupp commented 5 years ago

@ronaldcornet I think the original issue has been addressed and that we are all agreed on building on existing standards where appropriate. If there are specific points to discuss in the schema, let's create new dedicated issues to track these.