gbif / registry

GBIF Registry

GrSciColl - TDWG CD #176

Open MortenHofft opened 4 years ago

MortenHofft commented 4 years ago

This is not something we can do now, but at some point we need to decide on a model for integrating collection descriptors, and whether we have any requirements for how they are structured (or whether we transform them to the most machine-meaningful denominator).

What is TDWG CD

The TDWG collection descriptors (CD) are a way to break your collection into a list of descriptors, each saying something about a part of your collection. For illustration, it can be simplified to something like:

[
  {
    taxa: [{ name: 'Plantae', colKey: 1 }],
    metrics: {
      specimens: 1000,
      types: 10
    },
    agents: [
      {
        personId: 'orcid/123',
        role: 'CURATOR'
      }
    ],
    countryCoverage: ['ES'],
    year: [1950, 1951, 1952]
  },
  ...
]

Interestingly there is nothing in the CD proposal that holds any other information. In a way that makes it very flexible. For us that would mean that the CDs are a nice supplement to GrSciColl and not a replacement.

How can they be used

Depending on the restrictions on the format of these, it allows varying degrees of discovery and rollups.

Scenario 1: Each CD has multiple taxa, locations and time ranges.
Search: We can search by all fields, but results are less useful because the user has less confidence that the taxon in question is also present at that location.
Counts: We cannot do any meaningful counts.

Scenario 2: Each CD is only about 1 taxon (species or group), 1 location and 1 time range.
Search: Search results as before, but more informative.
Counts: We can now do counts like "how many fungi specimens in DK".
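A minimal sketch of why the two scenarios differ for counting, assuming a simplified descriptor with the hypothetical fields `taxa`, `countries` and `specimens` (not the actual TDWG CD schema; time ranges are left out for brevity). A descriptor is treated as countable only when it names exactly one taxon and one country. Mixed descriptors are skipped, because their specimen total cannot be attributed to any single taxon/country pair.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    # Hypothetical, simplified fields; not the actual TDWG CD schema.
    taxa: list[str]
    countries: list[str]
    specimens: int


def is_atomic(cd: Descriptor) -> bool:
    """Scenario 2: exactly one taxon and one country."""
    return len(cd.taxa) == 1 and len(cd.countries) == 1


def count_specimens(cds: list[Descriptor], taxon: str, country: str) -> tuple[int, int]:
    """Return (countable specimens, matching descriptors skipped as ambiguous)."""
    total = 0
    skipped = 0
    for cd in cds:
        if taxon not in cd.taxa or country not in cd.countries:
            continue  # descriptor does not mention this taxon/country at all
        if is_atomic(cd):
            total += cd.specimens
        else:
            skipped += 1  # Scenario 1: the count cannot be split reliably
    return total, skipped


descriptors = [
    Descriptor(["Fungi"], ["DK"], 400),
    Descriptor(["Fungi"], ["DK"], 100),
    Descriptor(["Fungi", "Plantae"], ["DK", "ES"], 3000),
]
total, skipped = count_specimens(descriptors, "Fungi", "DK")
print(total, skipped)  # 500 1
```

The third descriptor matches a search for fungi in DK, but its 3000 specimens could be all plants from Spain, so it contributes nothing to the count.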

The work is ongoing, but so far both scenarios are allowed and can be mixed freely. That makes it challenging to do any kind of counting across collections/institutions.

We can create synthetic CDs: say, transform [3000 Aves and Mammalia from Spain and Denmark] => [3000 Animalia from Europe]. The result is less precise, but at least we can now provide metrics, albeit at a cruder granularity.
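The transformation above could be sketched roughly as follows. The lookup tables, the dictionary keys and the function names are all hypothetical and for illustration only; a real implementation would presumably resolve taxa against a taxonomic backbone and countries against a proper gazetteer. The idea is to replace a list of values with their most specific shared ancestor.

```python
# Hypothetical lookup tables for illustration only (assumed acyclic).
TAXON_PARENT = {"Aves": "Animalia", "Mammalia": "Animalia"}
REGION_PARENT = {"ES": "EUROPE", "DK": "EUROPE"}


def lineage(value, parents):
    """Return [value, parent, grandparent, ...] by following the lookup table."""
    chain = [value]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain


def common_ancestor(values, parents):
    """Most specific node shared by all lineages, or None if there is none.

    Assumes `values` is non-empty.
    """
    first, *rest = [lineage(v, parents) for v in values]
    for candidate in first:
        if all(candidate in chain for chain in rest):
            return candidate
    return None


def synthesize(cd):
    """Collapse a mixed descriptor to a single taxon and a single region."""
    return {
        "taxon": common_ancestor(cd["taxa"], TAXON_PARENT),
        "region": common_ancestor(cd["countries"], REGION_PARENT),
        "specimens": cd["specimens"],
    }


mixed = {"taxa": ["Aves", "Mammalia"], "countries": ["ES", "DK"], "specimens": 3000}
print(synthesize(mixed))
# {'taxon': 'Animalia', 'region': 'EUROPE', 'specimens': 3000}
```

The specimen count is carried over unchanged; only the taxon and location are coarsened, which is what makes the synthetic descriptor usable for rollups.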

How are they modelled, crawled, processed and linked

Exchange model: Undecided. In many ways, modelling them as a new DarwinCore core type makes good sense. It has all the star schema downsides of DwCA, but also the benefits of being well known and having existing tooling like the IPT. And if DwC evolves into supporting something like Frictionless Data, then the new core type would follow. A whole new format/standard is of course also an option.

Once this is in place, could we consider deprecating metadata-only datasets? Or is there a reason to keep accepting new ones?

Our internal model: The incoming CDs could either be indexed as a regular dataset (say type: metadatav2), or they could not appear in datasets at all and instead live with GrSciColl as a separate entity. Adding them as datasets among others is somewhat in line with @fmendezh's work to create a richer dataset search with support for taxon and location search (based on precomputed occurrence facets).

Linking: The TDWG CD has a field for collection keys; those could simply point to the GrSciColl ID (whatever those end up being).

dagendresen commented 3 years ago

You seem to describe collections as a "collection of data records", while I rather think of the CD collections as a collection of physical specimens. And thus the analogy of treating collections simply as datasets makes no sense in my view of the CD collections concept :-)

MortenHofft commented 3 years ago

Hi @dagendresen thank you for reviving the issue. The TDWG CD standard (in progress) has changed quite a bit since this was written. I'm not up to date, but I suspect that the view above is somewhat dated. But even with that in mind I do not understand your comment. Could you expand please? In what way can CD not be considered a dataset with data records?

dagendresen commented 3 years ago

(I was led to this GitHub issue from the GrSciColl roadmap, listed on the agenda for the Nodes Steering Group meeting yesterday.)

Agree that CD is a data standard (for data records). But the data records in a "CD collection" would describe a real-world collection of real-world physical samples preserved in a natural history museum "collection". I thus see "datasets" and "collections" as fundamentally different.

Data records in a CD collection are defined by which physical specimens are grouped together in the real-world museum collection, a grouping chosen for whatever purpose is useful for preserving the physical specimens.

I was wondering what you meant by "synthetic CDs"?

dagendresen commented 3 years ago

My image is that a museum collection is a real-world thing, and is composed of real-world specimens. And that the specimens are fundamentally distinct from the Darwin Core Occurrence records. Zero, one, or multiple Occurrence records could be linked in different ways to a specimen. In my view, we should rather start with a catalog of the real-world collections and specimens. And then try to link the Occurrences to this catalog -- rather than the other way round.

(I am probably commenting on the wrong thread here) :-)

MortenHofft commented 3 years ago

Thanks Dag - I do not see any disagreement then - just me not being precise enough in the issue then.


> I was wondering what you meant by "synthetic CDs"?

Simply that, at the time of writing, CDs allowed something like specimens: 1000, taxon: Anura, country: Denmark and Sweden. Those kinds of CDs are difficult to use for metrics (like a dashboard). But they can be transformed into the less informative specimens: 1000, taxon: Anura, country: EUROPE, which can be used for metrics. But the standard has evolved so much that considerations on the model are probably less relevant by now.

dagendresen commented 3 years ago

Another potential issue when thinking about metrics (by current practice at the museums) would be that the same specimen could be a member of multiple "collections". Not sure how or if TDWG CD tackles this... :-) Maybe distinguish "thematic collections" from more "fundamental" (??) collections.