gbif / model-material

Data model research focused on richer data for a material catalogue

arctos: tatertots! #110

Closed MortenHofft closed 1 year ago

MortenHofft commented 1 year ago

@dustymc I'm not sure if it is a joke due to missing data or a mistake, but the value tatertots! appears a lot in various files.

dustymc commented 1 year ago

Let's call it a plea for information.

 'tatertots!', -- not a clue but can't be null....

MortenHofft commented 1 year ago

Great - that sounds like something @tucotuco would find useful to know: that we have fields that either need more documentation and/or should allow null values.

I see it in EntityTable: dataset_id

... ohh! - that is it. I was convinced that I had seen it in multiple files. I was wrong. Let us just close this then
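
To check whether a placeholder such as tatertots! really shows up in more than one file, a scan along these lines could help. This is only a sketch: the file names, column names, and sample rows below are made up for illustration and are not taken from the Arctos export.

```python
import csv
import io
from collections import Counter

PLACEHOLDER = "tatertots!"  # the sentinel value reported in this issue


def count_placeholder(files: dict[str, str], placeholder: str = PLACEHOLDER) -> Counter:
    """Count placeholder occurrences per (file name, column name).

    `files` maps a file name to its CSV text; both are hypothetical inputs.
    """
    hits: Counter = Counter()
    for name, text in files.items():
        for row in csv.DictReader(io.StringIO(text)):
            for column, value in row.items():
                if value == placeholder:
                    hits[(name, column)] += 1
    return hits


# Hypothetical sample data, not taken from the actual Arctos export.
sample_files = {
    "entity.csv": "entity_id,dataset_id\n1,tatertots!\n2,tatertots!\n",
    "agent.csv": "agent_id,preferred_name\n7,Example Agent\n",
}

placeholder_hits = count_placeholder(sample_files)
for (name, column), n in sorted(placeholder_hits.items()):
    print(f"{name}:{column} -> {n}")
```

Grouping by file and column makes it easy to see whether the value is confined to one field or spread across the export.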

tucotuco commented 1 year ago

Where are all the places where we have tater tots? BTW, how I miss those.

On Tue, Mar 21, 2023, 10:02 dustymc wrote:

Let's call it a plea for information.

'tatertots!', -- not a clue but can't be null....


Jegelewicz commented 1 year ago

gimme somma your tots!

tucotuco commented 1 year ago

OK, tatertots are for the dataset_id for DigitalEntities. @dustymc Is it nonsense to assign a media item to a dataset? Knowing the reasoning could be sufficient to relax the constraint, which was implemented specifically for this test.

dustymc commented 1 year ago

I don't think I know what a "dataset" is, but probably - media aren't "owned" except by association, and that can involve many collections. "Arctos"??

tucotuco commented 1 year ago

@timrobertson100 asked for this field to be added (with that constraint), I believe as a way to be able to do dataset-level statistics and such, where it was thought that all of the participants would already have dataset identifiers in GBIF. As such, "Arctos" would not fit, as there is no "Arctos" dataset because they are divided up by institutions and collections. This case makes it clear to me that the constraint, at the very least, is not appropriate.

timrobertson100 commented 1 year ago

I'm happy to see the constraint removed, but I'll try and explain the motivation for why it was there.

Data is and will be shared from various sources and integrated. Our clustering detects some of this now, but we don't project the data onto a sensible structure (i.e. we only link records from DwC-A). This approach to a datasetId was probably too naive, but it was an attempt to at least identify the source from which the data originated, which is necessary for attribution, for connecting people (who is the technical contact I can talk to about this data?), and for diagnostics.

Bearing in mind this is geared towards dataset exports (i.e. how GBIF currently operates) and not a live data environment... if you have an idea of how the data might be packaged (by collection, by project, all in one Arctos dataset etc) I'd suggest trying to put in e.g. arctos:collection:<collectionCode> or similar for the time being. In practice, whatever is integrating the data would likely be responsible for populating it.

Does that seem reasonable, please?
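
As a rough illustration of the suggestion above, an interim dataset_id in the arctos:collection:<collectionCode> shape could be built like this. The function name is hypothetical, and using MSB:Mamm as the collection code is an assumption made only for the example.

```python
def placeholder_dataset_id(collection_code: str) -> str:
    """Build an interim dataset_id shaped like arctos:collection:<collectionCode>.

    The prefix follows the suggestion in this thread; the function itself is
    a hypothetical helper, not part of any GBIF or Arctos tooling.
    """
    code = collection_code.strip()
    if not code:
        raise ValueError("collection code must not be empty")
    return f"arctos:collection:{code}"


# Assumed collection code, used here purely as an example.
example_id = placeholder_dataset_id("MSB:Mamm")
print(example_id)
```

Rejecting an empty code keeps the interim identifier from silently turning into another placeholder.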

dustymc commented 1 year ago

Speaking only for myself and in general: Please just use our identifiers. https://arctos.database.museum/collection/MSB:Mamm, https://arctos.database.museum/guid/MSB:Mamm:259087, https://arctos.database.museum/media/10310948, etc are unambiguous in any environment. If something we don't have needs added for "DWC reasons" or to support others or WHATEVER that's fine, but it's not what I'd like to see cited.

Jegelewicz commented 1 year ago

@dustymc I think the issue here is our "media dataset" which doesn't have an "Arctos identifier". It seems fine to make one for the purposes of this exercise? arctos:media:<media>? Or maybe I am misunderstanding what is being asked.

dustymc commented 1 year ago

It seems fine to make one for the purposes of this exercise? arctos:media:<media>?

What is <media> (literal??)

Yes, adding some ephemeral identifier to WHATEVER is fine, I'd just not like to have to try to make sense of it later, and I'd really like to not make anyone think it's a great thing with which to provide attribution.

Jegelewicz commented 1 year ago

What is <media> (literal??)

Good question. I don't know where to find definitions for all of this stuff. I don't see an equivalent in GBIF, but in GRSciColl I see this:

https://www.gbif.org/grscicoll/collection/b20e56a9-1687-4adb-985b-8d8947b7f1ba

[image]

Which feels wrong and seems like it should be MSB:Fish

Anyhow, for the dataset above, what would @timrobertson100 expect to see in dataset_id?

I would expect https://arctos.database.museum/collection/MSB:Fish are we on the same page? And if so, could we just use arctos:media as the dataset_id for our media "dataset"?

dustymc commented 1 year ago

I would expect https://arctos.database.museum/collection/MSB:Fish

See https://github.com/gbif/model-material/issues/110#issuecomment-1480152192 - the media object may be used by 97 collections, not only one. (Zero is another common value - eg lots of habitat images are linked to collecting events which are used by records in collections instead of directly attached to the records.)

Jegelewicz commented 1 year ago

Yes, I understand - the Fish thing is probably just confusing the issue?

Because we do not have a "media" dataset, let's just use "Arctos:Media" or something like that.

tucotuco commented 1 year ago

I think there is a conflation of Dataset (one example and another example) and Collection (example) going on. In the Unified Model scenario, there is just one Arctos dataset and all of the data that come in that dataset should have the same dataset identifier. No need to be fancy, it would be fine to use "Arctos" (or "tatertots").

If you agree, no need to read any further. I'm just going to try to justify what I already said.

Except that this exercise is really to investigate a Material Catalogue, I think it is also artificial to have a dataset_id on Entity and not on any other class.

In the DwC scenario, we published datasets for every Arctos collection via IPT. A Dataset was produced (and versioned) each time the data were published. The Collection wasn't versioned, it was a Collection, not a Dataset. There is nothing in the broader data publishing scenario that says a Collection must be in only one dataset, nor that a Dataset must consist of data from only one Collection. They are different concepts. I think the problem in this exercise is trying to make the dataset_id do the job of both.
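
A minimal sketch of that distinction, assuming nothing about the actual Unified Model schema: the class names, field names, and dataset identifiers below are hypothetical. It only shows that a Collection can appear in several versioned Datasets and that a Dataset can hold several Collections, so a single dataset_id cannot stand in for both.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Collection:
    """A collection is a standing thing; it is not versioned."""
    code: str


@dataclass
class Dataset:
    """A dataset is a published, versioned package of records."""
    dataset_id: str
    version: int
    collections: set[Collection] = field(default_factory=set)


def datasets_for(collection: Collection, datasets: list[Dataset]) -> list[str]:
    """Return 'dataset_id@vN' for every dataset that includes the collection."""
    return [
        f"{d.dataset_id}@v{d.version}"
        for d in datasets
        if collection in d.collections
    ]


mamm = Collection("MSB:Mamm")
fish = Collection("MSB:Fish")

# Hypothetical dataset identifiers, for illustration only.
datasets = [
    Dataset("per-collection-export", 1, {mamm}),
    Dataset("per-collection-export", 2, {mamm}),
    Dataset("combined-export", 1, {mamm, fish}),
]

mamm_memberships = datasets_for(mamm, datasets)
fish_memberships = datasets_for(fish, datasets)
print(mamm_memberships)
print(fish_memberships)
```

Here the same collection shows up in two versions of one dataset and in a second, combined dataset, which is the many-to-many relationship described above.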

dustymc commented 1 year ago

They are different concepts.

Yes, and I've known that from time to time, probably....

We've also done some weird thing where Dave stands on one foot facing East and prefixes a bunch of whatevertheyare with "Arctos: ImSureStringsWillWorkHereEvenThoughThatnNeverHappenedBeforeEver:" or somesuch nonsense in a probably-futile attempt to -- uhhh, I'm not actually sure why, but it probably seemed like a good idea to someone at the time, and I will admit to finding it handy to be able to filter (ish, probably) Arctos collections in the IPT. Anyway, I think we'd like to have another identifier.

I'll change tatertots! to https://arctos.database.museum for next export.

Jegelewicz commented 1 year ago

institution - there's no such thing in Arctos

yes and no? Every collection has an "Institution"

[image]

dustymc commented 1 year ago

True - there's a string (couple of them maybe), there's not a data object or even any attempted control of the string.

timrobertson100 commented 1 year ago

I'll change tatertots! to https://arctos.database.museum/ for next export.

Thanks, @dustymc - for this and all the other interactions.

I think it is also artificial to have a dataset_id on Entity and not on any other class.

Agree, and it's actually why I said a "naive attempt" above. I'm hoping to demonstrate some links detected from clustering in the not-too-distant future, and the entity.datasetID is sufficient for that.

Jegelewicz commented 1 year ago

there's not a data object or even any attempted control of the string.

Perhaps we need an institution metadata page? Or is the agent good enough? https://arctos.database.museum/agent/21334566

tucotuco commented 1 year ago

The Institution as Agent looks great to me. Fits the model with no tweaking also.

Jegelewicz commented 1 year ago

@dustymc If this is true - we probably need a stronger link within Arctos between the collections and the institution agent?

dustymc commented 1 year ago

It's been handy to let the agent carry some of the stuff that the collection can't, so maybe. Parent-child relationships would do ^^ that stuff, and maybe whatever weird thing CHAS (at least) is doing by involving dates in the relationships (the collections have changed parents, maybe). Issue....