MortenHofft closed this issue 1 year ago
Let's call it a plea for information.
'tatertots!', -- not a clue but can't be null....
Great - that sounds like something @tucotuco would find useful to know: that we have fields that either need more documentation or should allow null values.
I see it in EntityTable: dataset_id
... ohh! - that is it. I was convinced that I had seen it in multiple files. I was wrong. Let us just close this then
Where are all the places where we have tater tots? BTW, how I miss those.
On Tue, Mar 21, 2023, 10:02 dustymc @.***> wrote:
Let's call it a plea for information.
'tatertots!', -- not a clue but can't be null....
gimme somma your tots!
OK, tatertots are for the dataset_id for DigitalEntities. @dustymc Is it nonsense to assign a media item to a dataset? Knowing the reasoning could be sufficient to relax the constraint, which was implemented specifically for this test.
I don't think I know what a "dataset" is, but probably - media aren't "owned" except by association, and that can involve many collections. "Arctos"??
@timrobertson100 asked for this field to be added (with that constraint), I believe as a way to be able to do dataset-level statistics and such, where it was thought that all of the participants would already have dataset identifiers in GBIF. As such, "Arctos" would not fit, as there is no "Arctos" dataset because they are divided up by institutions and collections. This case makes it clear to me that the constraint, at the very least, is not appropriate.
I'm happy to see the constraint removed, but I'll try and explain the motivation for why it was there.
Data is and will be shared from various sources and integrated. Our clustering detects some of this now, but we don't project the data onto a sensible structure (i.e. we only link records from DwC-A). Probably this approach for a datasetId was too naive, but it was an attempt to at least identify the source from which the data originated, which is necessary for attribution, connecting people (who is the technical contact I can talk to about this data?), and diagnostics.
Bearing in mind this is geared towards dataset exports (i.e. how GBIF currently operates) and not a live data environment... if you have an idea of how the data might be packaged (by collection, by project, all in one Arctos dataset etc) I'd suggest trying to put in e.g. arctos:collection:<collectionCode> or similar for the time being. In practice, whatever is integrating the data would likely be responsible for populating it.
Does that seem reasonable, please?
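For illustration only, a minimal Python sketch of how an export step might fill in such a provisional value. The function name and the "arctos" default are assumptions made for this example, not anything defined by the model or by Arctos; only the arctos:collection:<collectionCode> pattern comes from the suggestion above.

```python
def provisional_dataset_id(collection_code: str, source: str = "arctos") -> str:
    """Build a placeholder dataset_id of the form <source>:collection:<collectionCode>.

    Hypothetical helper: the pattern follows the suggestion in this thread,
    the function itself is not part of any existing export code.
    """
    code = collection_code.strip()
    if not code:
        # The column cannot be null, so refuse to build an empty identifier.
        raise ValueError("collection_code must not be empty")
    return f"{source}:collection:{code}"


print(provisional_dataset_id("MSB:Mamm"))
```

Whatever integrates the data could call something like this once per collection and write the result into dataset_id.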
Speaking only for myself and in general: Please just use our identifiers. https://arctos.database.museum/collection/MSB:Mamm, https://arctos.database.museum/guid/MSB:Mamm:259087, https://arctos.database.museum/media/10310948, etc are unambiguous in any environment. If something we don't have needs added for "DWC reasons" or to support others or WHATEVER that's fine, but it's not what I'd like to see cited.
@dustymc I think the issue here is our "media dataset" which doesn't have an "Arctos identifier". It seems fine to make one for the purposes of this exercise? arctos:media:<media>? Or maybe I am misunderstanding what is being asked.
It seems fine to make one for the purposes of this exercise? arctos:media:<media>?
What is <media> (literal??)
Yes, adding some ephemeral identifier to WHATEVER is fine, I'd just not like to have to try to make sense of it later, and I'd really like to not make anyone think it's a great thing with which to provide attribution.
What is <media> (literal??)
Good question. I don't know where to find definitions for all of this stuff. I don't see an equivalent in GBIF, but in GRSciColl I see this:
https://www.gbif.org/grscicoll/collection/b20e56a9-1687-4adb-985b-8d8947b7f1ba
Which feels wrong and seems like it should be MSB:Fish
Anyhow, for the dataset above, what would @timrobertson100 expect to see in dataset_id?
I would expect https://arctos.database.museum/collection/MSB:Fish - are we on the same page? And if so, could we just use arctos:media as the dataset_id for our media "dataset"?
I would expect https://arctos.database.museum/collection/MSB:Fish
See https://github.com/gbif/model-material/issues/110#issuecomment-1480152192 - the media object may be used by 97 collections, not only one. (Zero is another common value - eg lots of habitat images are linked to collecting events which are used by records in collections instead of directly attached to the records.)
Yes, I understand - the Fish thing is probably just confusing the issue?
Because we do not have a "media" dataset, let's just use "Arctos:Media" or something like that.
I think there is a conflation of Dataset (one example and another example) and Collection (example) going on. In the Unified Model scenario, there is just one Arctos dataset and all of the data that come in that dataset should have the same dataset identifier. No need to be fancy, it would be fine to use "Arctos" (or "tatertots").
If you agree, no need to read any further. I'm just going to try to justify what I already said.
Except that this exercise is really to investigate a Material Catalogue, I think it is also artificial to have a dataset_id on Entity and not on any other class.
In the DwC scenario, we published datasets for every Arctos collection via IPT. A Dataset was produced (and versioned) each time the data were published. The Collection wasn't versioned, it was a Collection, not a Dataset. There is nothing in the broader data publishing scenario that says a Collection must be in only one dataset, nor that a Dataset must consist of data from only one Collection. They are different concepts. I think the problem in this exercise is trying to make the dataset_id do the job of both.
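To make the distinction concrete, here is a small Python sketch under stated assumptions: the class names, fields, and the "example-mamm-dataset" identifier are invented for illustration and are not taken from the Unified Model schema. It only shows that a Collection can appear in more than one Dataset and a Dataset can span more than one Collection.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Collection:
    # A collection is a standing thing; it is not versioned.
    identifier: str


@dataclass
class Dataset:
    # A dataset is a published package; a new version is produced on each publication.
    identifier: str
    version: int
    collections: list[Collection] = field(default_factory=list)


def datasets_containing(collection, datasets):
    """Return the identifiers of every dataset that includes the given collection."""
    return [d.identifier for d in datasets if collection in d.collections]


mamm = Collection("https://arctos.database.museum/collection/MSB:Mamm")
fish = Collection("https://arctos.database.museum/collection/MSB:Fish")

# DwC-style: one dataset per collection (identifier is hypothetical).
per_collection = Dataset("example-mamm-dataset", version=1, collections=[mamm])
# Unified-Model-style: a single Arctos dataset covering several collections.
combined = Dataset("https://arctos.database.museum/", version=1, collections=[mamm, fish])

print(datasets_containing(mamm, [per_collection, combined]))
```

A single dataset_id column cannot carry both relationships, which is the tension described above.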
They are different concepts.
Yes, and I've known that from time to time, probably....
We've also done some weird thing where Dave stands on one foot facing East and prefixes a bunch of whatevertheyare with "Arctos: ImSureStringsWillWorkHereEvenThoughThatnNeverHappenedBeforeEver:" or somesuch nonsense in a probably-futile attempt to -- uhhh, I'm not actually sure why, but it probably seemed like a good idea to someone at the time, and I will admit to finding it handy to be able to filter (ish, probably) Arctos collections in the IPT. Anyway, I think we'd like to have another identifier.
I'll change tatertots! to https://arctos.database.museum for next export.
institution - there's no such thing in Arctos
yes and no? Every collection has an "Institution"
True - there's a string (couple of them maybe), there's not a data object or even any attempted control of the string.
I'll change tatertots! to https://arctos.database.museum/ for next export.
Thanks, @dustymc - for this and all the other interactions.
I think it is also artificial to have a dataset_id on Entity and not on any other class.
Agree and it's actually why I said a "naive attempt" above. I'm hoping to demonstrate some links detected from clustering in the not-too-distant future and the entity.datasetID is sufficient for that.
there's not a data object or even any attempted control of the string.
Perhaps we need an institution metadata page? Or is the agent good enough? https://arctos.database.museum/agent/21334566
The Institution as Agent looks great to me. Fits the model with no tweaking also.
@dustymc If this is true - we probably need a stronger link within Arctos between the collections and the institution agent?
It's been handy to let the agent carry some of the stuff that the collection can't, so maybe. Parent-child relationships would do ^^ that stuff, and maybe whatever weird thing CHAS (at least) is doing by involving dates in the relationships (the collections have changed parents, maybe). Issue....
@dustymc I'm not sure if it is a joke due to missing data or a mistake, but the value tatertots! appears a lot in various files.
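One way to check where the value actually occurs is to count it per file and column. The following is a rough Python sketch that assumes the export is a directory of UTF-8 CSV files with header rows; that layout and the function name are assumptions for this example, not a description of the real export.

```python
import csv
from collections import Counter
from pathlib import Path

PLACEHOLDER = "tatertots!"


def count_placeholder(export_dir, placeholder=PLACEHOLDER):
    """Count exact occurrences of the placeholder, keyed by (file name, column name)."""
    counts = Counter()
    for path in sorted(Path(export_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                for column, value in row.items():
                    # Only exact matches count; partial matches are ignored.
                    if value == placeholder:
                        counts[(path.name, column)] += 1
    return counts
```

Calling count_placeholder on the export directory returns a Counter; if every key points at the same file and column, the value lives in one place only.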