Closed ewommack closed 2 years ago
Yes, agree we need to provide some of our use cases, especially entities and also presence/absence data - which we capture for parasites, need to develop for pathogens, environmental DNA.
Also see the issue raised by David Shorthouse on the need for stable identifiers and the discussion at https://github.com/gbif/pipelines/issues/677. Tim responded by saying there are discussions that GBIF will need to get into the business of minting stable identifiers. They will continue to recognize stable IDs and URLs from providers.
Link to discussions on Discourse: https://discourse.gbif.org/c/data-publishing/gbif-data-model/26
See https://github.com/ArctosDB/internal/issues/168
I don't think that includes entities because they are (or should be!) fairly trivial eg GBIF already handles them reasonably well.
Leave an example of the parasite thing over there and I'll include it in the next iteration.
I don't think that includes entities because they are (or should be!) fairly trivial eg GBIF already handles them reasonably well.
From the webinar it didn't sound like they were really planning anything that would include or work with entities. When you have one that occurs across multiple places in time for multiple parts and uses, but is all the same individual?
I could see some of our objects and entities spanning several of the data models they are developing as well. For example: an eagle caught on a camera trap that also has a blood sample that was genetically sequenced, has an egg specimen from a nest with digital media attached, and is part of a taxonomic checklist from eBird that the researcher used to track occurrence of point counts at a site when doing a survey.
Entities can be just about anything, at least for now all we need to care about is those used for Organisms.
https://dwc.tdwg.org/list/#dwc_organismID (splat-f organismID because for some crazy reason that thing eats anchors) is the concept.
occurs across multiple places in time
I think that fits OK in the GUM.
From a slightly different perspective, those places and times are derived data - they're something that some Organism's "child" records (Occurrences for now, maybe something a bit more generic going forward) have done, and the Organism itself is just a unifying identifier.
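A minimal sketch of that idea, assuming a made-up record shape (the `Occurrence` class, its field names, and the sample identifiers below are hypothetical, not Arctos or GBIF structures): the Organism carries nothing but an identifier, and its places and times are derived by grouping the child records that share it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    """Hypothetical child record; only organism_id ties records together."""
    occurrence_id: str
    organism_id: str
    locality: str
    event_date: str  # ISO 8601 date, so string sorting is chronological


def organism_history(occurrences):
    """Derive each organism's places and times from its child records."""
    history = defaultdict(list)
    for occ in occurrences:
        history[occ.organism_id].append((occ.event_date, occ.locality))
    # Sort each organism's visits by date.
    return {org: sorted(visits) for org, visits in history.items()}


records = [
    Occurrence("occ-3", "org-eagle-1", "nest site", "2021-06-10"),
    Occurrence("occ-1", "org-eagle-1", "camera trap", "2020-04-02"),
    Occurrence("occ-2", "org-mouse-7", "trap line", "2020-05-15"),
]
history = organism_history(records)
```

Nothing is stored on the organism itself here; drop the child records and the history disappears, which is the sense in which the places and times are derived data.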
I think there's some text about organisms in the docs linked from the internal issue.
need for stable identifiers
The not-Occurrences model would completely change that landscape for Arctos. We're currently providing OccurrenceIDs, which we make up on the fly as the price of admission for using the Occurrence model. They have no stability because they're not "real." (They are persistently resolvable, however - they still lead to catalog records even when the things we use to generate them have changed.) Getting out of that model should move us towards something where our actual stable resolvable identifiers assigned to the things we actually catalog are central.
presence/absence data - which we capture for parasites,
We should definitely pose this use case. We have the data to model as well. It is maybe not all we wish it could be, but doing this process will hopefully make it better!
Yes, and the MSB and MEPA communities are actively pushing for development of a pathogen model, so I will post that separately as well since we need to initiate that discussion in Github. @jldunnum
Anything I can help with here? In the hope that some comments might help...
In the opening comment, my email should be gtuco.btuco@gmail.com.
It sounds like Arctos Entities are comparable to GUM EntityOfInterests. There is no constraint on the type of Entity that might be. The vocabulary for EntityType is wide open.
I would be happy to help come up with use cases. The ones we want to develop initially should bring a new challenge that can't be met with the Darwin Core Archive star schema. However, we are already thinking of developing use cases that bring nothing new to the GUM model, but that would help a particular community understand how a particular problem is resolved with the GUM and an associated publishing model. An example of this is the treatment of the combination of an ocean trawl with the lots of fish-like things that come from that and further individual preparations within the lots that are not separated out as individual specimens with their own identifiers. The model has no problem handling that, but people might have a problem handling that with the model if it isn't demonstrated. The same criterion could be applied to potential new use cases.
BTW, our current highest priority for development is enshrined in the Arctos down with Occurrences issue.
Though they do seem really overwhelmed as well in the discussion. They said they wanted more, and then the next sentence they say they haven't caught up with things and not gotten through what they have.
I'd like to clarify this. We are not overwhelmed, we are prioritizing and working on several levels at once using the agile development paradigm. That's why a version of the IPT can already publish Camera Trap DP data in the Frictionless Data format even while we are still ironing out the GUM. We began with a list of use cases that had the potential to each bring something new to the GUM. We have developed 11 of those to the stage of being ready for public review. With those 11 in hand, we are taking some further quickly because we have excellent engagement from the interested community, while others are awaiting review by stakeholders, others aren't written yet, and others are being solicited. With two of us, and only me doing the modeling, we have to choose wisely and be efficient. Wisely means there is obvious potential for the impact/effort ratio.
Presence/absence of parasites is definitely novel, as the location in that case is an Organism (with its geographic location, of course), but the point isn't so much where the parasite occurs geographically as the parasite load, though the former is tractable from the locations of the hosts. It also may be nicely integrated with biotic interactions (use case which see).
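To make that concrete, here is a rough sketch under stated assumptions (the `Host` and `ParasiteExam` classes, their fields, and the sample values are invented for illustration and are not part of the GUM or Arctos): each examination points at a host Organism, a count of zero records an absence, and the geography is looked up from the host rather than stored on the parasite record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical host organism with its own geographic location."""
    organism_id: str
    latitude: float
    longitude: float


@dataclass(frozen=True)
class ParasiteExam:
    """Hypothetical examination of one host for one parasite taxon."""
    host_id: str
    parasite_taxon: str
    count: int  # 0 means examined and none found (an absence)


def summarize(exams, hosts):
    """Report presence/absence and load, with geography taken from the host."""
    rows = []
    for exam in exams:
        host = hosts[exam.host_id]
        rows.append({
            "host": exam.host_id,
            "taxon": exam.parasite_taxon,
            "present": exam.count > 0,
            "load": exam.count,
            "latitude": host.latitude,
            "longitude": host.longitude,
        })
    return rows


hosts = {"org-mouse-7": Host("org-mouse-7", 35.08, -106.62)}
exams = [
    ParasiteExam("org-mouse-7", "Siphonaptera", 4),
    ParasiteExam("org-mouse-7", "Cestoda", 0),
]
rows = summarize(exams, hosts)
```

The second row is the absence: the host was examined for that taxon and nothing was found, yet it still has a location because the host does.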
From the webinar it didn't sound like they were really planning anything that would include or work with entities. When you have one that occurs across multiple places in time for multiple parts and uses, but is all the same individual?
Maybe I don't understand Arctos entities, but if I do, the EntityOfInterest in the GUM is the corollary. In many use cases at least one of the types of EntityOfInterest is an Organism. Sometimes it is an identified Organism (one you can point to), sometimes it is some Organism (a proxy because you don't have any persistent evidence). So, in the GUM, Organisms running around can be EntityOfInterests on multiple occasions (Events) with distinct evidence each time. If people get their identifiers in order, we could even put them all together across independent data sets in a view on the Event history of that Organism and its evidence. In principle an EntityOfInterest can be anything. And there can be proximate and ultimate EntityOfInterests. A proximate one could be a mouse from a trap (entityType "dwc:Organism") while another might be the small mammal species diversity of a park (entityType "geographic species diversity" or something). If I have Arctos entities wrong, I apologize and will do my best to rectify the situation with assistance welcome.
[entity] occurs across multiple places in time
I think that fits OK in the GUM.
It does. Very well, if you identify (put an identifier on) your entity.
From a slightly different perspective, those places and times are derived data - they're something that some Organism's "child" records (Occurrences for now, maybe something a bit more generic going forward) have done, and the Organism itself is just a unifying identifier.
Exactly.
The not-Occurrences model would completely change that landscape for Arctos. We're currently providing OccurrenceIDs, which we make up on the fly as the price of admission for using the Occurrence model. They have no stability because they're not "real." (They are persistently resolvable, however - they still lead to catalog records even when the things we use to generate them have changed.) Getting out of that model should move us towards something where our actual stable resolvable identifiers assigned to the things we actually catalog are central.
Yes, such as identifiers for MaterialEntities. GBIF is keen on building a Material index, and records that come in with persistent resolvable identifiers would play best in that sandbox.
Presence/absence of parasites is definitely novel, as the location in that case is an Organism (with its geographic location, of course), but the point isn't so much where the parasite occurs geographically as the parasite load, though the former is tractable from the locations of the hosts. It also may be nicely integrated with biotic interactions (use case which see).
A great opportunity to think about "collecting" event in multiple ways.
Maybe I don't understand Arctos entities
You do, they're just identifiers. (Not really, but that's probably close enough for those used as Organisms at the moment.)
even put them all together across independent data sets
Exactly. Entities do nothing(ish) new within Arctos, but they're prettier (maybe) in spreadsheets, and you can grab one, stick it in your Excel database, and semi-automagically join the party at GBIF (or anywhere Organisms are compiled).
identifiers for MaterialEntities
FYI that's a social problem at this point - they're actually functional in Arctos, but I need some committed buy-in before I can share them (without making Arctos look broken - and rightly so - when they vaporize).
think about "collecting" event in multiple ways
Maybe I'm not seeing something obvious, but that entire discussion seems utterly incapable of crossing into reality from here. If we recorded all events to the cubic centimeter and second that might almost be not-quite-realistic, but the reality is that we have things like "Indiana, before tomorrow" and that simply cannot be used to stitch hosts and parasites (or much of anything else) back together. Fortunately we have no need to make event-based inferences: we can just make direct unambiguous assertions.
I am talking about the "collection" of the parasite from the host which is an event of a sort that we do not currently track.
Adding important links here
Discourse, Diversifying the GBIF data model, Webinar recording
Parasites might be a narrative for 14: Humboldt Core Monitoring and Absence data - we examined and either did or did not find them...
@ewommack re your catalog question, this is from the Diversifying the GBIF data model document:
Catalogue services

This section is not yet drafted.

It is envisaged that this section will cover the following:

- What is meant by a catalogue service
- Proposals for potential catalogues and what they could enable. These may include for example:
  - A material catalogue enabling search and access of physical material such as specimens
  - Absence data, suitable for downloading as a dataset for use in modelling
  - Exploration of sites, such as long term monitoring plots and the datasets that are available to access
  - Exploration of the events that result in the evidence for asserting a species occurring at a place and time
  - Location based gazetteers and services to support georeferencing activities and data reporting
  - Exploration of species interactions - or evidence of them
  - Richer dataset search allowing for discovery of datasets that cover specific concepts (e.g. a particular kind of measurement)
  - PIDs for supported concepts
  - Linked Open Data resolution for supported concepts
17: Palaeontology, Zooarchaeology or Archaeobotany related topics
@Nicole-Ridgwell-NMMNHS @aklompma @cefilipek @lmtabak @wellerjes @mvzhuang any interest in coming up with paleo use cases?
Agents has been mentioned as a potential catalog - might be a relatively easy way to have an impact.
In the GBIF Community Webinar on the proposed Grand Unifying Model, they said that they are looking for new Use Cases. A couple of things specific to Arctos that I thought might be good to push forward:
To recommend new Use Cases, contact through: trobertson@gbif.org, stucco.btuco@gmail.com or bit.ly/data-model-forum
Though they do seem really overwhelmed as well in the discussion. They said they wanted more, and then the next sentence they say they haven't caught up with things and not gotten through what they have.