Open djtfmartin opened 2 years ago
Edited, as issue reworked to address this
~Before this goes too far in implementation, I'd encourage a review against the work GBIF is doing on diversifying the data model. See slide 4 and case study 12 and case study 1 in particular. There will be a public webinar on the GBIF model on the 12th of April (tentative date still).~
~I don't think there is anything in the diagram above that isn't covered in the wider model, and it would likely be better to align this early in the process if it makes sense to do so.~
~Some areas that we might consider are splitting MOF into the quantitative and qualitative assertions, using the same terminology throughout (e.g. MaterialEntity, Location etc), whether we want concrete subclasses (SiteVisit) or just events with an eventType, and whether we want to consider separating the physical/digital material linked to the events that produced them, rather than having material (sample) being a subclass of event. That model analysis may have happened already in ALA, but I think it would be good to get those involved in modeling together first.~
On the whole, I don't think we'll need to do much in the way of special AVROness. My suggestion is that Survey, SiteVisit, Sample, Observation etc. are all rolled into a common event schema with an eventType and some validation rules on how parent and child events hang together.
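To make the "common event schema plus validation rules" idea concrete, here is a minimal sketch in plain Python. The `Event` class, the `ALLOWED_PARENTS` table and the specific parent/child rules are all assumptions made for illustration; they are not taken from the pipelines code or from any agreed vocabulary.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical rule table: which parent eventType each eventType may hang under.
# None means the event is allowed to be a root (no parent).
ALLOWED_PARENTS = {
    "Survey": {None},
    "SiteVisit": {"Survey"},
    "Sample": {"SiteVisit"},
    "Observation": {"SiteVisit", "Sample"},
}


@dataclass
class Event:
    """One record in the common event schema (field names are illustrative)."""
    event_id: str
    event_type: str
    parent_event_id: Optional[str] = None


def validate(events):
    """Return a list of problems; an empty list means the hierarchy is valid."""
    by_id = {e.event_id: e for e in events}
    problems = []
    for e in events:
        if e.event_type not in ALLOWED_PARENTS:
            problems.append(f"{e.event_id}: unknown eventType {e.event_type!r}")
            continue
        if e.parent_event_id is None:
            parent_type = None
        elif e.parent_event_id not in by_id:
            problems.append(f"{e.event_id}: missing parent {e.parent_event_id!r}")
            continue
        else:
            parent_type = by_id[e.parent_event_id].event_type
        if parent_type not in ALLOWED_PARENTS[e.event_type]:
            problems.append(
                f"{e.event_id}: {e.event_type} cannot hang under {parent_type}"
            )
    return problems
```

With this shape, adding a new kind of event means adding a row to the rule table rather than a new schema.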
Similarly, I expect whatever is in Occurrence to be handled the same way that occurrences are handled in the current pipeline.
I'm not sure about ExtendedMeasurementOrFact. I'm looking through the pipelines to work out whether it can easily be attached to an event or to an occurrence record, but I believe that this is already done.
The major difficulty that I foresee is percolating information down the event hierarchy and out to occurrences. It looks like an iterative process.
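As a rough illustration of the percolation step, the sketch below fills missing fields on child events from their ancestors and then on occurrences from their event. It is an in-memory sketch using memoised recursion, not a Beam pipeline. The function name `inherit`, the dict-based records and the default list of inherited fields are assumptions; only `eventID` and `parentEventID` are existing Darwin Core terms.

```python
def inherit(events, occurrences, fields=("locality", "eventDate")):
    """Fill missing fields from ancestors, then from events to occurrences.

    Returns (resolved_events_by_id, resolved_occurrences). Inputs are not mutated.
    """
    by_id = {e["eventID"]: e for e in events}
    resolved = {}

    def resolve(event_id, seen=()):
        if event_id in resolved:
            return resolved[event_id]
        if event_id in seen:
            raise ValueError(f"cycle in event hierarchy at {event_id}")
        event = dict(by_id[event_id])
        parent_id = event.get("parentEventID")
        if parent_id in by_id:
            # Resolve the parent first so values flow down from the root.
            parent = resolve(parent_id, seen + (event_id,))
            for f in fields:
                if event.get(f) is None:
                    event[f] = parent.get(f)
        resolved[event_id] = event
        return event

    for e in events:
        resolve(e["eventID"])

    out = []
    for occ in occurrences:
        occ = dict(occ)
        event = resolved.get(occ.get("eventID"), {})
        for f in fields:
            if occ.get(f) is None:
                occ[f] = event.get(f)
        out.append(occ)
    return resolved, out
```

In a distributed setting the same effect would likely need one join per level of the hierarchy, which is where the iterative character comes from.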
> My suggestion is that Survey, SiteVisit, Sample, Observation etc. are all rolled into a common event schema with an eventType and some validation rules on how parent and child events hang together.
Edited, as issue reworked to address this
~Thanks, @charvolant, I think that is more in line with the kind of thing I had anticipated. I know we want to be pragmatic here and move quickly with demonstrations, but perhaps a few things we might consider before implementing the Avro schemas suggested in the original post:~
- ~`eventType` to the event records and a controlled vocabulary (`siteVisit`, `materialCollection` etc)~
- ~`MaterialEntity` as the result of some of the events. Like the Event, Material is hierarchical. This would allow us to refer to the likes of soil that has been collected for subsequent DNA processing, forming the evidence for the resulting species occurrence.~

~Would it make sense to you that we review this together with @tucotuco perhaps?~
> The major difficulty that I foresee is percolating information down the event hierarchy and out to occurrences. It looks like an iterative process.
Yes, I had also wondered if Beam would be sufficient for this for that reason.
Just quickly, because I have to head out for another day in the field: the summary @timrobertson100 gave in the previous comment looks right on target.
There is now a proposal for the `eventType` term in DwC.
The schema registry sandbox now has event core with the `eventType` term. The https://ipt.gbif.org/ installation used for testing has this updated for use. I have informed the Humboldt task group, who are similarly preparing exemplar datasets while testing their extension. If we have any example datasets that are OK to go on public URLs while we explore this, please ping me and I can load them into the test IPT.
Hi @timrobertson100
You can explore http://ipt.gbif.pt/ipt/resource.do?r=benthic-alentejo-2011 . Currently this dataset documents its event hierarchy type using the "type" field (common practice in EurOBIS datasets due to the lack of a better fitting DwC term). Ideally, the information in this field ("sample", "station", "cruise") should be moved to the new eventType field.
Please let me know if this example dataset fits your purpose :)
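The move from "type" to `eventType` described above could be sketched as a small per-record transform. This is only an illustration: the function name and the dict-based record are assumptions, and the values are copied as-is, without mapping them to any controlled vocabulary.

```python
def move_type_to_event_type(record):
    """Return a copy of the record with a non-empty 'type' moved to 'eventType'.

    If 'eventType' is already populated, the record is returned unchanged so
    that an explicit value is never overwritten.
    """
    out = dict(record)
    value = (out.get("type") or "").strip()
    if value and not out.get("eventType"):
        out["eventType"] = value
        del out["type"]
    return out
```

For example, a record with `"type": "cruise"` and no `eventType` comes back with `"eventType": "cruise"` and no "type" key.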
Thank you @rubenpp7 - we had wondered if anyone was doing this already. We'll look at that.
Hey Tim, Ruben, thank you two so much for this!! I just wanted to leave this issue and the paper here because using the "type" field this way has been around since 2017 and I thought it could be interesting for the use case:
https://github.com/iobis/env-data/issues/4 http://bdj.pensoft.net/articles.php?id=10989&instance_id=3385375
This is an old issue created at the start of the project. It has been split into several more specific issues. Marking as done.
This is an epic to help track the implementation of the prototype extended data model in pipelines. The more detailed prototype model with fields is here (developed by @charvolant): https://user-images.githubusercontent.com/444897/159949918-18f7e168-ce82-45ea-8aca-eb0d18b1bba4.svg
New AVRO schemas
New AVRO schemas are required for the following entities:
- `Occurrence` - the `BasicRecord` AVRO schema is roughly the equivalent of an `Occurrence` entity in the new extended data model. Suggest this is used as a starting point, but it also needs to support references to the `EventCoreRecord`.
- `ExtendedMeasurementOrFact` - possibly 2 schemas, 1 for quantitative and 1 for qualitative assertions

The current `EventCoreRecord` AVRO schema should suffice to contain the dataset information associated with the event in a denormalized fashion.

The `Multimedia` AVRO schema can be reused in the event pipelines.

Tasks
- `eventType` as a new term to DwC (@tucotuco)
- `eventType` to Event schema for DwC-A in the rs.gbif.org namespace (@timrobertson100)
- `eventType` vocabulary (@charvolant)
- `eventType`
- `Occurrence` AVRO schema
- `ExtendedMeasurementOrFact` AVRO schema
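
The quantitative/qualitative split suggested above for `ExtendedMeasurementOrFact` could, as a first approximation, be decided by whether `measurementValue` parses as a number. The sketch below assumes exactly that rule and dict-based rows; the function name is hypothetical and a real pipeline may well need a richer test (units, ranges, vocabulary lookups).

```python
def split_measurements(rows):
    """Split measurement rows into (quantitative, qualitative) lists.

    A row counts as quantitative when its measurementValue parses as a float;
    the parsed number replaces the original string in the output copy.
    """
    quantitative, qualitative = [], []
    for row in rows:
        raw = row.get("measurementValue")
        try:
            value = float(raw)
        except (TypeError, ValueError):
            # Missing or non-numeric values are treated as qualitative assertions.
            qualitative.append(dict(row))
        else:
            quantitative.append({**row, "measurementValue": value})
    return quantitative, qualitative
```

Each output list could then be written with its own schema, one carrying a numeric value and the other a text value.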