gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Implement AVRO schemas and beam transforms for prototype extended data model #694

Open djtfmartin opened 2 years ago

djtfmartin commented 2 years ago

This is an epic to help track the implementation of the prototype extended data model in pipelines. The more detailed prototype model with fields is here (developed by @charvolant): https://user-images.githubusercontent.com/444897/159949918-18f7e168-ce82-45ea-8aca-eb0d18b1bba4.svg

New AVRO schemas

New AVRO schemas are required for the following entities:

The current EventCoreRecord AVRO schema should suffice to contain the dataset information associated with the event in a denormalized fashion.

The Multimedia AVRO schema can be reused in the event pipelines.

Tasks

timrobertson100 commented 2 years ago

Edited, as issue reworked to address this

~Before this goes too far in implementation, I'd encourage a review against the work GBIF is doing on diversifying the data model. See slide 4 and case study 12 and case study 1 in particular. There will be a public webinar on the GBIF model on the 12th of April (tentative date still).~

~I don't think there is anything in the diagram above that isn't covered in the wider model, and it would likely be better to align this early in the process if it makes sense to do so.~

~Some areas that we might consider are splitting MOF into the quantitative and qualitative assertions, using the same terminology throughout (e.g. MaterialEntity, Location etc), whether we want concrete subclasses (SiteVisit) or just events with an eventType, and whether we want to consider separating the physical/digital material linked to the events that produced them, rather than having material (sample) being a subclass of event. That model analysis may have happened already in ALA, but I think it would be good to get those involved in modeling together first.~

charvolant commented 2 years ago

On the whole, I don't think we'll need to do much in the way of special AVROness. My suggestion is that Survey, SiteVisit, Sample, Observation etc. are all rolled into a common event schema with an eventType and some validation rules on how parent and child events hang together.
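As a rough illustration of that suggestion, a single event record with an eventType plus a parent/child rule table might look like the sketch below. This is not the schema used in pipelines: the record name, namespace, the `eventID`/`parentEventID` field choices and the `ALLOWED_PARENTS` rules are all assumptions made up for the example.

```python
import json

# Hypothetical Avro record definition, written as a JSON-compatible dict.
# Everything except the idea of an eventType field is an assumption.
EVENT_SCHEMA = {
    "type": "record",
    "name": "EventSketch",
    "namespace": "example.sketch",
    "fields": [
        {"name": "eventID", "type": "string"},
        {"name": "parentEventID", "type": ["null", "string"], "default": None},
        {"name": "eventType", "type": "string"},
    ],
}

# Assumed rule table: the parent eventType each child eventType may hang under.
# None means "no parent" (a root event).
ALLOWED_PARENTS = {
    "Survey": {None},
    "SiteVisit": {"Survey"},
    "Sample": {"SiteVisit"},
    "Observation": {"SiteVisit", "Sample"},
}


def schema_json():
    """Serialize the sketch schema to Avro's JSON schema notation."""
    return json.dumps(EVENT_SCHEMA, indent=2)


def validate_hierarchy(events):
    """Return a list of problems; an empty list means the hierarchy is consistent."""
    by_id = {e["eventID"]: e for e in events}
    problems = []
    for e in events:
        parent_id = e.get("parentEventID")
        if parent_id is not None and parent_id not in by_id:
            problems.append(f"{e['eventID']}: unknown parent {parent_id}")
            continue
        parent_type = by_id[parent_id]["eventType"] if parent_id is not None else None
        allowed = ALLOWED_PARENTS.get(e["eventType"])
        if allowed is None:
            problems.append(f"{e['eventID']}: unknown eventType {e['eventType']}")
        elif parent_type not in allowed:
            problems.append(
                f"{e['eventID']}: {e['eventType']} cannot have parent type {parent_type}"
            )
    return problems
```

Keeping the rules in a plain lookup table means new event types can be added without touching the schema itself, which is the main appeal of a single common record over concrete subclasses.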

Similarly, I expect whatever is in Occurrence to be handled the same way that occurrences are handled in the current pipeline.

I'm not sure about ExtendedMeasurementOrFact. I'm looking through the pipelines trying to work out whether they can be easily attached to an event or to an occurrence record, but I believe that this is already done.

The major difficulty that I foresee is percolating information down the event hierarchy and out to occurrences. It looks like an iterative process.
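To make the iterative nature concrete, here is a minimal in-memory sketch of percolation as a fixed-point loop. It is plain Python rather than a Beam transform, and the function name, the dict-based records and the `inherited_fields` parameter are assumptions for illustration only.

```python
def percolate(events, occurrences, inherited_fields):
    """Fill missing fields from parent events, then from events out to occurrences.

    Records are plain dicts; a field counts as missing when absent or None.
    Returns copies and leaves the inputs untouched.
    """
    by_id = {e["eventID"]: dict(e) for e in events}

    # Repeat passes until one makes no change. A value can only travel one
    # level per pass when children are visited before their parents, so deep
    # hierarchies may need several passes.
    changed = True
    while changed:
        changed = False
        for event in by_id.values():
            parent = by_id.get(event.get("parentEventID"))
            if parent is None:
                continue
            for field in inherited_fields:
                if event.get(field) is None and parent.get(field) is not None:
                    event[field] = parent[field]
                    changed = True

    # Once events are stable, push values out to the linked occurrences.
    enriched = []
    for occ in occurrences:
        occ = dict(occ)
        event = by_id.get(occ.get("eventID"), {})
        for field in inherited_fields:
            if occ.get(field) is None and event.get(field) is not None:
                occ[field] = event[field]
        enriched.append(occ)
    return list(by_id.values()), enriched
```

The loop terminates because every change fills a previously missing field and there are finitely many of those. In a distributed setting the same idea would need either a bounded hierarchy depth or repeated joins, which is where the difficulty lies.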

timrobertson100 commented 2 years ago

> My suggestion is that Survey, SiteVisit, Sample, Observation etc. are all rolled into a common event schema with an eventType and some validation rules on how parent and child events hang together.

Edited, as issue reworked to address this

~Thanks, @charvolant, I think that is more in line with the kind of thing I had anticipated. I know we want to be pragmatic here and move quickly with demonstrations, but perhaps a few things we might consider before implementing the Avro schemas suggested in the original post:~

~Would it make sense to you that we review this together with @tucotuco perhaps?~

> The major difficulty that I foresee is percolating information down the event hierarchy and out to occurrences. It looks like an iterative process.

Yes, for that reason I had also wondered whether Beam would be sufficient for this.

tucotuco commented 2 years ago

Just quickly, because I have to head out for another day in the field: the summary @timrobertson100 gave in the previous comment looks right on target.

timrobertson100 commented 2 years ago

There is now a proposal for the eventType term in DwC

timrobertson100 commented 2 years ago

The schema registry sandbox now has event core with the eventType term. The https://ipt.gbif.org/ installation used for testing has been updated to use it. I have informed the Humboldt task group, who are similarly preparing exemplar datasets while testing their extension. If we have any example datasets that are OK to go on public URLs while we explore this, please ping me and I can load them into the test IPT.

rubenpp7 commented 2 years ago

Hi @timrobertson100

You can explore http://ipt.gbif.pt/ipt/resource.do?r=benthic-alentejo-2011 . Currently this dataset documents its event hierarchy type using the "type" field (common practice in EurOBIS datasets due to the lack of a better fitting DwC term). Ideally, the information in this field ("sample", "station", "cruise") should be moved to the new eventType field.
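A small sketch of what that move could look like for a single event record, assuming records are plain dicts. The function name and the `known_types` parameter are hypothetical; only the three values come from the dataset described above.

```python
def move_type_to_event_type(record, known_types=("sample", "station", "cruise")):
    """Return a copy of an event record with eventType taken from the legacy type field.

    The value is moved only when eventType is not already set and the legacy
    value is one of known_types; anything else is left exactly as it was.
    """
    out = dict(record)
    legacy = (out.get("type") or "").strip()
    if not out.get("eventType") and legacy.lower() in known_types:
        out["eventType"] = legacy
        out.pop("type")
    return out
```

Restricting the move to a known list is a deliberate choice here, so that records using "type" for some other purpose are not rewritten by accident.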

Please let me know if this example dataset fits your purpose :)

timrobertson100 commented 2 years ago

Thank you @rubenpp7 - we had wondered if anyone was doing this already. We'll look at that.

ymgan commented 2 years ago

Hey Tim, Ruben, thank you two so much for this!! I just wanted to leave this issue and the paper here because using the "type" field this way has been around since 2017 and I thought it could be interesting for the use case:

https://github.com/iobis/env-data/issues/4
http://bdj.pensoft.net/articles.php?id=10989&instance_id=3385375

djtfmartin commented 2 years ago

This is an old issue created at the start of the project. It has been split into several more specific issues. Marking as done.