gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Event core w. occurrence extension - inheritance of occurrence search parameters? #878

Open CecSve opened 1 year ago

CecSve commented 1 year ago

This issue is based on a question we recently got on help desk, where a publisher published a DwC-A based on an event core, with parent and child events, and with occurrences as an extension: https://www.gbif.org/dataset/3bf0d17c-98a9-41a5-af43-5fe2719cc19b. The issue is also described here: https://github.com/iobis/manual/issues/73.

For the publisher, it makes sense to have the coordinates associated with the parent event, to be inherited by the unpopulated child events. However, the occurrences do not inherit the coordinate information and therefore appear to have no coordinates in the UI and in searches: image

The whole issue is visualized here: image

How could we make sure the information is not lost in searches and in the UI? Would either of these options be plausible, or would another option be possible?

  1. Can occurrence-relevant fields be inherited from both parent events and child events in the interpretation process? In this case, coordinates would come from parent events and dates from child events when the dataset is ingested.

Potential issue: I am not sure about the mechanism here, but would this generate artificially big downloads for users, with redundant information across files in the archive?

  2. Can occurrence-relevant fields be inherited from both parent events and child events in the front end? This would not create multiple fields in the archives, but would still make the occurrences searchable and visible. EDIT: this does not seem like a plausible solution because the UI depends on the API, right?

Potential issue: how would this affect users who download data through the API? Would the associated occurrences still appear to lack coordinates etc.?

@fmendezh @ManonGros @MortenHofft

tucotuco commented 1 year ago

In case it has any bearing on decisions, in the Unified Model an Occurrence is an Event. The Location and time information is expected to be populated if the Location and time components are any different from the immediate parent. If they are not populated, since they must, by definition, lie within the spatiotemporal confines of the immediate parent, a parent Location and time could be propagated downward with a flag saying that it was done, with the understanding that child Locations and times may not be as specific in the shared data as they were in the original. The same applies all of the way up the Event hierarchy.

CecSve commented 1 year ago

Thanks @tucotuco - I expect the event interpretation will go into prod before the new data model, so this issue was mostly created to make sure those changes make sense for publishers and users with the new implementation. However, I have a follow-up question about what you wrote:

The Location and time information is expected to be populated if the Location and time components are any different from the immediate parent.

So Location information of the child will be overwritten with the information of the parent even though they contain their own Location information?

timrobertson100 commented 1 year ago

The Location and time information is expected to be populated if the Location and time components are any different from the immediate parent.

So Location information of the child will be overwritten with the information of the parent even though they contain their own Location information?

I think John only means that a data publisher would be expected to populate the location/time of any sub-events that differ from parents. The pipelines can copy parent event data in, only when it is null on the child - i.e. not overwrite.
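The non-overwriting copy described above could be sketched roughly as follows. This is only an illustration in Python, not the pipelines' actual implementation: the `inherit_from_parents` function, the `INHERITABLE` field list, and the `inherited` flag key are all assumptions made for the example, and which fields should be inheritable is exactly what this thread leaves open.

```python
# Hypothetical list: the thread does not settle which fields may be inherited.
INHERITABLE = ("decimalLatitude", "decimalLongitude", "eventDate")


def inherit_from_parents(events):
    """Return copies of events with empty inheritable fields filled from the nearest ancestor.

    `events` maps eventID -> dict of Darwin Core terms. A value of None or ""
    counts as empty. Values already present on a child are never overwritten.
    """
    result = {}
    for event_id, record in events.items():
        filled = dict(record)
        inherited = []
        for field in INHERITABLE:
            if filled.get(field) not in (None, ""):
                continue  # the child has its own value: keep it
            # Walk up the parentEventID chain until a value is found.
            seen = {event_id}  # guards against accidental cycles
            parent_id = record.get("parentEventID")
            while parent_id in events and parent_id not in seen:
                seen.add(parent_id)
                value = events[parent_id].get(field)
                if value not in (None, ""):
                    filled[field] = value
                    inherited.append(field)
                    break
                parent_id = events[parent_id].get("parentEventID")
        filled["inherited"] = inherited  # flag recording what was propagated
        result[event_id] = filled
    return result


events = {
    "cruise": {"decimalLatitude": 55.7, "decimalLongitude": 12.6, "eventDate": "2021-06"},
    "station": {"parentEventID": "cruise", "eventDate": "2021-06-01"},
    "sample": {"parentEventID": "station"},
}
out = inherit_from_parents(events)
```

Here `station` keeps its own `eventDate` and only picks up the coordinates from `cruise`, while `sample` receives the date from `station` and the coordinates from `cruise`. The `inherited` list plays the role of the flag mentioned earlier in the thread.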

tucotuco commented 1 year ago

Exactly what @timrobertson100 said.

timrobertson100 commented 1 year ago

I've been thinking about this and can't shake a feeling it might trip us up. I'll try and explain to prompt further discussion.

I'm worried that we haven't been sufficiently clear in the documentation about the expectations when using parentEventID. In most cases, it's natural to assume the inheritance of information is reasonable (i.e. nested set theory) as we're doing here. However, there are many fields on the parent object and we haven't stated explicitly which can be inherited (e.g. fieldNotes?).

When we introduce inheritance it becomes impossible to nullify data on the child records. I recall a discussion of an idea to model a tracking dataset where a parent event represented the deployment of a tracking device (with date and location), with child events capturing the location as the organism moves. Here nested set theory breaks and additionally, it was noted that sometimes due to false readings, locations on a child sampling event need to be nullified with a remark as to why. What is proposed here would then copy in the wrong location on those children. I can't think of another example beyond movement that might exhibit this, but we know from experience that people do creative things with DwC so it might pop up again.

It's likely contrary to others but I would probably err on the side of caution if it were my data and be explicit on the child records in a DwC-A (i.e. repeat information) wherever possible to minimize reliance on interpretation. File-level compression would take care of the data sizes.

tucotuco commented 1 year ago

Not contrary to my way of thinking. Explicit trumps file size. Also, I don't think your example is the only one. As we have learned with the modelling work, having more than one way of doing things is a recipe for confusion and best avoided if possible.

CecSve commented 1 year ago

Thanks @timrobertson100 and @tucotuco for expanding on why it would be troublesome to introduce too much inheritance of fields across data tables. In our follow-up discussion during the technical support hour for nodes, some concerns were raised about the point of having parent and child events if (some relevant) information could not be inherited by occurrences; adding the information multiple times would mean more work for publishers and nodes.

It could be worth investigating, and clearly describing, which information could be inherited OR at least make it clear which information should go where and why - in relation to your statement here, @timrobertson100:

I'm worried that we haven't been sufficiently clear in the documentation about the expectations when using parentEventID. In most cases, it's natural to assume the inheritance of information is reasonable (i.e. nested set theory) as we're doing here. However, there are many fields on the parent object and we haven't stated explicitly which can be inherited (e.g. fieldNotes?).

Whether it is in the context of the new data model or the current publishing system.

jdpye commented 1 year ago

I've been thinking about this and can't shake a feeling it might trip us up. I'll try and explain to prompt further discussion.

I'm worried that we haven't been sufficiently clear in the documentation about the expectations when using parentEventID. In most cases, it's natural to assume the inheritance of information is reasonable (i.e. nested set theory) as we're doing here. However, there are many fields on the parent object and we haven't stated explicitly which can be inherited (e.g. fieldNotes?).

When we introduce inheritance it becomes impossible to nullify data on the child records. I recall a discussion of an idea to model a tracking dataset where a parent event represented the deployment of a tracking device (with date and location), with child events capturing the location as the organism moves. Here nested set theory breaks and additionally, it was noted that sometimes due to false readings, locations on a child sampling event need to be nullified with a remark as to why. What is proposed here would then copy in the wrong location on those children. I can't think of another example beyond movement that might exhibit this, but we know from experience that people do creative things with DwC so it might pop up again.

It's likely contrary to others but I would probably err on the side of caution if it were my data and be explicit on the child records in a DwC-A (i.e. repeat information) wherever possible to minimize reliance on interpretation. File-level compression would take care of the data sizes.

I was thinking about this a little, mostly from the technical perspective of a data provider. I like the idea of inheritance by default, except where the user overrides it. But what if we provide a 'null' option for these attributes, not to indicate that we -want- an inherited event to dictate location/time, but to indicate that we want to prevent the inheritance of location at that sub-event/occurrence? The lack of a field would be the indicator that tells us to inherit.

The structure remains the same, the default behaviour is the one everyone assumes, and it is arguably not too terrible a technical lift, while the publishers and data providers who are deep in the datasets can indicate the proper behaviour where they know better.
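As a rough sketch of that distinction, assuming records are held as Python dicts: a key that is absent means "inherit", while a key that is present with the value `None` means "deliberately nullified, do not inherit". The `resolve` helper is hypothetical, and how an explicit null would actually be written in a DwC-A column is not settled in this thread.

```python
def resolve(field, child, parent):
    """Resolve one field under the 'explicit null blocks inheritance' idea.

    Key absent on the child   -> inherit the parent's value (may be None).
    Key present, value None   -> the publisher nullified it: do not inherit.
    Key present with a value  -> keep the child's own value.
    """
    if field in child:
        return child[field]
    return parent.get(field)


deployment = {"decimalLatitude": 44.6}

inherited = resolve("decimalLatitude", {}, deployment)                          # field absent
nullified = resolve("decimalLatitude", {"decimalLatitude": None}, deployment)   # explicit null
own_value = resolve("decimalLatitude", {"decimalLatitude": 44.9}, deployment)   # child value
```

With this rule, the false-reading case from the tracking example could be expressed by sending an explicit null on the child event, so the deployment location would not be copied in.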

albenson-usgs commented 1 year ago

We were discussing this a bit at a working group meeting today and I'm struggling with what parentEvent provides if you can't inherit the information for the child items. This is how OBIS handles this and how it's documented in the OBIS manual; also see here. It's problematic that the data end up looking different on GBIF vs. OBIS because of the difference in interpretation.

ymgan commented 1 year ago

It's problematic that the data end up looking different on GBIF vs. OBIS because of the difference in interpretation.

I completely agree!

I also think that this is a REAL challenge! I often see accidentally incremented sequences because the data provider dragged the value down the column when using Excel. Having inheritance take care of this could reduce the burden on data providers and reduce human errors.

On the other hand, a null value can also mean different things. For example, I use footprintWKT for the cruise track or bounding box of the voyage at the parent Event. For subsequent child Events, I use decimalLatitude, decimalLongitude and coordinateUncertaintyInMeters to represent the coordinates (point data) of the sampling site, so I usually leave footprintWKT blank because I think these 3 fields are sufficient to represent the point data. In this case, I see footprintWKT as "unnecessary to be filled in" and leave it blank. If footprintWKT for the child Events is inherited from the parent Event because it is blank, this will lead to inconsistency within the child Event records: the decimalLatitude, decimalLongitude and coordinateUncertaintyInMeters are NOT the centroid and cUIM of the inherited footprintWKT of the cruise track/bounding box of the voyage, but simply the coordinates and cUIM of the sampling site.
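One conceivable way to avoid this particular inconsistency, offered only as a sketch, is to treat the location terms as a unit and copy them from the parent only when the child has no location information at all. The `LOCATION_TERMS` grouping and the `inherit_location` function are assumptions for illustration; nothing in this thread defines which terms belong together.

```python
# Hypothetical grouping; the thread does not define which terms belong together.
LOCATION_TERMS = (
    "decimalLatitude",
    "decimalLongitude",
    "coordinateUncertaintyInMeters",
    "footprintWKT",
)


def has_value(record, term):
    """True if the term is present and neither None nor an empty string."""
    return record.get(term) not in (None, "")


def inherit_location(child, parent):
    """Copy the parent's location terms only if the child has no location at all."""
    if any(has_value(child, term) for term in LOCATION_TERMS):
        return dict(child)  # the child describes its own location: leave it untouched
    merged = dict(child)
    for term in LOCATION_TERMS:
        if has_value(parent, term):
            merged[term] = parent[term]
    return merged


voyage = {"footprintWKT": "POLYGON ((10 50, 12 50, 12 52, 10 52, 10 50))"}
site = {"decimalLatitude": 51.2, "decimalLongitude": 11.4, "coordinateUncertaintyInMeters": 30}
unlocated = {"eventDate": "2021-06-01"}

site_out = inherit_location(site, voyage)
unlocated_out = inherit_location(unlocated, voyage)
```

In this sketch the sampling site keeps its point coordinates without picking up the voyage footprint, while an event with no location of its own receives the parent's footprint.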

The other challenge arises when a user comes to GBIF/OBIS to download data that could comprise a subset of records (which may exclude the parent Event) from multiple datasets. If aggregators like GBIF/OBIS do not fill in the "inherited values" at the child Events and they are blank, then I am not sure whether users will actually download the source archive of the dataset to look at the parent Event and determine which fields should inherit the values. If the aggregators fill in the "inherited values", then this has to be done correctly to avoid inconsistency within the rows. So in this case, it may be better to have as many fields as possible filled out by the data provider/publisher.

I see both sides, and it is not clear which fields inheritance should be applied to. Hence I appreciate @CecSve's comment:

It could be worth investigating, and clearly describing, which information could be inherited OR at least make it clear which information should go where and why ...