Open javier-molina opened 1 year ago
Analysis from Doug:
Doug 1 day ago The SiteVisit hanging off the dataset is concerning. The rest look right; a by-product of the rather wobbly structure of eBird where there are huge numbers of semi-orphan finds and site visits where the actual occurrences are all lumped together.
Doug 20 hours ago @Dave Martin (ALA) Attached are the child-parent event type stats for the eBird event.csv:
+---------+-----------+-------+
|eventType|p_eventType|count  |
+---------+-----------+-------+
|SiteVisit|Subsurvey  |421824 |
|SiteVisit|Survey     |1736543|
|Subsurvey|Survey     |203704 |
|Survey   |           |16     |
|Find     |SiteVisit  |724    |
|Find     |Subsurvey  |40755  |
|Find     |Survey     |262898 |
+---------+-----------+-------+
Doug 20 hours ago I’m not seeing any SiteVisit that isn’t attached to a Survey or Subsurvey, so I think the load is a bit wonky.
Doug 19 hours ago Occurrences are mostly connected to site visits
Doug 19 hours ago
+---------+--------+
|eventType|count   |
+---------+--------+
|SiteVisit|36850861|
|Find     |304285  |
+---------+--------+
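For anyone wanting to reproduce the child-parent event type stats Doug posted, the idea is to resolve each event's parentEventID to the parent's eventType and count the resulting pairs. The following is a minimal sketch in plain Scala, not the query Doug actually ran: the `Event` case class, its field names, the `HierarchyStats` object and the sample rows are all assumptions made for illustration.

```scala
// Hypothetical minimal model of one event.csv row; the field names are assumptions.
final case class Event(eventID: String, parentEventID: Option[String], eventType: String)

object HierarchyStats {
  /** Counts events per (eventType, parent eventType) pair.
    * The parent type is None when there is no parent or it cannot be resolved. */
  def pairCounts(events: Seq[Event]): Map[(String, Option[String]), Int] = {
    // Index types by eventID; if an eventID is duplicated, the last row wins here.
    val typeById: Map[String, String] = events.map(e => e.eventID -> e.eventType).toMap
    events
      .groupBy(e => (e.eventType, e.parentEventID.flatMap(typeById.get)))
      .map { case (pair, group) => pair -> group.size }
  }

  // Toy data for illustration only, not taken from eBird.
  val sample: Seq[Event] = Seq(
    Event("S1", None, "Survey"),
    Event("SS1", Some("S1"), "Subsurvey"),
    Event("V1", Some("S1"), "SiteVisit"),
    Event("V2", Some("S1"), "SiteVisit"),
    Event("V3", Some("SS1"), "SiteVisit"),
    Event("F1", Some("V1"), "Find"),
    Event("V4", Some("missing"), "SiteVisit") // parent is not in the file
  )

  def main(args: Array[String]): Unit =
    pairCounts(sample).foreach { case ((child, parent), n) =>
      val parentType = parent.getOrElse("")
      println(s"$child\t$parentType\t$n")
    }
}
```

A row with an empty parent type (like the 16 Survey rows in the stats) falls into the `None` bucket, and so would any event whose parent cannot be found, which makes that bucket the one to watch for orphans.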
Digging a bit deeper, not all SiteVisit events are attached to Survey events after interpretation.
I think this is because the UniqueIdTransform filters duplicates based on ID, and duplicate events are present in the original DwCA:
scala> spark.read.format("avro").load("/pipelines-data/dr2009/1/verbatim/*.avro").select(
"id",
"coreId",
"coreTerms.`http://rs.tdwg.org/dwc/terms/eventID`",
"coreTerms.`http://rs.tdwg.org/dwc/terms/parentEventID`"
).where("id='G2538453'").show(false)
+--------+------+------------------------------------+------------------------------------------+
|id |coreId|http://rs.tdwg.org/dwc/terms/eventID|http://rs.tdwg.org/dwc/terms/parentEventID|
+--------+------+------------------------------------+------------------------------------------+
|G2538453|null |G2538453 |6f6069ff-6cc6-49b8-a7fb-3c7d85389c68 |
|G2538453|null |G2538453 |a326f354-8344-460c-b329-26a1c407446a |
+--------+------+------------------------------------+------------------------------------------+
These records are filtered out by the UniqueIdTransform as part of the interpretation:
scala> spark.read.format("avro").load("/pipelines-data/dr2009/1/event/verbatim/*.avro").select(
"id",
"coreId",
"coreTerms.`http://rs.tdwg.org/dwc/terms/eventID`",
"coreTerms.`http://rs.tdwg.org/dwc/terms/parentEventID`"
).where("id='G2538453'").show(false)
+---+------+------------------------------------+------------------------------------------+
|id |coreId|http://rs.tdwg.org/dwc/terms/eventID|http://rs.tdwg.org/dwc/terms/parentEventID|
+---+------+------------------------------------+------------------------------------------+
+---+------+------------------------------------+------------------------------------------+
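To make the knock-on effect concrete, here is a small plain-Scala model of what the two queries above suggest: every record sharing a duplicated id disappears, so any child pointing at that id is left without a parent. This is a sketch of the assumed behaviour, not the pipeline's UniqueIdTransform code. `VerbatimEvent` and `UniqueIdSketch` are hypothetical names, the `site-visit-1` child is invented, and treating the two parentEventID values as records present in the file is also an assumption.

```scala
// Hypothetical, stripped-down stand-in for a verbatim event record.
final case class VerbatimEvent(id: String, parentEventID: Option[String])

object UniqueIdSketch {
  /** Keeps only records whose id occurs exactly once.
    * Assumed behaviour, inferred from the empty query result above. */
  def dropDuplicateIds(records: Seq[VerbatimEvent]): Seq[VerbatimEvent] = {
    val countById = records.groupBy(_.id).map { case (id, rs) => id -> rs.size }
    records.filter(r => countById(r.id) == 1)
  }

  /** Records whose parentEventID points at an id that is not present. */
  def orphans(records: Seq[VerbatimEvent]): Seq[VerbatimEvent] = {
    val ids = records.map(_.id).toSet
    records.filter(_.parentEventID.exists(p => !ids.contains(p)))
  }

  val sample: Seq[VerbatimEvent] = Seq(
    VerbatimEvent("6f6069ff-6cc6-49b8-a7fb-3c7d85389c68", None),
    VerbatimEvent("a326f354-8344-460c-b329-26a1c407446a", None),
    // The duplicated Group ID from the DwCA, once per parent.
    VerbatimEvent("G2538453", Some("6f6069ff-6cc6-49b8-a7fb-3c7d85389c68")),
    VerbatimEvent("G2538453", Some("a326f354-8344-460c-b329-26a1c407446a")),
    // Invented child that hangs off the duplicated event.
    VerbatimEvent("site-visit-1", Some("G2538453"))
  )

  def main(args: Array[String]): Unit = {
    val kept = dropDuplicateIds(sample)
    println(s"kept ids: ${kept.map(_.id)}")
    println(s"orphaned ids: ${orphans(kept).map(_.id)}")
  }
}
```

In this model nothing is orphaned before the filter runs, and the invented child becomes an orphan afterwards, which matches the pattern of SiteVisit events losing their parent after interpretation.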
cc @charvolant
These are 'Group IDs' matched onto Subsurvey. The different parentEventID (Survey) suggests that they're being duplicated across surveys. I'll look at disambiguating them.
eBird is looking as expected in the UI now.
From Dave on Slack
Doug, I'm not sure the event hierarchy is quite right for eBird.