AtlasOfLivingAustralia / extended-data-model


Review eBird dataset contents #90

Open javier-molina opened 1 year ago

javier-molina commented 1 year ago

From Dave on Slack

Doug, I'm not sure the event hierarchy is quite right for eBird.

image

javier-molina commented 1 year ago

Analysis from Doug:

Doug 1 day ago The SiteVisit hanging off the dataset is concerning. The rest look right; a by-product of the rather wobbly structure of eBird where there are huge numbers of semi-orphan finds and site visits where the actual occurrences are all lumped together.

Doug 20 hours ago @Dave Martin (ALA) Attached are the child-parent event type stats for the eBird event.csv:

+---------+-----------+-------+
|eventType|p_eventType|count  |
+---------+-----------+-------+
|SiteVisit|Subsurvey  |421824 |
|SiteVisit|Survey     |1736543|
|Subsurvey|Survey     |203704 |
|Survey   |           |16     |
|Find     |SiteVisit  |724    |
|Find     |Subsurvey  |40755  |
|Find     |Survey     |262898 |
+---------+-----------+-------+

Doug 20 hours ago I'm not seeing any SiteVisit that isn't attached to a Survey or Subsurvey, so I think the load is a bit wonky.

Doug 19 hours ago Occurrences are mostly connected to site visits:

Doug 19 hours ago

+---------+--------+
|eventType|count   |
+---------+--------+
|SiteVisit|36850861|
|Find     |304285  |
+---------+--------+
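For anyone wanting to reproduce the child-parent event type stats above, here is a minimal sketch in plain Scala (no Spark). The `Event` case class and the `EventTypeStats` object are hypothetical names made up for illustration. The sketch assumes each row carries an eventID, an optional parentEventID and an eventType, and that eventID is unique in the input, which is exactly the assumption that turns out not to hold for eBird.

```scala
// Hypothetical minimal event row; field names mirror the Darwin Core terms in the thread.
final case class Event(eventID: String, parentEventID: Option[String], eventType: String)

object EventTypeStats {
  /** Counts events per (eventType, parent eventType) pair.
    * A parent type of None means the event has no parent,
    * or its parentEventID does not match any eventID in the input.
    */
  def childParentCounts(events: Seq[Event]): Map[(String, Option[String]), Int] = {
    // Assumes eventID is unique; if an ID is duplicated, the last row wins in this lookup.
    val typeById: Map[String, String] = events.map(e => e.eventID -> e.eventType).toMap
    events
      .groupBy(e => (e.eventType, e.parentEventID.flatMap(typeById.get)))
      .map { case (key, rows) => key -> rows.size }
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample rows, not taken from the eBird dataset.
    val sample = Seq(
      Event("s1", None, "Survey"),
      Event("ss1", Some("s1"), "Subsurvey"),
      Event("v1", Some("ss1"), "SiteVisit"),
      Event("v2", Some("s1"), "SiteVisit"),
      Event("f1", Some("v1"), "Find")
    )
    childParentCounts(sample).toSeq
      .sortBy { case ((eventType, parentType), _) => (eventType, parentType.getOrElse("")) }
      .foreach { case ((eventType, parentType), n) =>
        println(s"$eventType\t${parentType.getOrElse("")}\t$n")
      }
  }
}
```

A SiteVisit whose parent type comes back as `None` would correspond to the "SiteVisit hanging off the dataset" case raised at the top of this issue.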

djtfmartin commented 1 year ago

Digging a bit deeper, not all SiteVisit events are attached to Survey events after interpretation. I think this is because the UniqueIdTransform filters duplicates based on ID, and duplicate events are present in the original DwCA:

scala> spark.read.format("avro").load("/pipelines-data/dr2009/1/verbatim/*.avro").select(
  "id", 
  "coreId", 
  "coreTerms.`http://rs.tdwg.org/dwc/terms/eventID`", 
  "coreTerms.`http://rs.tdwg.org/dwc/terms/parentEventID`"
).where("id='G2538453'").show(false)

+--------+------+------------------------------------+------------------------------------------+
|id      |coreId|http://rs.tdwg.org/dwc/terms/eventID|http://rs.tdwg.org/dwc/terms/parentEventID|
+--------+------+------------------------------------+------------------------------------------+
|G2538453|null  |G2538453                            |6f6069ff-6cc6-49b8-a7fb-3c7d85389c68      |
|G2538453|null  |G2538453                            |a326f354-8344-460c-b329-26a1c407446a      |
+--------+------+------------------------------------+------------------------------------------+

These records are filtered out during interpretation by the UniqueIdTransform:

scala> spark.read.format("avro").load("/pipelines-data/dr2009/1/event/verbatim/*.avro").select(
   "id", 
   "coreId", 
   "coreTerms.`http://rs.tdwg.org/dwc/terms/eventID`", 
   "coreTerms.`http://rs.tdwg.org/dwc/terms/parentEventID`"
 ).where("id='G2538453'").show(false)

+---+------+------------------------------------+------------------------------------------+
|id |coreId|http://rs.tdwg.org/dwc/terms/eventID|http://rs.tdwg.org/dwc/terms/parentEventID|
+---+------+------------------------------------+------------------------------------------+
+---+------+------------------------------------+------------------------------------------+

cc @charvolant
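To illustrate the effect described above, here is a minimal sketch in plain Scala of ID-based duplicate filtering. It is not the actual UniqueIdTransform code: `VerbatimEvent`, `DuplicateIds` and `filterUniqueIds` are hypothetical names, and the rule that every record sharing a clashing ID is dropped is an assumption based on the empty query result above. The `G0000001` record and its parent are made up for contrast.

```scala
// Hypothetical stand-in for a verbatim record; not the pipelines' own record class.
final case class VerbatimEvent(id: String, parentEventID: String)

object DuplicateIds {
  /** Returns (kept, dropped).
    * Exact duplicate records are collapsed first; an ID is kept only if
    * exactly one distinct record remains for it (assumed behaviour).
    */
  def filterUniqueIds(records: Seq[VerbatimEvent]): (Seq[VerbatimEvent], Seq[VerbatimEvent]) = {
    val groups = records.distinct.groupBy(_.id).values.toSeq
    val (unique, clashing) = groups.partition(_.size == 1)
    (unique.flatten, clashing.flatten)
  }

  def main(args: Array[String]): Unit = {
    val records = Seq(
      // The two rows shown in the verbatim output above.
      VerbatimEvent("G2538453", "6f6069ff-6cc6-49b8-a7fb-3c7d85389c68"),
      VerbatimEvent("G2538453", "a326f354-8344-460c-b329-26a1c407446a"),
      // Made-up record with a unique ID, for contrast.
      VerbatimEvent("G0000001", "hypothetical-parent")
    )
    val (kept, dropped) = filterUniqueIds(records)
    println(s"kept:    ${kept.map(_.id).mkString(", ")}")
    println(s"dropped: ${dropped.map(_.id).mkString(", ")}")
  }
}
```

Under that assumption, any SiteVisit whose parentEventID pointed at `G2538453` loses its parent after interpretation, which would explain the detached SiteVisit events.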

charvolant commented 1 year ago

These are 'Group IDs' matched onto Subsurvey. The different parentEventID (Survey) suggests that they're being duplicated across surveys. I'll look at disambiguating them.
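One possible way to disambiguate, sketched in plain Scala purely as an illustration (this is not necessarily the approach that was eventually taken): prefix an eventID with its parentEventID whenever the same eventID appears under more than one parent. `GroupEvent`, `Disambiguate` and the `parent:id` format are all hypothetical.

```scala
// Hypothetical minimal row for a Subsurvey built from an eBird group ID.
final case class GroupEvent(eventID: String, parentEventID: String)

object Disambiguate {
  /** Rewrites eventIDs that occur under several parents as "<parentEventID>:<eventID>".
    * IDs that occur under a single parent are left unchanged.
    */
  def disambiguate(events: Seq[GroupEvent]): Seq[GroupEvent] = {
    // Number of distinct parents seen for each eventID.
    val parentsPerId: Map[String, Int] =
      events.groupBy(_.eventID).map { case (id, rows) =>
        id -> rows.map(_.parentEventID).distinct.size
      }
    events.map { e =>
      if (parentsPerId(e.eventID) > 1) e.copy(eventID = s"${e.parentEventID}:${e.eventID}")
      else e
    }
  }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      GroupEvent("G2538453", "6f6069ff-6cc6-49b8-a7fb-3c7d85389c68"),
      GroupEvent("G2538453", "a326f354-8344-460c-b329-26a1c407446a")
    )
    disambiguate(events).foreach(e => println(e.eventID))
  }
}
```

Child events that reference the group ID as their parentEventID would need the same rewrite, which requires knowing which Survey each child belongs to; the sketch does not cover that step.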

djtfmartin commented 1 year ago

eBird is looking as expected in the UI now.