gbif / model-material

Data model research focused on richer data for a material catalogue
7 stars 8 forks source link

Specify: Event data missing eventType #81

Closed timrobertson100 closed 1 year ago

timrobertson100 commented 1 year ago

The latest EVENT.csv does not load due to a missing eventType

From row 11,241 there seems to be missing data

bb2deddf-98cb-a120-1451-66e1e4a20878,b3cf407a-29af-44f2-af3c-f9a9d3fbe6c3,https://www.gbif.org/dataset/8f79c802-a58c-447f-99aa-1d6a0790825a,,,,,,,,,,,,,,,,,,,,,,,,,
b22a5d6e-469c-967c-6fef-96b6888df23a,ba351336-b24d-4cab-a2cb-4177e8190e94,https://www.gbif.org/dataset/8f79c802-a58c-447f-99aa-1d6a0790825a,,,,,,,,,,,,,,,,,,,,,,,,,
6179fef0-86ab-2e05-3219-94e118831cd7,c88ee689-5f2b-4601-822c-7a3249d32ce5,https://www.gbif.org/dataset/8f79c802-a58c-447f-99aa-1d6a0790825a,,,,,,,,,,,,,,,,,,,,,,,,,

Could you please look at this @acbentley?

acbentley commented 1 year ago

Hmm, not sure what those rows at the bottom were. I have removed them.

acbentley commented 1 year ago

File has been replaced

timrobertson100 commented 1 year ago

Thanks @acbentley but the IDs are in the occurrence file as well so without them the occurrence file becomes invalid.

If those are to be removed, they need to be removed from occurrence and anything linking to those occurrences would need to be removed.

acbentley commented 1 year ago

The IDs appear to be duplicated in the data above though so should still link correctly.

timrobertson100 commented 1 year ago

I'm afraid not - it removes IDs that are used in the occurrence table. The error provides an example key:

psql:load-specify.sql:27: ERROR:  insert or update on table "occurrence" violates foreign key constraint "occurrence_occurrence_id_fkey"
DETAIL:  Key (occurrence_id)=(bb2deddf-98cb-a120-1451-66e1e4a20878) is not present in table "event".
acbentley commented 1 year ago

OK, I have a note indicating that events needed to be added for all occurrences from the occurrence file. I have added these back in and transformed them. Does that work now?

timrobertson100 commented 1 year ago

I'm afraid not. Now there are duplicate records in Event.csv

psql:load-specify.sql:21: ERROR:  duplicate key value violates unique constraint "event_pkey"
DETAIL:  Key (event_id)=(b3cf407a-29af-44f2-af3c-f9a9d3fbe6c3) already exists.

Looking at the file, that key is in many times (with different location keys):

tsj442@YN13987 gbif % cat ../specify/Event/EVENT.csv|grep b3cf407a-29af-44f2-af3c-f9a9d3fbe6c3
b3cf407a-29af-44f2-af3c-f9a9d3fbe6c3,,https://www.gbif.org/dataset/8f79c802-a58c-447f-99aa-1d6a0790825a,a17a1150-49ac-4d8c-9466-fa61aa3cffe4,,OCCURRENCE,,001-RFM-04,5/11/2004,2004,5,11,,,,,,39:31.18 N,96:19.58 W,,,,,,,,,
b3cf407a-29af-44f2-af3c-f9a9d3fbe6c3,,https://www.gbif.org/dataset/8f79c802-a58c-447f-99aa-1d6a0790825a,bb2deddf-98cb-a120-1451-66e1e4a20878,,OCCURRENCE,,,,,,,,,,,,,,,,,,,,,,
b3cf407a-29af-44f2-af3c-f9a9d3fbe6c3,,https://www.gbif.org/dataset/8f79c802-a58c-447f-99aa-1d6a0790825a,0774f440-2bbe-c638-dfe0-cb843297e8ab,,OCCURRENCE,,,,,,,,,,,,,,,,,,,,,,
b3cf407a-29af-44f2-af3c-f9a9d3fbe6c3,,https://www.gbif.org/dataset/8f79c802-a58c-447f-99aa-1d6a0790825a,e1876adb-12bf-a52a-9e92-661b9423ebae,,OCCURRENCE,,,,,,,,,,,,,,,,,,,,,,
b3cf407a-29af-44f2-af3c-f9a9d3fbe6c3,,https://www.gbif.org/dataset/8f79c802-a58c-447f-99aa-1d6a0790825a,5b19f908-d5f6-8324-a6ae-19de53aed9e2,,OCCURRENCE,,,,,,,,,,,,,,,,,,,,,,
b3cf407a-29af-44f2-af3c-f9a9d3fbe6c3,,https://www.gbif.org/dataset/8f79c802-a58c-447f-99aa-1d6a0790825a,ca688bea-d19c-62e1-59e3-8afb97426473,,OCCURRENCE,,,,,,,,,,,,,,,,,,,,,,
b3cf407a-29af-44f2-af3c-f9a9d3fbe6c3,,https://www.gbif.org/dataset/8f79c802-a58c-447f-99aa-1d6a0790825a,e5b8e92c-f2a9-69c7-3d9c-15e71173b156,,OCCURRENCE,,,,,,,,,,,,,,,,,,,,,,
b3cf407a-29af-44f2-af3c-f9a9d3fbe6c3,,https://www.gbif.org/dataset/8f79c802-a58c-447f-99aa-1d6a0790825a,dc0e3767-f15d-61e6-15b1-0c02401a7353,,OCCURRENCE,,,,,,,,,,,,,,,,,,,,,,
b3cf407a-29af-44f2-af3c-f9a9d3fbe6c3,,https://www.gbif.org/dataset/8f79c802-a58c-447f-99aa-1d6a0790825a,0e455b50-bdd7-a756-4fd3-7f3a00ef0057,,OCCURRENCE,,,,,,,,,,,,,,,,,,,,,,
b3cf407a-29af-44f2-af3c-f9a9d3fbe6c3,,https://www.gbif.org/dataset/8f79c802-a58c-447f-99aa-1d6a0790825a,ccd81ea8-19a3-c6a2-2cd2-7ffb92154353,,OCCURRENCE,,,,,,,,,,,,,,,,,,,,,,
b3cf407a-29af-44f2-af3c-f9a9d3fbe6c3,,https://www.gbif.org/dataset/8f79c802-a58c-447f-99aa-1d6a0790825a,f92b5c1c-2d7e-cd38-533d-732f0be010bd,,OCCURRENCE,,,,,,,,,,,,,,,,,,,,,,
b3cf407a-29af-44f2-af3c-f9a9d3fbe6c3,,https://www.gbif.org/dataset/8f79c802-a58c-447f-99aa-1d6a0790825a,3b54566a-0a3a-712a-8cd8-a77174ab1496,,OCCURRENCE,,,,,,,,,,,,,,,,,,,,,,
b3cf407a-29af-44f2-af3c-f9a9d3fbe6c3,,https://www.gbif.org/dataset/8f79c802-a58c-447f-99aa-1d6a0790825a,ad4b891a-2b92-2ff8-5c67-a4c2cedeed03,,OCCURRENCE,,,,,,,,,,,,,,,,,,,,,,
acbentley commented 1 year ago

Yes, because the event is connected to multiple localities in a 1:M relationship in Specify.

acbentley commented 1 year ago

I am seriously confused now. Take a look at the occurrence file and let me know how to add occurrences to the event file, please. What does the occurrence_id equate to and am I supposed to create event_ids?

timrobertson100 commented 1 year ago

So an Occurrence in this model is a subtype of Event (i.e. an inheritance model, similar to the Entity and MaterialEntity). What this means is that for every Occurrence record, there must be an equivalent Event record having the same ID.

because the event is connected to multiple localities in a 1:M relationship in Specify

I don't think this model will accommodate that. An Event is something that happens within a place and time. I don't think an event can span multiple locations.

acbentley commented 1 year ago

Sorry, the other way around. The same locality is connected to multiple events - the same place is collected multiple times potentially by multiple people. So, the occurrence_id becomes the event_id in this scenario? Do I then need to connect these to the locality_ids? Is this still an event_type = OCCURRENCE?

timrobertson100 commented 1 year ago

Thanks @acbentley - that makes more sense.

So if I understand correctly, you'd have:

  1. Your Location records for where things were observed or collected
  2. Many Event records each with their own ID, time information (when the event occurred) and by whom - there can be many events at the same Location (same locationID)
  3. Those are events of Organism(s) at a place and time so they will have eventType of Occurrence
  4. There will be an Occurrence entry with the same ID as the Event (required for the inheritance model)
  5. For the Occurrence events where physical material was collected - not just observed - there would be an entry in the OccurrenceEvidence that joins the Occurrence to the Entity that was preserved.

I hope this makes some sense, but please do ask if not.

acbentley commented 1 year ago

Somewhat:

  1. Your Location records for where things were observed or collected

Yes, I have a location file that has my locality table IDs

  1. Many Event records each with their own ID, time information (when the event occurred) and by whom - there can be many events at the same Location (same locationID)

Yes, I have an event file that links Collection event (event_id) to location (locality_id)

  1. Those are events of Organism(s) at a place and time so they will have eventType of Occurrence
  2. There will be an Occurrence entry with the same ID as the Event (required for the inheritance model)

Correct. The occurrence file links an event_id to an organism_id

  1. For the Occurrence events where physical material was collected - not just observed - there would be an entry in the OccurrenceEvidence that joins the Occurrence to the Entity that was preserved.

This is where I am still a little lost. How do I map the IDs in the occurrence file to those in the event file in order to add them? Does the occurrence_id (which I had to create) become the event_id? What then is the location_id and what event_type is it?

Sorry for so many dumb questions but I think I am having issues resolving our specimen-based model to the UM event-based model.

tucotuco commented 1 year ago

I want to leave out the OccurrenceEvidence from this response. I'll get to that in another comment.

I think the issue with the Events is actually simple:

  1. Each Specify collecting event is a GUM Event. These look present and correct in EVENT.csv.
  2. Each Specify collection object is a GUM Organism/MaterialEntity/Entity. These look correct in the three corresponding csv files.
  3. For every Specify collection object there is a GUM Occurrence/Event with manufactured occurrenceIDs, because those don't exist in Specify. These look correct in the OCCURRENCE.csv table.
  4. The Event records for the Occurrences in the OCCURRENCE.csv file in the EVENT.csv file, but they are defective. They were actually more correct in the previous version of the EVENT.csv file. Every Occurrence record requires an Event record with the eventID=occurrenceID. I believe that right now you have all of the records, but you put the Specify collecting event ID in the eventID field for them instead of the occurrenceID. The values you put in the eventID for the Events that are Occurrences should go into the parentEventID instead of the eventID. That will make your collecting events the parents of the Occurrences that happened there. The locationIDs, verbatim fields, etc. for the Event records that are Occurrences would just be the same those of their parents Events. The eventType for the Event records that are Occurrences must be 'OCCURRENCE'. The eventType for the Event records that are collecting events in Specify should be "COLLECTING_EVENT" rather than "OCCURRENCE", which is what they have right now.
tucotuco commented 1 year ago

Sorry for so many dumb questions but I think I am having issues resolving our specimen-based model to the UM event-based model.

No worries. I think you had it mostly right in a previous export.

  1. For the Occurrence events where physical material was collected - not just observed - there would be an entry in the OccurrenceEvidence that joins the Occurrence to the Entity that was preserved.

This is where I am still a little lost. How do I map the IDs in the occurrence file to those in the event file in order to add them? Does the occurrence_id (which I had to create) become the event_id? What then is the location_id and what event_type is it?

The OccurrenceEvidence table is meant to allow you to say what evidence there is to assert the Occurrence. Since it links Occurrence to Entity, the evidence can be based on MaterialEntities and/or on DigitalEntities.

At least in this dataset, there are no parts of collection objects that are currently shared as MaterialEntities/Entities derived from the Organisms. This could happen, by a) integrating the tissue samples, and b) parsing the preparations. Of these tow, a) would be the most interesting to actually capture and demonstrate. Since there are (currently) no MaterialEntities other than the Organisms in this data set, the OccurrenceEvidence consists only of Organisms and the entries in the OccurrenceEvidence would consist of same occurrenceID, organismID combinations as are in the Occurrence table. Though this is redundant in terms of information, it makes it easier to populate an "Evidence" section of a UI by using the OccurrenceEvidence table directly instead of having to traverse multiple tables looking for things that may or may not be there.

acbentley commented 1 year ago

I want to leave out the OccurrenceEvidence from this response. I'll get to that in another comment.

I think the issue with the Events is actually simple:

  1. Each Specify collecting event is a GUM Event. These look present and correct in EVENT.csv.
  2. Each Specify collection object is a GUM Organism/MaterialEntity/Entity. These look correct in the three corresponding csv files.
  3. For every Specify collection object there is a GUM Occurrence/Event with manufactured occurrenceIDs, because those don't exist in Specify. These look correct in the OCCURRENCE.csv table.
  4. The Event records for the Occurrences in the OCCURRENCE.csv file in the EVENT.csv file, but they are defective. They were actually more correct in the previous version of the EVENT.csv file. Every Occurrence record requires an Event record with the eventID=occurrenceID. I believe that right now you have all of the records, but you put the Specify collecting event ID in the eventID field for them instead of the occurrenceID. The values you put in the eventID for the Events that are Occurrences should go into the parentEventID instead of the eventID. That will make your collecting events the parents of the Occurrences that happened there. The locationIDs, verbatim fields, etc. for the Event records that are Occurrences would just be the same those of their parents Events. The eventType for the Event records that are Occurrences must be 'OCCURRENCE'. The eventType for the Event records that are collecting events in Specify should be "COLLECTING_EVENT" rather than "OCCURRENCE", which is what they have right now.

So, I need to include location_id's for all of these? Seems a little redundant? That will involve a separate query out of Specify as I don't currently have a query that links all these things together.

tucotuco commented 1 year ago

It's redundant in much the same way that OccurrenceEvidence is in your case, but it makes things easier on the visualization and data assembly sides of things.

acbentley commented 1 year ago

My problem is that it is harder to link those since the occurence_ids have been manufactured. I will have to re-manufacture them if I need to create a new query that links collecting events to localities.

acbentley commented 1 year ago

OK, have regenerated and uploaded the OCCURRENCE file and EVENT file hopefully with the necessary connections.

timrobertson100 commented 1 year ago

Thanks @acbentley - I'm afraid this is still problematic.

psql:load-specify.sql:27: ERROR:  insert or update on table "event" violates foreign key constraint "event_parent_event_id_fkey"
DETAIL:  Key (parent_event_id)=(db191bf1-1ed3-11e3-bfac-90b11c41863e) is not present in table "event".

There are rows in the event table with parentEventID which don't exist.

acbentley commented 1 year ago

OK, I think I found the problem. Have uploaded a new EVENT file

timrobertson100 commented 1 year ago

Thanks @acbentley

This is very close now. There has been a change today where a missing foreign key was spotted.

In your material_entity table you have collection_id in the form of https://www.gbif.org/grscicoll/collection/e879eb7f-dda0-4c93-9b35-42e39010e8be. That needs to be the key in the collection file.

Please can you change material_entity collection_id to be 3f55b3fa-292d-4170-bd46-66dca41d7f05 and not ´https://www.gbif.org/grscicoll/collection/e879eb7f-dda0-4c93-9b35-42e39010e8be´?

Please can you change the grscicoll_id column in collection to be e879eb7f-dda0-4c93-9b35-42e39010e8be and not https://www.gbif.org/grscicoll/collection/e879eb7f-dda0-4c93-9b35-42e39010e8be

With those 2 changes, your data loads into the shared database

acbentley commented 1 year ago

New COLLECTION and MATERIAL_ENTITY files uploaded with those changes

timrobertson100 commented 1 year ago

Thanks @acbentley

The data loads now into the shared database. There may be additional things, but this is a really important milestone - thank you for all the edits.

The user interface is a work in progress, but you can see an example record here.