Open peterdesmet opened 2 years ago
I can't answer anything about timing at GBIF, but the Event-based mode 2 would have broader impact than either of the other two models, so I would recommend that GBIF enable that in any case.
Originally posted by @muttcg in https://github.com/gbif/portal-feedback/issues/4217#issuecomment-1217961207
@peterdesmet I think you need to re-link the data to fix the issue.
Event with ID 4c1e45dd-51d5-4e2f-9bbf-c07d76acfc1c has no location information and has some "image" type: event-4c1e45dd-51d5-4e2f-9bbf-c07d76acfc1c.txt
DwC-A uses meta.xml to describe the relationships between files: extensions are linked to the core via coreid, and the DwC-A reader uses the id and coreid fields to link the core file with the extension files. If you want one event record with multiple occurrences, and multimedia linked to both the event and an individual occurrence, you will need to add an extra field with the occurrenceID term to multimedia:
<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
<core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Event">
<files>
<location>event.txt</location>
</files>
<id index="0" />
<!-- ...other fields -->
</core>
<extension encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
<files>
<location>occurrence.txt</location>
</files>
<coreid index="0" />
<field index="1" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
<!-- ...other fields -->
</extension>
<extension encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/ac/terms/Multimedia">
<files>
<location>multimedia.txt</location>
</files>
<coreid index="0" />
<field index="1" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
<!-- ...other fields -->
</extension>
</archive>
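As a rough illustration of how a reader could pick up the linking columns from such a meta.xml, here is a minimal Python sketch using only the standard library. It is not the GBIF DwC-A reader; `describe_links` is a hypothetical helper, the XML is a trimmed copy of the descriptor above, and it assumes occurrenceID sits in the second column (index 1) of each extension file.

```python
import xml.etree.ElementTree as ET

# Namespace used by Darwin Core Archive descriptors (taken from the meta.xml above).
NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}
OCCURRENCE_ID = "http://rs.tdwg.org/dwc/terms/occurrenceID"

# Trimmed descriptor: only the elements needed to find the linking columns.
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core rowType="http://rs.tdwg.org/dwc/terms/Event">
    <files><location>event.txt</location></files>
    <id index="0" />
  </core>
  <extension rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <coreid index="0" />
    <field index="1" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
  </extension>
  <extension rowType="http://rs.tdwg.org/ac/terms/Multimedia">
    <files><location>multimedia.txt</location></files>
    <coreid index="0" />
    <field index="1" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
  </extension>
</archive>"""


def describe_links(meta_xml):
    """Map each extension file to its coreid column and occurrenceID column (hypothetical helper)."""
    root = ET.fromstring(meta_xml)
    links = {}
    for ext in root.findall("dwc:extension", NS):
        location = ext.find("dwc:files/dwc:location", NS).text
        coreid_index = int(ext.find("dwc:coreid", NS).get("index"))
        occurrence_index = None
        for field in ext.findall("dwc:field", NS):
            if field.get("term") == OCCURRENCE_ID:
                occurrence_index = int(field.get("index"))
        links[location] = {"coreid": coreid_index, "occurrenceID": occurrence_index}
    return links


LINKS = describe_links(META_XML)
```

With these indexes a reader knows which column ties an extension row to the event and which column narrows it to one occurrence.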
Your data will then be:
event.txt

| id | someParentTerm |
|---|---|
| event_id_777 | event_data_777 |
| event_id_888 | event_data_888 |

occurrence.txt

| coreid | occurrenceID | someOccurrenceTerm |
|---|---|---|
| event_id_777 | occurrence_id_777_1 | occurrence_term_777_1 |
| event_id_777 | occurrence_id_777_2 | occurrence_term_777_2 |
| event_id_777 | occurrence_id_777_3 | occurrence_term_777_3 |
| event_id_888 | occurrence_id_888_1 | occurrence_term_888_1 |
| event_id_888 | occurrence_id_888_2 | occurrence_term_888_2 |

multimedia.txt

| coreid | occurrenceID | someMultimediaTerm |
|---|---|---|
| event_id_777 | occurrence_id_777_1 | multimedia_term_777_1 |
| event_id_777 | occurrence_id_777_2 | multimedia_term_777_2 |
| event_id_777 | occurrence_id_777_2 | multimedia_term_777_2_1 |
| event_id_888 | occurrence_id_888_1 | multimedia_term_888_1 |
| event_id_888 | occurrence_id_888_2 | multimedia_term_888_2 |

After interpretation, occurrence data will be represented as:

| occurrenceID | someOccurrenceTerm | eventID | someParentTerm | someMultimediaTerm |
|---|---|---|---|---|
| occurrence_id_777_1 | occurrence_term_777_1 | event_id_777 | event_data_777 | multimedia_term_777_1 |
| occurrence_id_777_2 | occurrence_term_777_2 | event_id_777 | event_data_777 | multimedia_term_777_2;multimedia_term_777_2_1 |
| occurrence_id_777_3 | occurrence_term_777_3 | event_id_777 | event_data_777 | |
| occurrence_id_888_1 | occurrence_term_888_1 | event_id_888 | event_data_888 | multimedia_term_888_1 |
| occurrence_id_888_2 | occurrence_term_888_2 | event_id_888 | event_data_888 | multimedia_term_888_2 |
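The interpretation step above can be sketched in a few lines of Python. This is only an illustration of the join, not the actual ingestion code; `interpret` is a hypothetical function, and joining multiple media values with `;` simply mirrors the table above.

```python
from collections import defaultdict

# Sample rows copied from the tables above.
events = {"event_id_777": "event_data_777", "event_id_888": "event_data_888"}
occurrences = [
    ("event_id_777", "occurrence_id_777_1", "occurrence_term_777_1"),
    ("event_id_777", "occurrence_id_777_2", "occurrence_term_777_2"),
    ("event_id_777", "occurrence_id_777_3", "occurrence_term_777_3"),
    ("event_id_888", "occurrence_id_888_1", "occurrence_term_888_1"),
    ("event_id_888", "occurrence_id_888_2", "occurrence_term_888_2"),
]
multimedia = [
    ("event_id_777", "occurrence_id_777_1", "multimedia_term_777_1"),
    ("event_id_777", "occurrence_id_777_2", "multimedia_term_777_2"),
    ("event_id_777", "occurrence_id_777_2", "multimedia_term_777_2_1"),
    ("event_id_888", "occurrence_id_888_1", "multimedia_term_888_1"),
    ("event_id_888", "occurrence_id_888_2", "multimedia_term_888_2"),
]


def interpret(events, occurrences, multimedia):
    """Flatten event, occurrence and multimedia rows into one record per occurrence."""
    # Group media by (coreid, occurrenceID) so a media row only attaches to the
    # occurrence it names, within the event it is linked to.
    media_by_occurrence = defaultdict(list)
    for core_id, occurrence_id, media_term in multimedia:
        media_by_occurrence[(core_id, occurrence_id)].append(media_term)

    rows = []
    for core_id, occurrence_id, occurrence_term in occurrences:
        media = media_by_occurrence.get((core_id, occurrence_id), [])
        rows.append({
            "occurrenceID": occurrence_id,
            "someOccurrenceTerm": occurrence_term,
            "eventID": core_id,
            "someParentTerm": events[core_id],  # inherited from the core event
            "someMultimediaTerm": ";".join(media),
        })
    return rows


ROWS = interpret(events, occurrences, multimedia)
```

Here `occurrence_id_777_2` ends up with both of its media terms and `occurrence_id_777_3` with none.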
Thanks for the reply @muttcg. The conceptual issue I have with your approach is that it requires linking media to events and observations.
- Adding occurrenceID to multimedia.txt could create inconsistencies (e.g. what if the occurrenceID is not related to the provided eventID)
- Adding occurrenceID to multimedia.txt would increase the number of records, because I now have to add extra records if more than one observation is linked to the eventID
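To make the inconsistency concern concrete, here is a small Python sketch of the kind of check a publisher would need to run before publishing. `find_inconsistent_media` is a hypothetical helper and the sample rows are illustrative only.

```python
def find_inconsistent_media(occurrences, multimedia):
    """Return multimedia rows whose occurrenceID is unknown or belongs to a different event.

    Both inputs are lists of (coreid, occurrenceID) pairs.
    """
    event_of = {occurrence_id: core_id for core_id, occurrence_id in occurrences}
    return [
        (core_id, occurrence_id)
        for core_id, occurrence_id in multimedia
        if event_of.get(occurrence_id) != core_id
    ]


# Illustrative rows: the second media row points at an occurrence of another event.
occurrences = [
    ("event_id_777", "occurrence_id_777_1"),
    ("event_id_888", "occurrence_id_888_1"),
]
multimedia = [
    ("event_id_777", "occurrence_id_777_1"),
    ("event_id_777", "occurrence_id_888_1"),
]

PROBLEMS = find_inconsistent_media(occurrences, multimedia)
```

Any row returned here is a media record whose event and occurrence links disagree.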
I would also argue for a regular event model until the unified model is a proper option. I would also avoid adding occurrenceID to multimedia, for the reasons given by @peterdesmet and because I prefer simple relationships. It just boils down to how much work can be done in time on the GBIF processing.
@peterdesmet - can I ask for a clarification of expected behavior for option 2?
Knowing that we aim to accommodate this properly in the future, but today wrangle this into occurrence records (i.e. it is a hack) would it be reasonable that the event and images that don't have occurrences (E2 and I2) be dropped?
My understanding is that would be those photos that hadn't been annotated by AI or people as having organisms in them.
ETA: The occurrences that are created would all have the images from the event / parent events attached to them, along with inheriting relevant metadata from the events.
... and that an occurrence receives all images from the core event and its parent events that it is linked from? So all 3 occurrence_id_777_x records would have the same 3 images from the 777 event?
@timrobertson100
> ... would it be reasonable that the event and images that don't have occurrences (E2 and I2) be dropped?

Yes that is reasonable. They would still be available in the source data (in the IPT, in a model that makes sense) and could be indexed in a future implementation.

> My understanding is that would be those photos that hadn't been annotated by AI or people as having organisms in them.

Correct.

> ETA: The occurrences that are created would all have the images from the event / parent events attached to them, along with inheriting relevant metadata from the events.

Yes, although for my use case parent events would not have images attached to them.
For inheriting metadata from parent events, it is important that information already available at the child event is not overwritten by information from the parent event (which is often less precise/applicable). E.g. in the above example, occurrence O1 is linked to event E1, which has a parent event E. Both E1 and E have eventDate information. The eventDate info should be used from the child E1 (2022-05-04T20:18:35Z), not from the parent E (2022-05-04T20:18:35Z/2022-06-01T03:58:27Z). For properties that are not defined for the child, but are available for the parent, it makes sense to trickle those down (see #4217).
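A minimal Python sketch of that inheritance rule, assuming events are held in a dictionary keyed by eventID with an optional parentEventID. `resolve_terms` is a hypothetical helper and the `locationID` value is invented for illustration; only the eventDate values come from the example above.

```python
def resolve_terms(event_id, events):
    """Collect terms for an event, letting child values win over parent values."""
    resolved = {}
    seen = set()  # guards against accidental cycles in parentEventID
    current = event_id
    while current is not None and current not in seen:
        seen.add(current)
        record = events[current]
        for term, value in record["terms"].items():
            # Only fill a term if no closer (child) event already provided it.
            if value not in (None, ""):
                resolved.setdefault(term, value)
        current = record.get("parentEventID")
    return resolved


EVENTS = {
    "E": {
        "parentEventID": None,
        "terms": {
            "eventDate": "2022-05-04T20:18:35Z/2022-06-01T03:58:27Z",
            "locationID": "loc_1",  # hypothetical value, defined only on the parent
        },
    },
    "E1": {
        "parentEventID": "E",
        "terms": {"eventDate": "2022-05-04T20:18:35Z"},
    },
}

RESOLVED = resolve_terms("E1", EVENTS)
```

E1 keeps its own, more precise eventDate, while the locationID that only the parent defines trickles down.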
Thanks, @peterdesmet - that all makes sense to me.
Clarification for the current production ingestion:
1) Child term values override parent values, e.g. if the parent event has eventDate 2000 and the occurrence linked to the event has 20/10/2000, the final value is 20/10/2000.
2) Both occurrence and image records, as extensions, must be linked to the event core record. When you add occurrenceID to the image, it will be attached to that occurrence of the event, so no random images appear for occurrences. This is a workaround in the ingestion code, implemented because some users link their data that way; it is based on user experience and is neither an official nor a recommended approach.
Another possible workaround is to add an ID to multimedia and then to occurrences, and add that functionality to the ingestion code:

                    event.csv
               - ev_id1, ev1_etc...
                 /              \
occ.csv                                     multimedia.csv
1) ev_id1, occ_id1, img_id1, occ_1etc...    1) ev_id1, img_id1, img1_etc...
2) ev_id1, occ_id2, img_id1, occ_2etc...

Result 2 occurrences:
- occ_id1, occ_1etc... , ev1_etc... , img1_etc...
- occ_id2, occ_2etc... , ev1_etc... , img1_etc...
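A rough Python sketch of this alternative, assuming the ingestion code were extended to match on an image ID. No such functionality exists today according to this thread; `link_by_image_id` is a hypothetical function and the row values are the placeholders from the diagram.

```python
# Placeholder rows from the diagram above.
event_rows = {"ev_id1": "ev1_etc"}
occ_rows = [
    ("ev_id1", "occ_id1", "img_id1", "occ_1etc"),
    ("ev_id1", "occ_id2", "img_id1", "occ_2etc"),
]
media_rows = [("ev_id1", "img_id1", "img1_etc")]


def link_by_image_id(event_rows, occ_rows, media_rows):
    """Attach each media row to every occurrence that references its image ID."""
    media_by_key = {(event_id, image_id): data for event_id, image_id, data in media_rows}
    return [
        (occ_id, occ_data, event_rows[event_id], media_by_key.get((event_id, image_id)))
        for event_id, occ_id, image_id, occ_data in occ_rows
    ]


RESULT = link_by_image_id(event_rows, occ_rows, media_rows)
```

The image is stored once in multimedia.csv and both occurrences pick it up through `img_id1`, so media rows do not have to be repeated per occurrence.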
The prototype we developed for ingesting event data works exactly like that: temporal and other fields are inherited only if they are not provided at the sub-event level. In the case of occurrences, inheriting data from events has not been considered yet but seems feasible. For the short term, I suggest that the easiest approach is to follow @muttcg's recommendation: use the occurrenceID field and repeat the image for each linked occurrence.
Having looked at the comments and proposals on this thread, I feel we could be a bit bolder and seriously consider if we can commit to Camtrap DP.
Some of the proposals on this thread seem like confusing workarounds which may prove fragile and look to even go against our existing documented standards.
If we went for Camtrap DP:
- [...] DwCAReader) to extract from the data package the occurrences, image references, event IDs and perform the typical indexing of the species occurrence data.
I've discussed this with members of GBIF informatics (@mdoering, @muttcg, @MattBlissett, @fmendezh) and with @peterdesmet, and my sense is that this seems reasonable to them; they noted that their proposals above were really looking for workarounds. @tucotuco and I also support this, as it is well aligned with ideas emerging from the new GBIF data model investigation as a publishing model for this community.
Any thoughts, concerns, or support?
We are writing a guide with recommendations on how to publish camera trap data to GBIF, but would like some advice on which of the following models to recommend:
1. Occurrence core
This model uses an Occurrence core and Audubon Media Description extension. Images are linked directly to occurrences and are correctly displayed by gbif.org for each occurrence. The drawbacks of this model are:
- [...] (I1)
- [...] (E2 absent)
- [...] (I2 absent), i.e. images that are considered empty or that were not assessed.
- [...] (E)

Example dataset: https://www.gbif-uat.org/dataset/69010415-a5cb-4f03-9d7f-79f4e10dfadf
2. Event core
This model uses an Event core, Occurrence extension and Audubon Media Description extension. Images are linked to events, as are occurrences. This model solves all the drawbacks of the first model (indicated in red). It also aligns better with the unified model (cc @tucotuco). However, gbif.org does not show, for an occurrence, the image(s) associated with that occurrence's event. Neither does it derive information from the parent event if it wasn't repeated for the child event (described in https://github.com/gbif/portal-feedback/issues/4217).
Example dataset: https://www.gbif-uat.org/dataset/9664215f-5bf1-472b-a428-257f716d08af
3. Camtrap DP
Camtrap DP (https://tdwg.github.io/camtrap-dp/) is a new model and data format to express all relevant information about a camera trap study. One proof of concept (as part of the work for the unified model) is to allow users to publish data as Camtrap DP using the IPT and have gbif.org understand that model.
@timrobertson100 @tucotuco what should we recommend to people now?