gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓

Attaching images to events (camera traps) #4216

Open peterdesmet opened 2 years ago

peterdesmet commented 2 years ago

We are writing a guide with recommendations on how to publish camera trap data to GBIF, but would like some advice on which of the following models to recommend:

1. Occurrence core

This model uses an Occurrence core and Audubon Media Description extension. Images are linked directly to occurrences and are correctly displayed by gbif.org for each occurrence. The drawbacks of this model are:

Screenshot 2022-08-16 at 12 08 09

Example dataset: https://www.gbif-uat.org/dataset/69010415-a5cb-4f03-9d7f-79f4e10dfadf

2. Event core

This model uses an Event core, Occurrence extension and Audubon Media Description extension. Images are linked to events, as are occurrences. This model solves all the drawbacks of the first model (indicated in red). It also aligns better with the unified model (cc @tucotuco). However, for a given occurrence, gbif.org does not show the image(s) associated with that occurrence's event. Nor does it derive information from the parent event when that information is not repeated for the child event (described in https://github.com/gbif/portal-feedback/issues/4217).

Example dataset: https://www.gbif-uat.org/dataset/9664215f-5bf1-472b-a428-257f716d08af

Screenshot 2022-08-16 at 12 08 23

3. Camtrap DP

Camtrap DP (https://tdwg.github.io/camtrap-dp/) is a new model and data format to express all relevant information about a camera trap study. One proof of concept (as part of the work for the unified model) is to allow users to publish data as Camtrap DP using the IPT and have gbif.org understand that model.


@timrobertson100 @tucotuco what should we recommend to people now?

tucotuco commented 2 years ago

I can't answer anything about timing at GBIF, but the Event-based model 2 would have broader impact than either of the other two models, so I would recommend that GBIF enable that in any case.

peterdesmet commented 2 years ago

Originally posted by @muttcg in https://github.com/gbif/portal-feedback/issues/4217#issuecomment-1217961207

@peterdesmet I think you need to re-link the data to fix the issue.

Event with ID 4c1e45dd-51d5-4e2f-9bbf-c07d76acfc1c has no location information and has some "image" type: event-4c1e45dd-51d5-4e2f-9bbf-c07d76acfc1c.txt

A DwC-A (Darwin Core Archive) uses meta.xml to describe the relationships between its files. Extensions are linked to the core via coreid: the DwC-A reader uses the id and coreid fields to link the core file with the extension files. If you want one event record with multiple occurrences, and multimedia linked to both the event and an individual occurrence, you will need to add an extra field with the occurrenceID term to multimedia:

<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Event">
    <files>
      <location>event.txt</location>
    </files>
    <id index="0" />
    <!-- ...other fields -->
  </core>
  <extension encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files>
      <location>occurrence.txt</location>
    </files>
    <coreid index="0" />
    <field index="2" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <!-- ...other fields -->
  </extension>
  <extension encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/ac/terms/Multimedia">
    <files>
      <location>multimedia.txt</location>
    </files>
    <coreid index="0" />
    <field index="2" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <!-- ...other fields -->
  </extension>
</archive>

Your data will then be:

event.txt

id someParentTerm
event_id_777 event_data_777
event_id_888 event_data_888

occurrence.txt

coreid occurrenceID someOccurrenceTerm
event_id_777 occurrence_id_777_1 occurrence_term_777_1
event_id_777 occurrence_id_777_2 occurrence_term_777_2
event_id_777 occurrence_id_777_3 occurrence_term_777_3
event_id_888 occurrence_id_888_1 occurrence_term_888_1
event_id_888 occurrence_id_888_2 occurrence_term_888_2

multimedia.txt

coreid occurrenceID someMultimediaTerm
event_id_777 occurrence_id_777_1 multimedia_term_777_1
event_id_777 occurrence_id_777_2 multimedia_term_777_2
event_id_777 occurrence_id_777_2 multimedia_term_777_2_1
event_id_888 occurrence_id_888_1 multimedia_term_888_1
event_id_888 occurrence_id_888_2 multimedia_term_888_2

After interpretation occurrence data will be represented as:

occurrenceID someOccurrenceTerm eventID someParentTerm someMultimediaTerm
occurrence_id_777_1 occurrence_term_777_1 event_id_777 event_data_777 multimedia_term_777_1
occurrence_id_777_2 occurrence_term_777_2 event_id_777 event_data_777 multimedia_term_777_2;multimedia_term_777_2_1
occurrence_id_777_3 occurrence_term_777_3 event_id_777 event_data_777
occurrence_id_888_1 occurrence_term_888_1 event_id_888 event_data_888 multimedia_term_888_1
occurrence_id_888_2 occurrence_term_888_2 event_id_888 event_data_888 multimedia_term_888_2
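
For readers following along, the linking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GBIF ingestion code: the function name `interpret` and the in-memory tuples are hypothetical, and the rows are copied from the example files above. Extension rows are joined to the core on coreid, and media are attached to an occurrence only when both coreid and occurrenceID match.

```python
from collections import defaultdict

# Rows copied from the example files above (assumed column order:
# id/coreid first, then occurrenceID where present, then the data term).
events = [
    ("event_id_777", "event_data_777"),
    ("event_id_888", "event_data_888"),
]
occurrences = [
    ("event_id_777", "occurrence_id_777_1", "occurrence_term_777_1"),
    ("event_id_777", "occurrence_id_777_2", "occurrence_term_777_2"),
    ("event_id_777", "occurrence_id_777_3", "occurrence_term_777_3"),
    ("event_id_888", "occurrence_id_888_1", "occurrence_term_888_1"),
    ("event_id_888", "occurrence_id_888_2", "occurrence_term_888_2"),
]
multimedia = [
    ("event_id_777", "occurrence_id_777_1", "multimedia_term_777_1"),
    ("event_id_777", "occurrence_id_777_2", "multimedia_term_777_2"),
    ("event_id_777", "occurrence_id_777_2", "multimedia_term_777_2_1"),
    ("event_id_888", "occurrence_id_888_1", "multimedia_term_888_1"),
    ("event_id_888", "occurrence_id_888_2", "multimedia_term_888_2"),
]


def interpret(events, occurrences, multimedia):
    """Hypothetical helper: build one flat record per occurrence."""
    event_by_id = dict(events)

    # Group media by (coreid, occurrenceID) so an image only lands on the
    # occurrence it names, within the event it is attached to.
    media_by_key = defaultdict(list)
    for coreid, occurrence_id, term in multimedia:
        media_by_key[(coreid, occurrence_id)].append(term)

    rows = []
    for coreid, occurrence_id, term in occurrences:
        rows.append({
            "occurrenceID": occurrence_id,
            "someOccurrenceTerm": term,
            "eventID": coreid,
            "someParentTerm": event_by_id[coreid],
            "someMultimediaTerm": ";".join(
                media_by_key.get((coreid, occurrence_id), [])
            ),
        })
    return rows


interpreted = interpret(events, occurrences, multimedia)
for row in interpreted:
    print(row)
```

Keying the media on the pair (coreid, occurrenceID) rather than on occurrenceID alone is a choice made for this sketch; it keeps an image from being attached to an occurrence that belongs to a different event.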

peterdesmet commented 2 years ago

Thanks for the reply @muttcg. The conceptual issue I have with your approach is that it requires linking media to both events and observations.

  1. That relationship could arguably be inferred (i.e. one can show the media of an event the observation is linked to)
  2. Adding an occurrenceID to multimedia.txt could create inconsistencies (e.g. what if the occurrenceID is not related to the provided eventID)
  3. Adding an occurrenceID to multimedia.txt would increase the number of records, because I now have to add extra records if more than one observation is linked to the eventID

mdoering commented 2 years ago

I would also argue for a regular event model until the unified model is a proper option. I would also avoid adding occurrenceID to multimedia, for the reasons given by @peterdesmet and because I prefer simple relationships. It just boils down to how much work can be done in time on the GBIF processing.

timrobertson100 commented 2 years ago

@peterdesmet - can I ask for a clarification of expected behavior for option 2?

Given that we aim to accommodate this properly in the future, but today have to wrangle this into occurrence records (i.e. it is a hack), would it be reasonable that the event and images that don't have occurrences (E2 and I2) be dropped?

My understanding is that would be those photos that hadn't been annotated by AI or people as having organisms in them.

ETA: The occurrences that are created would all have the images from the event / parent events attached to them, along with inheriting relevant metadata from the events.

mdoering commented 2 years ago

... and that an occurrence receives all images from the core event and its parent events that it is linked from? So all 3 occurrence_id_777_x records would have the same 3 images from the 777 event?
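
To make the question concrete, here is a minimal sketch of that behaviour. It is an illustration only, not GBIF's processing code: `images_for_occurrence`, `parent_of` and the other dictionaries are hypothetical names, the identifiers are taken from the 777 example earlier in the thread, and the events are assumed to have no parents.

```python
def images_for_occurrence(event_id, parent_of, images_by_event):
    """Collect images attached to an event and to each of its ancestors."""
    images = []
    seen = set()  # guards against accidental cycles in parent links
    while event_id is not None and event_id not in seen:
        seen.add(event_id)
        images.extend(images_by_event.get(event_id, []))
        event_id = parent_of.get(event_id)
    return images


# Assumed data: event 777 carries three images and has no parent event.
parent_of = {"event_id_777": None}
images_by_event = {
    "event_id_777": [
        "multimedia_term_777_1",
        "multimedia_term_777_2",
        "multimedia_term_777_2_1",
    ],
}
event_of_occurrence = {
    "occurrence_id_777_1": "event_id_777",
    "occurrence_id_777_2": "event_id_777",
    "occurrence_id_777_3": "event_id_777",
}

images_of_occurrence = {
    occurrence_id: images_for_occurrence(event_id, parent_of, images_by_event)
    for occurrence_id, event_id in event_of_occurrence.items()
}
for occurrence_id, images in images_of_occurrence.items():
    print(occurrence_id, images)
```

Under these assumptions every occurrence of the event ends up with the same three images, which is exactly the outcome being asked about.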

peterdesmet commented 2 years ago

@timrobertson100

... would it be reasonable that the event and images that don't have occurrences (E2 and I2) be dropped?

Yes that is reasonable. They would still be available in the source data (in the IPT, in a model that makes sense) and could be indexed in a future implementation.

My understanding is that would be those photos that hadn't been annotated by AI or people as having organisms in them.

Correct.

ETA: The occurrences that are created would all have the images from the event / parent events attached to them, along with inheriting relevant metadata from the events.

Yes, although for my use case parent events would not have images attached to them.

For inheriting metadata from parent events, it is important that information already available at the child event is not overwritten by information from the parent event (which is often less precise/applicable). E.g. in the above example, occurrence O1 is linked to event E1, which has a parent event E. Both E1 and E have eventDate information. The eventDate info should be used from the child E1 (2022-05-04T20:18:35Z), not from the parent E (2022-05-04T20:18:35Z/2022-06-01T03:58:27Z). For properties that are not defined for the child, but are available for the parent, it makes sense to trickle those down (see #4217).
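
A rough sketch of the rule I have in mind, assuming events are plain dictionaries keyed by Darwin Core term names. The helper names `inherit` and `resolve` are made up for illustration, and the `samplingProtocol` value is a hypothetical property that only the parent defines; the eventDate values are the ones from the example.

```python
def inherit(child, parent):
    """Fill in parent values only for terms the child does not provide."""
    merged = dict(parent)
    # Non-empty child values always win over the parent's values.
    merged.update({k: v for k, v in child.items() if v not in (None, "")})
    return merged


def resolve(event_id, events):
    """Resolve an event against its chain of parents (assumes no cycles)."""
    event = events[event_id]
    parent_id = event.get("parentEventID")
    if not parent_id:
        return dict(event)
    return inherit(event, resolve(parent_id, events))


events = {
    "E": {
        "eventID": "E",
        "eventDate": "2022-05-04T20:18:35Z/2022-06-01T03:58:27Z",
        "samplingProtocol": "camera trap",  # hypothetical parent-only value
    },
    "E1": {
        "eventID": "E1",
        "parentEventID": "E",
        "eventDate": "2022-05-04T20:18:35Z",
    },
}

resolved = resolve("E1", events)
print(resolved)
```

Here E1 keeps its own, more precise eventDate, while samplingProtocol trickles down from E because E1 does not define it.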

timrobertson100 commented 2 years ago

Thanks, @peterdesmet - that all makes sense to me.

muttcg commented 2 years ago

Clarification for the current production ingestion:

1) Child term values override parent values. For example, if the parent event has eventDate 2000 and the occurrence linked to the event has 20/10/2000, the final value is 20/10/2000.
2) Both occurrence and image records, as extensions, must be linked to the core event record. When you add occurrenceID to the image, it is attached to that occurrence of the event, so no random images appear for occurrences. This is a workaround in the ingestion code, implemented because some users link their data that way. It is based on user experience and is neither an official nor a recommended approach.

Another possible workaround is to add some ID to multimedia and then to occurrences, and to add that functionality to the ingestion code:

                event.csv
                - ev_id1, ev1_etc...
      /                                   \
occ.csv                                       multimedia.csv
1) ev_id1, occ_id1, img_id1, occ_1etc...      1) ev_id1, img_id1, img1_etc...
2) ev_id1, occ_id2, img_id1, occ_2etc...

Result 2 occurrences:
- occ_id1, occ_1etc... , ev1_etc... , img1_etc...
- occ_id2, occ_2etc... , ev1_etc... , img1_etc...
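
A minimal sketch of how that lookup could work, purely as an illustration and not existing ingestion code. The tuples mirror the diagram, the variable names are hypothetical, and the second occurrence row is taken to be occ_id2, as the result list implies.

```python
# Core: event id -> event data
event = {"ev_id1": "ev1_etc"}

# Extensions: occurrences carry an image id, multimedia rows are keyed by it.
occ_rows = [
    ("ev_id1", "occ_id1", "img_id1", "occ_1etc"),
    ("ev_id1", "occ_id2", "img_id1", "occ_2etc"),
]
media_rows = [
    ("ev_id1", "img_id1", "img1_etc"),
]

# Index media by (event id, image id) so one image row can serve
# several occurrences without being repeated in multimedia.csv.
media_by_key = {(ev_id, img_id): data for ev_id, img_id, data in media_rows}

result = [
    (occ_id, occ_data, event[ev_id], media_by_key.get((ev_id, img_id)))
    for ev_id, occ_id, img_id, occ_data in occ_rows
]
for row in result:
    print(row)
```

In this variant the image record is stored once and shared by both occurrences, at the cost of a non-standard extra ID column in both extension files.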

fmendezh commented 2 years ago

The prototype we developed for ingesting event data works exactly like that: temporal and other fields are inherited only if they are not provided at the sub-event level. In the case of occurrences, inheriting data from events has not been considered yet, but it seems feasible. For the short term, I suggest that the easiest approach is to follow @muttcg's recommendation: use the occurrenceID field and repeat the image for each linked occurrence.

timrobertson100 commented 2 years ago

Having looked at the comments and proposals on this thread, I feel we could be a bit bolder and seriously consider if we can commit to Camtrap DP.

Some of the proposals on this thread seem like confusing workarounds that may prove fragile and even appear to go against our existing documented standards.

If we went for Camtrap DP:

I've discussed this with members of GBIF informatics (@mdoering, @muttcg, @MattBlissett, @fmendezh) and with @peterdesmet, and sensed this seems reasonable to them; they noted that their proposals above were really looking for workarounds. @tucotuco and I are also in support, as this is well aligned with ideas emerging from the new GBIF data model investigation as a publishing model for this community.

Any thoughts, concerns, or support?