gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓

Disappearing records due to referential integrity failure #4491

Open Mesibov opened 1 year ago

Mesibov commented 1 year ago

This issue has previously been discussed in comments here: https://discourse.gbif.org/t/occurrence-records-without-their-event-records/3194/2

In the latest case I've noticed, version 1.5 of this dataset: https://www.gbif.org/dataset/ced770f9-7dd5-49c6-8030-795dd409921a has 838 records in occurrence.txt in the DwCA, but only 546 have been published through GBIF. The reason is that there are 141 eventID entries in occurrence.txt that are not listed in event.txt.
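
The mismatch described above can be checked locally before publishing. The following is a minimal sketch, not GBIF's or the IPT's implementation. It assumes the archive holds tab-delimited, UTF-8 files named event.txt and occurrence.txt, each with a header row containing an eventID column; in a real DwCA the file names, delimiters and column positions are declared in meta.xml and may differ. The function name `orphan_event_ids` and the helper `_column` are hypothetical.

```python
import csv
import io
import zipfile
from collections import Counter


def orphan_event_ids(archive_path, core="event.txt", ext="occurrence.txt", key="eventID"):
    """Count occurrence rows per eventID that has no matching row in the event file.

    Assumes tab-delimited UTF-8 files with a header row (an assumption, see meta.xml).
    Blank eventIDs are counted as unmatched unless the event file also has a blank one.
    """
    with zipfile.ZipFile(archive_path) as dwca:
        event_ids = set(_column(dwca, core, key))
        used = Counter(_column(dwca, ext, key))
    return Counter({eid: n for eid, n in used.items() if eid not in event_ids})


def _column(dwca, member, key):
    """Yield the values of one named column from a file inside the open archive."""
    with dwca.open(member) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row[key]
```

With `orphans = orphan_event_ids("dwca.zip")`, `len(orphans)` is the number of distinct unmatched eventIDs and `sum(orphans.values())` is the number of occurrence records that use them. The two figures differ whenever several occurrences share one missing eventID.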

Please advise: (1) Were the referential integrity failures detected by GBIF? (2) Was the data publisher notified?

This particular dataset has been audited for a Pensoft data paper and the next version will probably have the issue fixed.

CecSve commented 1 year ago

> (1) Were the referential integrity failures detected by GBIF? (2) Was the data publisher notified?

Currently, GBIF does not check referential integrity and does not notify the publisher. A check is planned for the future, though (see the two issues in which I have mentioned this one).

albenson-usgs commented 1 year ago

I am really confused by this ticket, because I am relatively certain that a referential integrity check is one of the checks performed by the IPT and that it will not let you publish a dataset if coreIDs are missing.

CecSve commented 1 year ago

@albenson-usgs I ran a small test in our test IPT with seven occurrences and seven eventIDs, but with the event core containing only five eventIDs. As a result, only the five occurrences that matched the event core were published. I did not receive any notice that occurrences had been dropped. [image]

albenson-usgs commented 1 year ago

Thanks for checking that, @CecSve! I must have been thinking of the check that the core ID is present and unique. It seems this referential integrity check would be good to add to the IPT.
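
For contrast with the referential integrity check, a "present and unique" check on core IDs looks only at the core file itself. A minimal sketch, assuming the IDs have already been read into a list; the function name `core_id_problems` is hypothetical and this is not the IPT's code:

```python
from collections import Counter


def core_id_problems(ids):
    """Return (number of blank IDs, sorted list of IDs that occur more than once)."""
    cleaned = [(i or "").strip() for i in ids]  # treat None and whitespace as blank
    blank = sum(1 for i in cleaned if not i)
    counts = Counter(i for i in cleaned if i)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    return blank, duplicates
```

A core file can pass this check and still leave extension rows without a matching core record, which is the case discussed in this thread.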

Mesibov commented 1 year ago

@CecSve Many thanks, it's good to know the referential integrity issue is still on the GBIF "to do" list.