GBIF export counts inconsistent with ALA indexed counts

AtlasOfLivingAustralia / biocache-store

Occurrence processing, indexing and batch processing

GBIF export counts inconsistent with ALA indexed counts #223

Closed djtfmartin closed 1 year ago

djtfmartin commented 6 years ago

The GBIF DwCA exports from Cassandra contain, for several data resources, a different number of records from what we currently have indexed in SOLR.

Examples:

http://collections.ala.org.au/public/show/dr2287
ALA index count: 523,117 records
GBIF export count: 523,184 records

http://collections.ala.org.au/public/show/dr340
ALA index count: 1,330,614 records
GBIF export count: 1,330,706 records
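For scale, the gaps in the two examples above are small relative to the dataset sizes. A minimal sketch of the arithmetic, using only the counts quoted above (the `surplus` helper is hypothetical, for illustration only):

```python
# Counts quoted in the examples: (ALA SOLR index count, GBIF export count)
counts = {
    "dr2287": (523_117, 523_184),
    "dr340": (1_330_614, 1_330_706),
}


def surplus(ala_count, gbif_count):
    """Records GBIF holds beyond the ALA index (negative if GBIF has fewer)."""
    return gbif_count - ala_count


for data_resource, (ala, gbif) in counts.items():
    print(f"{data_resource}: GBIF has {surplus(ala, gbif)} more records")
# dr2287: GBIF has 67 more records
# dr340: GBIF has 92 more records
```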

I ran the following to determine what we have in the DB

biocache export -c uuid -s 'dr2287|' -e 'dr2287|~' occ /tmp/dr2287-uuid.txt
biocache export -c uuid -s 'dr340|' -e 'dr340|~' occ /tmp/dr340-uuid.txt

The counts from these exports match the numbers GBIF are seeing. Where the counts differ, GBIF usually have more records.

So my guess is we have a SOLR indexing bug.

There's a breakdown of the numbers here:

http://s3-eu-west-1.amazonaws.com/tim-oz-datasets/index-oz.html

cc @timrobertson100 @ansell @sadeghim

timrobertson100 commented 6 years ago

The summary view starts by calling SOLR in ALA to list the data resources, and then checks GBIF for each one.

Where it is yellow, GBIF have more records (could indicate an issue in GBIF indexing, or inconsistencies between ALA SOLR and Cassandra).
Where it is red, GBIF have fewer records (could indicate an issue in GBIF, or that ALA SOLR has too many records).
Where it is white, please consider whether it could or should be registered in GBIF. Datasets with incompatible licenses, for example, will remain white in this view. Many datasets should not be registered (e.g. eBird, the GBIF.org sourced data); where that has been consciously decided, please mark them as "notShareable" explicitly so others don't need to investigate.
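The colour rules above can be sketched as a small function. This is an illustration only, not the code behind the summary page: the `classify` name, the use of `None` for a dataset that is not registered in GBIF, and the `"match"` label for equal counts are all assumptions.

```python
from typing import Optional


def classify(ala_count: int, gbif_count: Optional[int]) -> str:
    """Return the row colour described in the list above.

    gbif_count is None when the dataset is not registered in GBIF
    (an assumed representation, not taken from the real page).
    """
    if gbif_count is None:
        return "white"   # not in GBIF: consider registering or mark notShareable
    if gbif_count > ala_count:
        return "yellow"  # GBIF have more records than ALA SOLR
    if gbif_count < ala_count:
        return "red"     # GBIF have fewer records than ALA SOLR
    return "match"       # equal counts; label is an assumption


print(classify(523_117, 523_184))
```

With the dr2287 counts from the first comment, the function returns "yellow", which is consistent with GBIF holding more records than the ALA index.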

djtfmartin commented 6 years ago

Looks like this problem only affects the following 4 data resources:

Australian Museum provider for OZCAM
Museums Victoria provider for OZCAM
Queensland Herbarium Records
South Australian Museum Australia provider for OZCAM

which I think we dynamically (periodically) harvest, so perhaps it's just a timing issue with index generation.

ansell commented 6 years ago

The archives are only created once a month, whereas data resources and Solr are updated more frequently. However, we have a single executor/agent in Jenkins for executing the Solr index generation, the data loading, and the archive generation, so it shouldn't be an issue with overlapping operations. We have stayed on that single executor model for the sole reason of avoiding inconsistencies, as it otherwise limits us to running a single operation at a time.

The first place to look would probably be the index creation, using the Jenkins logs on cassandra-b4. The index generation process is generally very lenient with errors, silently ignoring them, but sometimes it does log the errors, so you may be able to find them in the logs.

Otherwise, you may need to do a UUID comparison, but given that GBIF don't have the actual occurrenceIDs, you will need to do a dump on the ALA Cassandra database to identify the original record primary keys in each case, based on the UUIDs that GBIF have and that ALA Solr doesn't currently have. Then you would need to try to individually reindex those records using some verbose error mode, or in a debugger, to see if there are errors.
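The lookup from UUID back to the original primary key could be sketched as below. The dump format is an assumption: a tab-separated file with the row key in the first column and the UUID in the second. The function name and file layout are hypothetical and are not part of biocache-store.

```python
import csv


def rowkeys_for_uuids(dump_path, wanted_uuids):
    """Map each wanted UUID to its row key, reading an assumed TSV dump.

    Each line of the dump is expected to be: <rowkey> TAB <uuid>.
    UUIDs that do not appear in the dump are simply absent from the result.
    """
    wanted = set(wanted_uuids)
    found = {}
    with open(dump_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            rowkey, uuid = row[0], row[1]
            if uuid in wanted:
                found[uuid] = rowkey
    return found
```

The resulting row keys are the records to try reindexing one at a time.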

However, given how lenient biocache-store is for errors during operations, I am not surprised that there are differences in the numbers. It is a product of its development timeline and architecture. We have a project to be scheduled to split the components out of biocache-store into separate maintainable libraries so we can start adding test cases for each part and allow more flexible architecture choices, but it hasn't got past initial conception at this stage.

djtfmartin commented 6 years ago

GBIF have the occurrenceIDs that we supplied, i.e. the UUIDs we create and store in Cassandra. These are in the downloads from GBIF (and are visible on GBIF occurrence record pages), so they can be used to identify the records that haven't been indexed by us but were supplied to them.

Besides, this is just an ALA problem with indexing the data in our own DB. We can just export UUIDs from the SOLR index and compare them with the export from Cassandra to work out which records haven't been indexed.
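A minimal sketch of that comparison, assuming both exports are plain text files with one UUID per line and no header row. The Cassandra file would be something like /tmp/dr2287-uuid.txt from the export above; the SOLR export path in the usage comment is hypothetical.

```python
def read_uuids(path):
    """Read one UUID per line into a set, ignoring blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def missing_from_index(cassandra_path, solr_path):
    """UUIDs present in the Cassandra export but absent from the SOLR export."""
    return sorted(read_uuids(cassandra_path) - read_uuids(solr_path))


# Example usage (the SOLR export file name is hypothetical):
# for uuid in missing_from_index("/tmp/dr2287-uuid.txt", "/tmp/dr2287-solr-uuid.txt"):
#     print(uuid)
```

Swapping the two arguments gives the opposite difference, i.e. records that are in SOLR but no longer in Cassandra, which would correspond to the "red" case in the summary view.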