Closed djtfmartin closed 1 year ago
The summary view starts by querying SOLR in ALA and then checks GBIF for each data resource.
Looks like this problem only affects the following 4 data resources:
which I think we dynamically (periodically) harvest, so perhaps it's just a timing issue with index generation.
The archives are only created once a month, whereas data resources and Solr are updated more frequently. However, we have a single executor/agent in Jenkins for executing the Solr index generation, the data loading, and the archive generation, so it shouldn't be an issue with overlapping operations. We have stayed on that single executor model for the sole reason of avoiding inconsistencies, as it otherwise limits us to running a single operation at a time.
The first place to look would probably be the index creation, using the Jenkins logs on cassandra-b4. The index generation process is generally very lenient with errors and silently ignores them, but it does sometimes log them, so you may be able to find the errors in the logs.
Otherwise, you may need to do a UUID comparison, but given that GBIF don't have the actual occurrenceIDs, you will need to do a dump of the ALA Cassandra database to identify the original record primary keys in each case, based on the UUIDs that GBIF have and ALA Solr currently doesn't. Then you would need to try reindexing those records individually, using some verbose error mode or a debugger, to see if there are errors.
However, given how lenient biocache-store is with errors during operations, I am not surprised that there are differences in the numbers. It is a product of its development timeline and architecture. We have a project, yet to be scheduled, to split the components of biocache-store out into separate maintainable libraries so we can start adding test cases for each part and allow more flexible architecture choices, but it hasn't got past initial conception at this stage.
GBIF have the occurrenceIDs that we supplied - i.e. the UUIDs we create and store in Cassandra. These are in the downloads from GBIF (and are visible on GBIF occurrence record pages), so they can be used to identify the records that were supplied to them but haven't been indexed by us.
Besides, this is just an ALA problem with indexing the data in our own DB. We can just export UUIDs from the SOLR index and do a comparison with the export from Cassandra to work out which records haven't been indexed.
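The comparison described here could be sketched roughly as follows. This is only an illustration, not an existing biocache tool: it assumes both exports are plain text files with one UUID per line (optionally with a `uuid` header row), and the function names are made up for the example.

```python
from pathlib import Path


def read_uuids(path):
    """Read one UUID per line into a set (assumed file layout, not a biocache guarantee)."""
    uuids = set()
    for line in Path(path).read_text().splitlines():
        # Normalise: trim whitespace and surrounding quotes, compare case-insensitively.
        value = line.strip().strip('"').lower()
        if value and value != "uuid":  # skip blank lines and an optional header row
            uuids.add(value)
    return uuids


def missing_from_index(cassandra_path, solr_path):
    """Return, sorted, the UUIDs in the Cassandra export that are absent from the SOLR export."""
    return sorted(read_uuids(cassandra_path) - read_uuids(solr_path))
```

For example, `missing_from_index("/tmp/dr2287-uuid.txt", "/tmp/dr2287-solr-uuid.txt")` would list the records to try reindexing, where the second path is a hypothetical export of UUIDs from the SOLR index. Holding both sets in memory should be fine for a single data resource, but that is an assumption about export sizes.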
For a number of data resources, the GBIF DwCA exports from Cassandra contain a different number of records from what we currently have indexed in SOLR.
Examples:
I ran the following to determine what we have in the DB:
biocache export -c uuid -s 'dr2287|' -e 'dr2287|~' occ /tmp/dr2287-uuid.txt
biocache export -c uuid -s 'dr340|' -e 'dr3407|~' occ /tmp/dr340-uuid.txt
The counts from these exports match the numbers GBIF are seeing. The difference is usually that GBIF have more records.
So my guess is we have a SOLR indexing bug.
There's a breakdown of the numbers here:
http://s3-eu-west-1.amazonaws.com/tim-oz-datasets/index-oz.html
cc @timrobertson100 @ansell @sadeghim