Open · sckott opened this issue 3 years ago

An `rgbif` user reported to me finding duplicate records when using the `/occurrence/search` route. The query they shared that produced duplicates was https://api.gbif.org/v1/occurrence/search?hasCoordinate=true&hasGeospatialIssue=false&scientificName=Mentha%20arvensis, paginated over many requests to retrieve 50,000 results.

One possibility is that `rgbif` is mishandling pagination and sending the exact same request multiple times, or at least requests with overlapping pagination parameters, but I don't think that's the case, because the user said they sometimes get duplicates and sometimes don't with the same query. So my guess is that duplicate records are sometimes returned from the `/occurrence/search` route. Is that possible? The duplicates (see the attachment) are identical in every field, not just in their keys, which is probably meaningful.

However it happens, I guess in `rgbif` and `pygbif` we could automatically remove complete duplicates for the user. I'm a bit hesitant to do that, but if we don't, users may be very surprised to find duplicates and not know what to do about them.

Any ideas?

dups.txt
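For reference, a rough sketch of the kind of check a client could run while paginating (the `limit`/`offset` parameters and the `results`/`key`/`endOfRecords` fields are how I understand the GBIF search response; treat this as illustrative, not as rgbif/pygbif code):

```python
# Page through the reported query and flag occurrence keys seen more than once.
import requests

BASE = "https://api.gbif.org/v1/occurrence/search"
params = {
    "hasCoordinate": "true",
    "hasGeospatialIssue": "false",
    "scientificName": "Mentha arvensis",
    "limit": 300,  # GBIF's maximum page size
}

seen, duplicates = set(), []
for offset in range(0, 50000, params["limit"]):
    page = requests.get(BASE, params={**params, "offset": offset}).json()
    for rec in page.get("results", []):
        if rec["key"] in seen:
            duplicates.append(rec["key"])
        seen.add(rec["key"])
    if page.get("endOfRecords"):
        break

print(f"{len(duplicates)} duplicated occurrence keys")
```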
I couldn't replicate this issue; I ran several tests with the search parameters reported here. I assume this can happen when the Elasticsearch index is being updated or rebalanced, so that the order of elements in search results changes from page to page.
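If that is the cause, one mitigation on our side could be Elasticsearch's point-in-time (PIT) readers with `search_after`, which pin a search to a fixed view of the index. A sketch only, assuming ES 7.10+; the host, index name, and query field below are placeholders, not our actual setup:

```python
# PIT pagination reads from a fixed snapshot, so concurrent index updates
# cannot shift page boundaries mid-harvest.
import requests

ES = "http://localhost:9200"  # placeholder host

# Open a PIT against a (hypothetical) occurrence index.
pit = requests.post(f"{ES}/occurrence/_pit?keep_alive=1m").json()["id"]

search_after, hits = None, []
while True:
    body = {
        "size": 300,
        "query": {"match": {"scientificName": "Mentha arvensis"}},
        "pit": {"id": pit, "keep_alive": "1m"},
        "sort": [{"_shard_doc": "asc"}],  # stable tiebreaker within a PIT
    }
    if search_after:
        body["search_after"] = search_after
    page = requests.post(f"{ES}/_search", json=body).json()["hits"]["hits"]
    if not page:
        break
    hits.extend(page)
    search_after = page[-1]["sort"]

requests.delete(f"{ES}/_pit", json={"id": pit})  # release the PIT
```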
Would it be practical to have some way to tell the user this has happened? For example, some sort of hash in the response, calculated from all applicable index shards, which would change if the content of one or more shards were changed.
The user would need to restart their query, but that is better than having duplicate (and missing) data.
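Roughly what I have in mind on the client side; a sketch only, and `indexHash` is a hypothetical response field, not something the API returns today:

```python
# If the server exposed a hash of the index state, a client could restart
# pagination from offset 0 whenever the index changes mid-harvest.
import requests

BASE = "https://api.gbif.org/v1/occurrence/search"

def harvest(params, limit=300, max_records=50000, max_restarts=5):
    for _ in range(max_restarts):
        records, expected_hash = [], None
        for offset in range(0, max_records, limit):
            page = requests.get(BASE, params={**params, "limit": limit,
                                              "offset": offset}).json()
            if expected_hash is None:
                expected_hash = page.get("indexHash")  # hypothetical field
            elif page.get("indexHash") != expected_hash:
                break  # index changed underneath us; restart from scratch
            records.extend(page.get("results", []))
            if page.get("endOfRecords"):
                return records
        else:
            return records  # reached max_records with a stable hash
    raise RuntimeError("index kept changing; giving up")
```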
Apologies this isn't reproducible.
> some sort of hash in the response,
@MattBlissett I assume you're suggesting the API would include a hash in the response? Or are you saying we should calculate a hash client side?
> The user would need to restart their query
Would you prefer they run a new query, or would it be better for us to automatically remove duplicates, or to provide tools for the user to identify and remove them?
I was really asking Fede (our search expert); I don't know what is practical. If a hash were calculated server side, the client would be able to restart the query when the hash changed.
The client shouldn't remove duplicates: if there are duplicates, then there are also missing records, so it's better to start again. I think this is what is happening:
- Records in index: A B C G H I J K L M N O
- Query results with page size 3, first two pages: ABC, GHI
- Dataset updated; records in updated index: A B C D E F G H I J K L
- Querying continues from record 7: GHI, JKL. G, H, I are duplicates, and the new records D, E, F are missing.
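A few lines of Python to replay that sequence (illustration only):

```python
# Simulate the index changing between page 2 and page 3 of the harvest.
old_index = list("ABCGHIJKLMNO")
new_index = list("ABCDEFGHIJKL")
page_size = 3

def page(index, n):
    """Return the n-th page (0-based) of the index."""
    return index[n * page_size:(n + 1) * page_size]

fetched = page(old_index, 0) + page(old_index, 1)   # ABC, GHI before the update
fetched += page(new_index, 2) + page(new_index, 3)  # GHI, JKL after the update

print("duplicates:", sorted({r for r in fetched if fetched.count(r) > 1}))
print("missing:", sorted(set(new_index) - set(fetched)))
# duplicates: ['G', 'H', 'I']
# missing: ['D', 'E', 'F']
```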
Okay, makes sense. I see now why you said there would be missing records too; good reason not to do this client side.