gbif / occurrence

Occurrence store, download, search
Apache License 2.0

Duplicate records in /occurrence/search #230

Open sckott opened 3 years ago

sckott commented 3 years ago

An rgbif user reported to me finding duplicate records using the /occurrence/search route. The query they shared that produced the duplicates was:

https://api.gbif.org/v1/occurrence/search?hasCoordinate=true&hasGeospatialIssue=false&scientificName=Mentha%20arvensis

paginated over many requests to retrieve 50,000 results.
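For concreteness, the harvest is roughly equivalent to this Python sketch (offset/limit paging against the search route; the page size of 300 and the 50,000 cap are just for illustration):

```python
import requests

BASE = "https://api.gbif.org/v1/occurrence/search"
params = {
    "hasCoordinate": "true",
    "hasGeospatialIssue": "false",
    "scientificName": "Mentha arvensis",
    "limit": 300,  # page size used for each request
}

records = []
offset = 0
while offset < 50000:
    page = requests.get(BASE, params={**params, "offset": offset}).json()
    records.extend(page["results"])  # duplicates accumulate here
    if page.get("endOfRecords"):
        break
    offset += params["limit"]
```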

One possibility is that rgbif is somehow mishandling pagination and issuing the exact same requests multiple times, or at least requests with overlapping pagination parameters. I don't think that's the case, though, because they said they sometimes get duplicates and sometimes don't with the same query.

So my guess is that sometimes duplicate records are returned from the /occurrence/search route. Is that possible?

The duplicates (see the attachment) are identical in every field, not just their keys, which is probably meaningful.

However it happens, I suppose in rgbif and pygbif we could automatically remove complete duplicates for the user (sketched below). I'm a bit hesitant to do that, but if we don't, users may be very surprised to find duplicates and not know what to do about them.

dups.txt
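The deduplication I have in mind is just fingerprinting whole records, something like this sketch (not an implemented rgbif/pygbif feature):

```python
import json

def drop_exact_duplicates(records):
    """Keep the first occurrence of each record; drop any later record
    that is identical in every field."""
    seen, unique = set(), []
    for rec in records:
        fingerprint = json.dumps(rec, sort_keys=True)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(rec)
    return unique
```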

sckott commented 3 years ago

Any ideas?

fmendezh commented 3 years ago

I couldn't replicate this issue; I ran several tests with the search parameters reported here. I assume this can actually happen when the Elasticsearch index is being updated or rebalanced, since the order of elements in the search results can then change from page to page.

MattBlissett commented 3 years ago

Would it be practical to have some way to tell the user this has happened? For example, some sort of hash in the response, calculated from all applicable index shards, which would change if the content of one or more shards were changed.

The user would need to restart their query, but that is better than having duplicate (and missing) data.
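As a sketch of what the client side could look like, assuming the response carried such a hash (the indexHash field below is invented for illustration; nothing like it exists in the current API):

```python
import requests

BASE = "https://api.gbif.org/v1/occurrence/search"

def harvest(params, limit=300):
    """Page through all results; restart from offset 0 if the index
    hash changes between pages."""
    while True:                      # one full harvest attempt per pass
        records, offset, seen_hash = [], 0, None
        while True:
            page = requests.get(
                BASE, params={**params, "offset": offset, "limit": limit}
            ).json()
            # NOTE: "indexHash" is invented for this sketch; the real API
            # response has no such field today.
            if seen_hash is None:
                seen_hash = page.get("indexHash")
            elif page.get("indexHash") != seen_hash:
                break                # index changed under us: restart harvest
            records.extend(page["results"])
            if page.get("endOfRecords"):
                return records
            offset += limit
```

One drawback: a frequently-updated index could change often enough that very long harvests rarely complete without restarting.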

sckott commented 3 years ago

Apologies this isn't reproducible.

"some sort of hash in the response"

@MattBlissett I assume you're suggesting the server would include a hash in the response? Or are you saying we should calculate a hash client side?

"The user would need to restart their query"

Do you prefer that they run a new query, or would it be better to automatically remove duplicates - or to provide tools for users to identify and remove them?

MattBlissett commented 3 years ago

I was really asking Fede (our search expert); I don't know what is practical. If a hash were calculated server side, the client would be able to restart the query whenever the hash changed.

The client shouldn't remove duplicates -- if there are duplicates, then there are also missing records, so it's better to start again. I think this is what is happening:

Records in index: A B C G H I J K L M N O

Query results, page size 3: ABC GHI

Dataset updated, records in updated index: A B C D E F G H I J K L

Querying continues from record 7: GHI JKL — G H I are duplicates, and the new records D E F are missing.
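The same scenario as a tiny Python simulation:

```python
old_index = list("ABCGHIJKLMNO")  # index before the dataset update
new_index = list("ABCDEFGHIJKL")  # index after the update

harvested = old_index[0:3] + old_index[3:6]    # pages 1-2: A B C, G H I
# the update lands here; the client keeps paging by offset as if nothing changed
harvested += new_index[6:9] + new_index[9:12]  # pages 3-4: G H I, J K L

assert harvested.count("G") == 2  # G, H, I are harvested twice
assert "D" not in harvested       # D, E, F are never harvested
```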

sckott commented 3 years ago

Okay, makes sense. I see now why you said there would be missing records too - a good reason not to do this client side.