ThreeSixtyGiving / datastore

A Data Store application for 360Giving
GNU Affero General Public License v3.0

Duplicate grants in grantnav after data load #11

Closed drkane closed 4 years ago

drkane commented 5 years ago

The issue that was mentioned previously in a plan.io thread has recurred - we've noticed it a couple of times (and @KDuerden found it this morning).

A number of grants seem to be duplicated - this is particularly noticeable in the Big Lottery Fund data but may be occurring for other funders.

As of now, the latest grantnav is showing 393,000 grants, compared to 335,000 on the currently live version (there are some differences in publishers but not enough to make that difference). Here's an example of a grant that is showing as duplicate: http://latest.es7.grantnav-dev.default.threesixtygiving.uk0.bigv.io/grant/360G-blf-0030045912

(Screenshot of the grant: Screenshot_2019-09-23 360Giving GrantNav)

Interestingly, the grant that I used as an example of the issue in the plan.io thread is now fine, so the issue does not seem to affect the same grants consistently.

Not sure at what point the issue occurs - in the datastore load, the file generated for grantnav or the process of uploading to grantnav.

KDuerden commented 4 years ago

379,163 grants this morning, with Big Lottery on 245,589 rather than 202,851. No dupe for the particular grant record above though.

ETA: each time I've tested (by comparing live Sept GN vs Live_Dev GN), the only difference not explained by known changes to the files/publishers' data has been in the Big Lottery data. I think it might be localised.

michaelwood commented 4 years ago

Hmm, interesting. I downloaded the dataset from the 23rd (thanks for the screenshot!) and wrote a quick script to test for unique IDs. From those results there don't appear to be any duplicates in the data itself (though they clearly show up in the interface). I am wondering if this is caused by a race condition when refreshing the data. I've now changed the way it deletes the old data, as my theory is that it might only have got part of the way through the deletion before starting to reload again. If that doesn't fix it, the next thing I'll look at is whether this is a regression caused by the Elastic 7 changes.
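
For anyone wanting to repeat that kind of check, here is a minimal sketch of a uniqueness test. It is not the script actually used; it assumes the downloaded file is JSON with a top-level `grants` list in which each grant has an `id` field, and the function names `load_grants` and `find_duplicate_ids` are made up for illustration.

```python
import json
from collections import Counter


def load_grants(path):
    """Load the list of grants from a JSON file.

    Assumption: the file is a JSON object with a top-level "grants" list.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)["grants"]


def find_duplicate_ids(grants):
    """Return {grant_id: count} for every id that appears more than once."""
    counts = Counter(grant["id"] for grant in grants)
    return {grant_id: n for grant_id, n in counts.items() if n > 1}
```

Calling `find_duplicate_ids(load_grants("grants.json"))` (hypothetical filename) returns an empty dict when every ID in the file is unique, which would point to the duplication being introduced after the file is generated rather than in the data itself.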

KDuerden commented 4 years ago

thanks @michaelwood. BLF data still showing inflated grant numbers - 248,089 today!

drkane commented 4 years ago

The issue seems to have recurred today.

There's a related issue in the opposite direction too - where a funder has (incorrectly) reused the same grant ID, in the live grantnav this is shown as a duplicate, whereas in the es7 version it's not. This was seen yesterday, but isn't currently replicated:

The correct behaviour for GrantNav in this case is to show both grants with that ID, because we can't choose which to show.

drkane commented 4 years ago

This issue seems to be recurring again today: BLF showing 245,089 on http://latest.es7.grantnav-dev.default.threesixtygiving.uk0.bigv.io and 202,851 on https://grantnav.threesixtygiving.org

robredpath commented 4 years ago

We discussed this last week when we met and on a call with the developers this morning. There are two things here:

KDuerden commented 4 years ago

thanks @robredpath. Re: the genuine dupes - we'd always consider these as mistakes and aim to work with publishers to correct them when we find them. In current GN I thought a grant with a dupe ID is only counted once - is that behaviour different in the dev GN?

ETA ignore me, it's the other way around in current GN!

michaelwood commented 4 years ago

Closing for now as we've not seen this issue reoccurring. It should also now be simple to avoid this happening before the data reaches GrantNav if it does happen again.