Closed: jbrown-xentity closed this issue 1 year ago
Should review #3742 after resolution of this ticket, as that may be resolved as well.
The following SQL script picks up the duplicates (from https://github.com/GSA/data.gov/issues/3567)
```sql
SELECT "group".name, COUNT(*)
FROM package
JOIN "group" ON package.owner_org = "group".id
LEFT JOIN harvest_object ON package.id = harvest_object.package_id
WHERE package.state = 'active'
  AND package.type = 'dataset'
  AND harvest_object.package_id IS NULL
GROUP BY 1
ORDER BY 2 DESC;
```
10-17-2022

```
name | count
-----------------------------------------------+-------
doc-gov | 23425
ca-gov | 12455
noaa-gov | 10869
city-of-ferndale-michigan | 795
usaid-gov | 511
city-of-seattle | 494
state-of-connecticut | 474
lake-county-illinois | 348
vcgi-org | 186
loudoun-county-virginia | 147
hhs-gov | 138
city-of-baltimore | 111
state-of-oklahoma | 100
ed-gov | 90
doe-gov | 64
usgs-gov | 54
city-of-new-york | 40
sba-gov | 36
national-institute-of-standards-and-technology | 35
wake-county | 33
epa-gov | 33
doi-gov | 30
federal-laboratory-consortium | 29
city-of-sioux-falls | 14
dot-gov | 13
city-of-austin | 11
usitc-gov | 6
city-of-tempe | 5
king-county-washington | 5
state-of-new-york | 5
centers-for-disease-control-and-prevention | 4
fema-gov | 3
usda-gov | 3
city-of-los-angeles | 2
state-of-maryland | 2
city-of-bloomington | 2
va-gov | 2
state-gov | 1
rrb-gov | 1
census-gov | 1
doj-gov | 1
city-of-baton-rouge | 1
fcc-gov | 1
```
The following SQL updates the package state:

```sql
UPDATE package
SET state = 'to_delete'
WHERE id IN (
    SELECT package.id
    FROM package
    JOIN "group" ON package.owner_org = "group".id
    LEFT JOIN harvest_object ON package.id = harvest_object.package_id
    WHERE package.state = 'active' AND package.type = 'dataset'
      AND harvest_object.package_id IS NULL
      AND "group".name IN (<group_names>)
);
```
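When running that UPDATE for several organizations, the `<group_names>` list has to be substituted in. A small hypothetical helper (not part of the thread) sketches how to build the statement with the group names bound as query parameters instead of pasted into the SQL string:

```python
def build_update(group_names):
    """Build the 'to_delete' UPDATE with a parameterized IN (...) clause.

    Hypothetical helper: group names are returned as bind parameters
    (psycopg2-style %s placeholders) rather than interpolated into SQL.
    """
    placeholders = ", ".join(["%s"] * len(group_names))
    sql = (
        "UPDATE package SET state = 'to_delete' WHERE id IN ("
        "SELECT package.id FROM package "
        'JOIN "group" ON package.owner_org = "group".id '
        "LEFT JOIN harvest_object ON package.id = harvest_object.package_id "
        "WHERE package.state = 'active' AND package.type = 'dataset' "
        "AND harvest_object.package_id IS NULL "
        f'AND "group".name IN ({placeholders}))'
    )
    return sql, list(group_names)

sql, params = build_update(["doc-gov", "ca-gov", "noaa-gov"])
# sql now contains one %s placeholder per group name;
# pass (sql, params) to cursor.execute() on a psycopg2-style driver.
```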
Clear the harvest source in the sandbox test org. Run the db-sync-solr job manually to clear Solr.
Will do the following cleanup today:
10-18-2022

```
doc-gov  | 23425
ca-gov   | 12455
noaa-gov | 10869
```
10-24-2022: There is one new duplicate in dhs-gov today.
Just cleaned up duplicates for ca-gov; it only took about 4 minutes for 12455 records with the new deletion method (defer the commit to the end).
That is 50/sec for deletions, faster than the 10/sec for adding/updating. What makes the speed difference?
The new delete function uses a single Solr connection for all deletions, while adding/updating opens a new connection for each call.
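The actual cleanup runs against Solr, but the batching principle (one connection, one commit deferred to the end, instead of per-record round trips) can be illustrated with a stand-in sqlite3 example:

```python
import sqlite3

# Stand-in illustration (not the actual data.gov Solr code): delete many
# records over one connection and commit once at the end, rather than
# opening a connection and committing per record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE package (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO package (id, state) VALUES (?, 'to_delete')",
    [(i,) for i in range(12455)],
)
conn.commit()

# One connection, one deferred commit for the whole batch of deletions.
conn.executemany(
    "DELETE FROM package WHERE id = ?",
    [(i,) for i in range(12455)],
)
conn.commit()  # deferred to the end

remaining = conn.execute("SELECT COUNT(*) FROM package").fetchone()[0]
print(remaining)  # 0
```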
The following duplicates have also been cleared: doc-gov (23425) and noaa-gov (10869).
So there are no duplicates in the DB as of today. Will continue to monitor for a couple of days to see if we get new duplicates.
Checked for duplicates today; no new items returned:
```sql
SELECT "group".name, COUNT(*)
FROM package
JOIN "group" ON package.owner_org = "group".id
LEFT JOIN harvest_object ON package.id = harvest_object.package_id
WHERE package.state = 'active'
  AND package.type = 'dataset'
  AND harvest_object.package_id IS NULL
GROUP BY 1
ORDER BY 2 DESC;
```
```
 name | count
------+-------
(0 rows)
```
To eliminate packages that have no current harvest_object, we can use this query.
```sql
SELECT "group".name, COUNT(*)
FROM package
JOIN "group" ON package.owner_org = "group".id
LEFT JOIN harvest_object ON package.id = harvest_object.package_id AND harvest_object.current
WHERE package.state = 'active'
  AND package.type = 'dataset'
  AND harvest_object.package_id IS NULL
GROUP BY 1
ORDER BY 2 DESC;
```
```
   name    | count
-----------+-------
 noaa-gov  | 37695
 usda-gov  |    27
 epa-gov   |    25
 usaid-gov |     8
 fgdc-gov  |     1
 dot-gov   |     1
 doi-gov   |     1
(7 rows)
```
The query result now returns 0 rows. This ticket can be closed.
I might have confused https://github.com/GSA/data.gov/issues/3567 with this ticket. Or maybe the same thing happened twice? Even having 1.6 years of experience on data.gov, I wouldn't touch the production DB without Fuhu around. @Jin-Sun-tts did it twice!
This ticket removed the bad data from being searchable or discoverable by users. But it is still in the system (hence https://github.com/GSA/data.gov/issues/3999). As @FuhuXia mentioned above, the validating query above shows success for this ticket.
Since we re-implemented the db-solr sync, we found that we have data in a bad state sitting in prod (not indexed in Solr, but still valid). We need to clear this bad data.
### How to reproduce

Run the duplicate-detection query above.

### Expected behavior

The above query should return 0 datasets.

### Actual behavior

Thousands of datasets are returned.
### Sketch
The following organizations have duplicates (this may be affecting all of them, or just some); they are sorted by count, highest first:
The process to follow for each organization:
```sql
SELECT id FROM "group" WHERE name = 'doc-gov';
```
Update `org-id` and `harvest-source-id` with the values above.

Please note that the above marking of `to_delete` is to match ckanext-harvest clearing. We also need to delete the records in SOLR, so we need to do a full harvest clear. An alternative approach would be manually running the db-solr-sync job after deleting the records, and validating that the job removed the records from solr...
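The validation step at the end can be sketched as a simple set comparison. This is a hypothetical helper (not from the thread); it assumes you have already exported the active package ids from the DB and the indexed ids from Solr:

```python
def find_out_of_sync(db_ids, solr_ids):
    """Report ids that are out of sync between the DB and Solr.

    Hypothetical helper: ids present in Solr but not in the DB should be
    removed from the index; ids present in the DB but not in Solr are
    valid records that users cannot discover via search.
    """
    db_ids, solr_ids = set(db_ids), set(solr_ids)
    return {
        "in_solr_not_db": sorted(solr_ids - db_ids),
        "in_db_not_solr": sorted(db_ids - solr_ids),
    }

report = find_out_of_sync({"a", "b", "c"}, {"b", "c", "d"})
print(report)  # {'in_solr_not_db': ['d'], 'in_db_not_solr': ['a']}
```

After a successful db-solr-sync run, both lists should be empty.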