Clear bad production data

jbrown-xentity commented 2 years ago

Since we re-implemented the db-solr sync, we found that we have data in a bad state sitting in prod (not indexed on solr, but still valid). We need to clear this bad data

How to reproduce

Examine duplicate records that don't have a correct/corresponding record in harvest_object table. A full table scan doesn't seem to complete, but you can look by harvest source:

SELECT COUNT(*), max(package.metadata_created) 
FROM package 
LEFT JOIN "group" ON package.owner_org = "group".id 
WHERE "group".name = 'dhs-gov' 
AND package.type = 'dataset' 
AND package.id NOT IN (SELECT package_id FROM harvest_object WHERE harvest_source_id = '803bdba9-bfcb-453c-ae2a-ed81f240ff5a' AND current);

Expected behavior

The above query should result in 0 datasets

Actual behavior

Thousands

Sketch

The following organizations have duplicates, this may be affecting all of them (or just some), and they are sorted in highest value order:

[x] noaa-gov
[x] doi-gov
[x] doc-gov
[x] ca-gov
[x] dhs-gov
[x] city-of-ferndale-michigan
[x] lake-county-illinois
[x] hhs-gov
[x] state-of-oklahoma
[x] ed-gov
[x] usaid-gov
[x] city-of-new-york
[x] doe-gov
[x] epa-gov
[x] federal-laboratory-consortium
[x] state-of-connecticut
[x] usitc-gov
[x] king-county-washington
[x] city-of-tempe
[x] state-of-new-york
[x] national-institute-of-standards-and-technology
[x] centers-for-disease-control-and-prevention
[x] dot-gov
[x] city-of-baltimore
[x] city-of-austin
[x] usda-gov
[x] city-of-bloomington
[x] city-of-los-angeles
[x] va-gov
[x] state-of-maryland
[x] fcc-gov
[x] rrb-gov
[x] city-of-baton-rouge
[x] fgdc-gov
[x] doj-gov

The process to follow for each organization:

Find the offending harvest source (if more than one, will involve more analysis)
Get the harvest source id (via UI)
Get the organization id: SELECT id FROM "group" WHERE name = 'doc-gov';
If more than one harvest source for the org, need to update the following query

Replace the following query values org-id and harvest-source-id with the values above

UPDATE package 
SET state = 'to_delete' 
WHERE package.owner_org = 'org-id' 
AND package.type = 'dataset' 
AND package.id NOT IN (
SELECT package_id 
FROM harvest_object 
WHERE harvest_source_id = 'harvest-source-id' 
AND current
);

Clear the harvest source (either via UI or via run-task)
Re-Harvest the source

Please note that the above marking to_delete is to match ckanext-harvest clearing. We also need to delete the records in SOLR, so we need to do a full harvest clear. An alternative approach would be manually running the db-solr-sync job after deleting the records, and validating that the job removed the records from solr...

jbrown-xentity commented 2 years ago

Should review #3742 after resolution of this ticket, as that may be resolved as well.

Jin-Sun-tts commented 2 years ago

The following SQL script picks up the duplicates (from https://github.com/GSA/data.gov/issues/3567)

SELECT "group".name, COUNT(*) FROM package JOIN "group" ON package.owner_org = "group".id LEFT JOIN harvest_object ON package.id = harvest_object.package_id WHERE package.state='active' AND package.type='dataset' AND harvest_object.package_id IS NULL GROUP BY 1 ORDER BY 2 DESC ;

10-17-2022

doc-gov                                        | 23425
 ca-gov                                         | 12455
 noaa-gov                                       | 10869
 city-of-ferndale-michigan                      |   795
 usaid-gov                                      |   511
 city-of-seattle                                |   494
 state-of-connecticut                           |   474
 lake-county-illinois                           |   348
 vcgi-org                                       |   186
 loudoun-county-virginia                        |   147
 hhs-gov                                        |   138
 city-of-baltimore                              |   111
 state-of-oklahoma                              |   100
 ed-gov                                         |    90
 doe-gov                                        |    64
 usgs-gov                                       |    54
 city-of-new-york                               |    40
 sba-gov                                        |    36
 national-institute-of-standards-and-technology |    35
 wake-county                                    |    33
 epa-gov                                        |    33
 doi-gov                                        |    30
 federal-laboratory-consortium                  |    29
 city-of-sioux-falls                            |    14
 dot-gov                                        |    13
 city-of-austin                                 |    11
 usitc-gov                                      |     6
 city-of-tempe                                  |     5
 king-county-washington                         |     5
 state-of-new-york                              |     5
 centers-for-disease-control-and-prevention     |     4
 fema-gov                                       |     3
 usda-gov                                       |     3
 city-of-los-angeles                            |     2
 state-of-maryland                              |     2
 city-of-bloomington                            |     2
 va-gov                                         |     2
 state-gov                                      |     1
 rrb-gov                                        |     1
 census-gov                                     |     1
 doj-gov                                        |     1
 city-of-baton-rouge                            |     1
 fcc-gov                                        |     1

Following sql to update the package state:

UPDATE package 
SET state = 'to_delete' 
where id in (
    SELECT package.id
    FROM package
    JOIN "group" ON package.owner_org = "group".id 
    LEFT JOIN harvest_object ON package.id = harvest_object.package_id 
    WHERE package.state='active' AND package.type='dataset'   
    AND harvest_object.package_id IS NULL  
    AND "group".name in (<group_names>)
);

Clear the harvest source in the sandbox test org. Run db-sync-solr job manually to clear solr.

Jin-Sun-tts commented 2 years ago

Will do following cleanup today:

10-18-2022

 doc-gov                                        | 23425
 ca-gov                                         | 12455
 noaa-gov                                       | 10869

Jin-Sun-tts commented 2 years ago

10-24-2022 There is new one duplicate in dhs-gov today.

Jin-Sun-tts commented 2 years ago

just cleaned up duplicates for ca-gov, it only took about 4 min for 12455 records with new deletion method (defer the commit to the end).

FuhuXia commented 2 years ago

just cleaned up duplicates for ca-gov, it only took about 4 min for 12455 records with new deletion method (defer the commit to the end).

That is 50/sec on deleting, faster than 10/sec adding/updating. What makes the speed difference?

Jin-Sun-tts commented 2 years ago

The new delete function, only has one solr connection for all deletions. And the adding/updating has new connection for each call.

Jin-Sun-tts commented 2 years ago

The following duplicates are also be cleared : doc-gov 23425 noaa-gov 10869

So there is no duplicates in DB as of today. Will continue monitor for couple days to see if we get new duplicates.

Jin-Sun-tts commented 1 year ago

checked the duplicate today, no new item returned

SELECT "group".name, COUNT(*) FROM package JOIN "group" ON package.owner_org = "group".id LEFT JOIN harvest_object ON package.id = harvest_object.package_id WHERE package.state='active' AND package.type='dataset' AND harvest_object.package_id IS NULL GROUP BY 1 ORDER BY 2 DESC ;

name | count ------+------- (0 rows)

FuhuXia commented 1 year ago

To eliminate packages that have no current harvest_object, we can use this query.

SELECT "group".name, COUNT(*)
FROM package
JOIN "group" ON package.owner_org = "group".id 
LEFT JOIN harvest_object ON package.id = harvest_object.package_id AND harvest_object.current
WHERE package.state='active'
AND package.type='dataset'
AND harvest_object.package_id IS NULL
GROUP BY 1 ORDER BY 2 DESC;

   name    | count
-----------+-------
 noaa-gov  | 37695
 usda-gov  |    27
 epa-gov   |    25
 usaid-gov |     8
 fgdc-gov  |     1
 dot-gov   |     1
 doi-gov   |     1
(7 rows)

FuhuXia commented 1 year ago

query result returns 0. This ticket can be closed.

nickumia-reisys commented 1 year ago

I might have confused https://github.com/GSA/data.gov/issues/3567 with this ticket. Or maybe the same thing happened twice? Even having 1.6 years of experience on data.gov, I wouldn't touch the production DB without Fuhu around. @Jin-Sun-tts did it twice!

This ticket removed the bad data from being searchable or discoverable by users. But it is still in the system (hence https://github.com/GSA/data.gov/issues/3999). As @FuhuXia mentioned above, the validating query above shows success for this ticket.

GSA / data.gov