GSA / data.gov

Main repository for the data.gov service
https://data.gov

db-solr-sync gives false positives on "Packages without harvest_object" #4748

Closed FuhuXia closed 1 month ago

FuhuXia commented 1 month ago

db-solr-sync seems to give false positives in the "Packages without harvest_object" count.

Found the issue while doing a Solr clear and reindex on the current catalog-dev. It reports two positives, but the datasets look fine. Another look at the current production daily report (~500 positives) suggests it has the same issue.

FuhuXia commented 1 month ago

It turns out db-solr-sync was doing the right thing in capturing packages without harvest_objects. The apparent false positives were actually duplicates borrowing other packages' harvest_objects.

btylerburton commented 1 month ago

> Another look at the current production daily report (~ 500 positives) seems to have the same issue.

Were all the duplicates false positives?

FuhuXia commented 1 month ago

Nope. They are indeed duplicates, and the duplicate packages are using harvest_objects that do not belong to them, which is why db-solr-sync also caught them. We can run either the de-dupe process or the db-solr-sync task to fix them.
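
For anyone reading later, here is a rough sketch of the distinction described above: a package with no harvest_object at all versus a duplicate that points at a harvest_object created for a different package. This is not the actual db-solr-sync code. The `Package` and `HarvestObject` classes, their fields, and the `classify` helper are all hypothetical and only illustrate why both cases can land in the same count.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class HarvestObject:
    # Hypothetical, simplified record: which package this object was created for.
    id: str
    package_id: str


@dataclass(frozen=True)
class Package:
    # Hypothetical, simplified record: which harvest_object the package points at.
    id: str
    name: str
    harvest_object_id: Optional[str]


def classify(
    packages: List[Package], harvest_objects: List[HarvestObject]
) -> Tuple[List[str], List[str]]:
    """Split suspicious packages into 'missing' and 'borrowed' lists of names."""
    # Map each harvest_object id to the package that owns it.
    owners = {ho.id: ho.package_id for ho in harvest_objects}
    missing, borrowed = [], []
    for pkg in packages:
        owner = owners.get(pkg.harvest_object_id)
        if owner is None:
            # No harvest_object, or it points at one that does not exist.
            missing.append(pkg.name)
        elif owner != pkg.id:
            # The harvest_object exists but belongs to another package.
            borrowed.append(pkg.name)
    return missing, borrowed


# Made-up sample data, for illustration only.
harvest_objects = [HarvestObject("ho-1", "pkg-1")]
packages = [
    Package("pkg-1", "original", "ho-1"),
    Package("pkg-2", "original-duplicate", "ho-1"),
    Package("pkg-3", "orphan", None),
]

missing, borrowed = classify(packages, harvest_objects)
print("missing:", missing)
print("borrowed:", borrowed)
```

In this toy data, `original-duplicate` has no harvest_object of its own, so a check for "packages without harvest_object" would flag it even though the dataset page looks fine.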