Closed philipashlock closed 9 years ago
Not sure how it happened. I can't blame the code, because if we start everything fresh for the same harvest source, the accessURL is harvested fine.
If we already have a list of the datasets to be corrected, we can use the API to reset the source_hash for the listed datasets; a reharvest will then correct the error.
If we don't have the list yet, we should be able to run a query on the db server to list all the problematic datasets, then reset source_hash and reharvest them.
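A minimal sketch of the reset step, assuming source_hash is stored as a key/value entry in the dataset's `extras` list (the usual CKAN package dict layout). The helper name `with_reset_source_hash` is hypothetical, and how the modified dict is written back (for example through the CKAN `package_update` action) is left out because it depends on the deployment:

```python
import copy


def with_reset_source_hash(package):
    """Return a copy of a CKAN-style package dict with source_hash set to "".

    Assumption: source_hash lives in package["extras"] as
    {"key": "source_hash", "value": "..."}.
    """
    updated = copy.deepcopy(package)  # leave the caller's dict untouched
    extras = updated.setdefault("extras", [])
    for extra in extras:
        if extra.get("key") == "source_hash":
            extra["value"] = ""
            break
    else:
        # No source_hash entry yet: add an empty one.
        extras.append({"key": "source_hash", "value": ""})
    return updated
```

An empty source_hash no longer matches the hash of the source record, which is why the next harvest job treats the dataset as changed and rebuilds it.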
@FuhuXia Perhaps this should be filtered by harvest source type, but you can just use the num_resources:0
filter to see these - http://catalog.data.gov/dataset?q=num_resources%3A0
The 107 link was just for USDA, and there it seems to be affecting all the ones not accounted for by the dashboard (78 of 107).
All datajson datasets with num_resources:0 have been set to source_hash:"", so they will be updated by next harvest job as scheduled.
@FuhuXia @philipashlock It looks like the DOJ and DHS agency JSON files contain datasets with no accessURL, so we will still have some legitimate entries with 0 resources unless resources are added on the agencies' end.
DOJ: http://catalog.data.gov/dataset?organization=doj-gov&q=num_resources%3A0
DHS: http://catalog.data.gov/dataset?q=num_resources%3A0&organization=dhs-gov
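The filter links above can be generated rather than typed by hand. This is a small sketch using only the query parameters visible in those links (`q` and `organization`); the function name `zero_resource_url` is made up for illustration:

```python
from urllib.parse import urlencode

CATALOG_SEARCH = "http://catalog.data.gov/dataset"


def zero_resource_url(organization=None):
    """Build a catalog search URL for datasets with num_resources:0."""
    params = {"q": "num_resources:0"}
    if organization:
        # Organization slug as used in the links above, e.g. "dhs-gov".
        params["organization"] = organization
    return CATALOG_SEARCH + "?" + urlencode(params)


print(zero_resource_url("dhs-gov"))
```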
@philipashlock A few more observations on the reharvest of the datasets that have been set to source_hash:""
A) Impacts around 97% of the datasets
Most important: the NASA JSON harvest source is no longer in production, which prevents the 2378 datasets with source_hash:"" from being updated. Since the harvest source is not available to restart, we would need to restore the resources, where available, from the source JSON object present in the database (this will require a bit more work).
B) The following agency JSON files have a few schema issues, which prevent the datasets with a reset source_hash from being updated during the harvest process:
NSF - http://www.nsf.gov/data.json (32 datasets are impacted) 1 Error loading json content: No JSON object could be decoded.
USAID - http://www.usaid.gov/data.json (8 datasets are impacted) 1 Error loading json content: Expecting property name: line 993 column 27 (char 57068).
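Errors of this kind can be reproduced locally before a harvest run. The sketch below assumes the data.json content has already been downloaded into a string and uses Python's standard `json` module; `check_json` is a hypothetical helper, and the exact message wording varies between Python versions:

```python
import json


def check_json(text):
    """Try to parse text as JSON and report where parsing failed, if it did."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return {
            "ok": False,
            "message": err.msg,
            "line": err.lineno,
            "column": err.colno,
            "char": err.pos,
        }
    return {"ok": True}


# An unquoted property name, similar in kind to the USAID error above.
print(check_json('{a: 1}'))
```

The line, column, and char values correspond to the position reported in messages such as "line 993 column 27 (char 57068)", so the agency can be pointed at the exact spot in its file.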
C) The HHS JSON file (http://healthdata.gov/data.json) has validation errors on around 40 datasets, which is affecting the 22 datasets with source_hash set to ""
D) The State of NY JSON file (https://data.ny.gov/data.json?version=2) harvest process has been stuck in production for the last 3 weeks, which is preventing the update of the 4 datasets with source_hash set to ""
The fix is in production, and the main NASA harvest source (A) data is resolved.
B, C, and D are still pending and need to be coordinated with the respective agencies.
Most of them are resolved now. We can close the issue when this URL returns 0 datasets: http://catalog.data.gov/dataset?q=num_resources%3A0+and+source_hash%3A%22%22&sort=score+desc%2C+name+asc
Pending: agencies need to follow up on their data.json file errors. Some of these datasets have zero download resources defined in the source JSON files.
C) The HHS JSON file (http://healthdata.gov/data.json) has validation errors on around 30 datasets, which is affecting the 14 datasets with source_hash set to ""
D) The State of NY JSON file (https://data.ny.gov/data.json?version=2) harvest process has been stuck in production for the last 3 weeks, which is preventing the update of the 4 datasets with source_hash set to ""
There are only four datasets left now, all from State of NY. Once those are addressed we can close the issue.
The latest NY City JSON harvest job has finished, and no datasets with a reset source_hash remain, so we can close the ticket now.
Here's one example where no downloadable resources are listed
http://catalog.data.gov/dataset/best-of-charts-of-note-2013
But if you look at the source json metadata you can see that it does specify an accessURL: http://catalog.data.gov/harvest/object/e06ac937-5b65-4acc-8707-a96ccf465164
For USDA alone, it looks like there are 107 datasets with no resources shown in CKAN while their data.json analysis suggests there should only be 28.
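To get the expected count from the source side, one could count the data.json entries that define no access or download URL at all. This is a rough sketch, assuming Project Open Data style field names (`accessURL` on the dataset, or `accessURL` / `downloadURL` inside `distribution`) and that the catalog is either a bare list of datasets or an object with a `dataset` key; the function name is hypothetical:

```python
def datasets_without_access(catalog):
    """Return identifiers (or titles) of datasets that define no access URL."""
    # Assumption: the catalog is a list of datasets, or a dict with a "dataset" list.
    datasets = catalog.get("dataset", []) if isinstance(catalog, dict) else catalog
    missing = []
    for ds in datasets:
        distributions = ds.get("distribution") or []
        has_url = bool(ds.get("accessURL")) or any(
            d.get("accessURL") or d.get("downloadURL") for d in distributions
        )
        if not has_url:
            missing.append(ds.get("identifier") or ds.get("title"))
    return missing


sample = [
    {"identifier": "a", "accessURL": "http://example.gov/a.csv"},
    {"identifier": "b", "distribution": [{"downloadURL": "http://example.gov/b.csv"}]},
    {"identifier": "c"},
]
print(datasets_without_access(sample))
```

Comparing the length of this list with the num_resources:0 count in CKAN for the same organization shows how many datasets lost their resources during harvest rather than never having any.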