GSA / datagov-wptheme

Data.gov WordPress Theme (obsolete)
https://www.data.gov

Some Data and Resources not showing up from data.json harvests #457

Closed philipashlock closed 9 years ago

philipashlock commented 10 years ago

Here's one example where no downloadable resources are listed

http://catalog.data.gov/dataset/best-of-charts-of-note-2013

But if you look at the source json metadata you can see that it does specify an accessURL: http://catalog.data.gov/harvest/object/e06ac937-5b65-4acc-8707-a96ccf465164

For USDA alone, it looks like there are 107 datasets with no resources shown in CKAN while their data.json analysis suggests there should only be 28.
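A rough way to get the "expected" count from the data.json side is sketched below. It assumes the file is a JSON list of dataset entries, and that an entry has a resource if it carries a top-level `accessURL` or a `distribution` list whose items carry `accessURL`. That layout and the helper names are assumptions for illustration, not the harvester's actual logic.

```python
import json


def has_resource(entry):
    """True if the entry names at least one access URL (assumed field layout)."""
    if entry.get("accessURL"):
        return True
    # "distribution" may be missing or null, so fall back to an empty list.
    return any(d.get("accessURL") for d in (entry.get("distribution") or []))


def titles_without_resources(data_json_text):
    """List the titles of entries that would legitimately show zero resources."""
    entries = json.loads(data_json_text)
    return [e.get("title", "") for e in entries if not has_resource(e)]


# Made-up sample entries, not real USDA records.
sample = json.dumps([
    {"title": "Dataset A", "accessURL": "http://example.gov/a.csv"},
    {"title": "Dataset B"},
    {"title": "Dataset C", "distribution": [{"accessURL": "http://example.gov/c.csv"}]},
])

print(titles_without_resources(sample))
```

Comparing the length of that list with the num_resources:0 count in CKAN shows how many of the empty datasets are harvest errors rather than gaps in the source file.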

FuhuXia commented 10 years ago

Not sure how it happened. Can't blame the code because if we start everything fresh for the same harvest source, the accessURL will be harvested fine.

If we already have a list of the datasets to be corrected, we can use the API to reset the source_hash for the listed datasets; a reharvest will then correct the error.

If we don't have the list yet, we should be able to run a query on the DB server to list all the problematic datasets, then reset source_hash and reharvest them.
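The reset step could look roughly like the sketch below. It assumes source_hash is stored as an entry in a CKAN-style package dict's `extras` list (`{"key": ..., "value": ...}` pairs). The function name is hypothetical, and writing the package back through the API is deliberately left out.

```python
def reset_source_hash(package):
    """Return a copy of a CKAN-style package dict with source_hash blanked.

    Assumes source_hash lives in the "extras" list; the input is not modified.
    """
    extras = [dict(e) for e in package.get("extras", [])]
    for extra in extras:
        if extra.get("key") == "source_hash":
            extra["value"] = ""
            break
    else:
        # No source_hash extra yet: add an empty one so the next harvest
        # sees a mismatch and refreshes the dataset.
        extras.append({"key": "source_hash", "value": ""})
    updated = dict(package)
    updated["extras"] = extras
    return updated


# Made-up package for illustration.
pkg = {"name": "example-dataset", "extras": [{"key": "source_hash", "value": "abc123"}]}
print(reset_source_hash(pkg)["extras"])
```

With the hash blanked, the next harvest job should treat the dataset as changed and pull its resources again.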

philipashlock commented 10 years ago

@FuhuXia Perhaps this should be filtered by harvest source type, but you can just use the num_resources:0 filter to see these - http://catalog.data.gov/dataset?q=num_resources%3A0

The 107 link was just for USDA and there it seems like it's affecting all the ones not accounted for by the dashboard (78 of 107)

FuhuXia commented 10 years ago

All datajson datasets with num_resources:0 have been set to source_hash:"", so they will be updated by next harvest job as scheduled.

kvuppala commented 10 years ago

@FuhuXia @philipashlock Looks like the DOJ and DHS agency JSON files contain datasets with no accessURL, so we will still have some legitimate entries with 0 resources unless they are added on the agencies' end.

DOJ: http://catalog.data.gov/dataset?organization=doj-gov&q=num_resources%3A0

DHS: http://catalog.data.gov/dataset?q=num_resources%3A0&organization=dhs-gov
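For anyone who wants to check other agencies, the filtered URLs used in this thread can be rebuilt with a few lines of Python. This is only a convenience sketch; `search_url` is a made-up helper name, and the parameters are the ones already visible in the links above.

```python
from urllib.parse import urlencode

CATALOG_SEARCH = "http://catalog.data.gov/dataset"


def search_url(query, organization=None):
    """Build a catalog search URL for a query, optionally limited to one organization."""
    params = {"q": query}
    if organization:
        params["organization"] = organization
    return CATALOG_SEARCH + "?" + urlencode(params)


print(search_url("num_resources:0", organization="dhs-gov"))
# http://catalog.data.gov/dataset?q=num_resources%3A0&organization=dhs-gov
```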

kvuppala commented 10 years ago

@philipashlock A few more observations on the reharvest of the datasets that have been set to source_hash:""

A) Impacts around 97% of the datasets

Most important: the NASA JSON harvest source is no longer in production, so the 2378 datasets with source_hash:"" cannot be updated. Since the harvest source is not available to restart, we would need to restore the resources, where available, from the source JSON object present in the database (this will require a bit more work).

B) The following agency JSON files have a few schema issues, which prevent the datasets with a reset source_hash from being updated during the harvest process:

NSF - http://www.nsf.gov/data.json (32 datasets are impacted) 1 Error loading json content: No JSON object could be decoded.

USAID - http://www.usaid.gov/data.json (8 datasets are impacted) 1 Error loading json content: Expecting property name: line 993 column 27 (char 57068).

C) The HHS JSON file (http://healthdata.gov/data.json) has validation errors on around 40 datasets, which is affecting the 22 datasets with source_hash set to ""

D) The State of NY JSON file (https://data.ny.gov/data.json?version=2) harvest process has been stuck in production for the last 3 weeks, which is preventing the update of the 4 datasets with source_hash set to ""
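The parse errors listed under B can be reproduced locally before a harvest run. The sketch below uses Python 3's standard `json` module; the exact message wording varies between Python versions, so it will not necessarily match the harvester messages quoted above. `check_json` is a made-up helper name.

```python
import json


def check_json(text):
    """Return None if text parses as JSON, else a short description of the error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno/colno/pos point at the place where parsing failed.
        return f"{err.msg}: line {err.lineno} column {err.colno} (char {err.pos})"
    return None


# A trailing comma is a typical cause of a parse failure inside an object.
print(check_json('{"title": "ok"}'))
print(check_json('{"title": "broken",}'))
```

Running this against a downloaded data.json gives the agency the line and column to fix.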

kvuppala commented 10 years ago

The fix is in production, and the main NASA harvest source data (A) is resolved.

B, C and D are still pending and need to be coordinated with the respective agencies.

kvuppala commented 9 years ago

Most of them are resolved now; we can close the issue when this URL returns 0 datasets: http://catalog.data.gov/dataset?q=num_resources%3A0+and+source_hash%3A%22%22&sort=score+desc%2C+name+asc

Still pending: the agencies need to follow up on their data.json file errors. Some of these datasets have zero download resources defined in the source JSON files.

C) The HHS JSON file (http://healthdata.gov/data.json) has validation errors on around 30 datasets, which is affecting the 14 datasets with source_hash set to ""

D) The State of NY JSON file (https://data.ny.gov/data.json?version=2) harvest process has been stuck in production for the last 3 weeks, which is preventing the update of the 4 datasets with source_hash set to ""

kvuppala commented 9 years ago

Only four datasets remain now, all from State of NY; once those are addressed we can close the issue.


kvuppala commented 9 years ago

The latest NY City JSON harvest job has finished, and no datasets with a reset source_hash remain, so we can close the ticket now.

http://catalog.data.gov/dataset?q=num_resources%3A0+and+source_hash%3A%22%22&sort=score+desc%2C+name+asc
