CleanData / data-network

An app for the visualisation of connections between datasets
BSD 2-Clause "Simplified" License

Check validity of dataset links to sites #14

Open jonroberts opened 10 years ago

jonroberts commented 10 years ago

We need a crawler to continually check the validity of links in the site and highlight any broken links. This could be a native Python function inside Django, or (probably better) a cron job in whatever language.

valtsymbal commented 10 years ago

What this job should do:

  1. Select the URLs from the url table.
  2. Iterate through the list in Java or Python.
  3. When a link is dead, either record it in a separate badurls table, or add a "valid" yes/no column to the existing urls table and update it. Either way, bad URLs probably should not be displayed on the site until they are resolved.
  4. Notify an admin or developer about bad URLs for resolution.
  5. Run every morning, around 6am.

what else..or what not...


jonroberts commented 10 years ago

Absolutely.

There are two approaches I see to this:

  1. Provide an API endpoint that lists all the dataset urls, that can be checked using an independent program on this server - or another.
  2. Write an integrated Django module that runs regularly and writes the details to a table.

The first is the quickest. I'll set up an API endpoint that provides all the URLs. That's going to be useful for the dataset scraping too, as it provides a way to see at creation time whether a dataset already exists.

jonroberts commented 10 years ago

Okay, both the test server (cleandata.jrsandbox.com) and the production server (cleandatahub.org) have been updated to have a dataset url query in the API.

To get a list of the existing urls in the database:

http://cleandatahub.org/api/v1/dataset_url/?format=json&limit=100&offset=100

The arguments to the API call are:

limit - the number of records to return in a single call
offset - the number of records to skip; this provides pagination of the results

The return value of the API call has a previous and a next field for moving through the paginated results.
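A checker can follow those fields to walk every page. As a sketch, assuming the response shape described above, with `fetch_page` as a hypothetical helper that GETs a path on the server and parses the JSON body:

```python
# Sketch: walk the paginated dataset_url API by following meta.next.
import json
import urllib.request

def fetch_page(path, host="http://cleandatahub.org"):
    """GET a path on the API host and parse the JSON body (hypothetical helper)."""
    with urllib.request.urlopen(host + path) as response:
        return json.loads(response.read())

def collect_urls(fetch=fetch_page):
    """Follow meta.next from page to page until it is null, gathering URLs."""
    urls = []
    path = "/api/v1/dataset_url/?format=json&limit=100&offset=0"
    while path:
        page = fetch(path)
        urls.extend(obj["url"] for obj in page["objects"])
        path = page["meta"]["next"]  # null (None) on the last page
    return urls
```

Passing `fetch` in as a parameter also makes the pagination logic easy to test without hitting the server.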

The API call can also be used to check for the existence of a particular dataset. To search for an exact match:

http://cleandatahub.org/api/v1/dataset_url/?format=json&url=https://data.cityofnewyork.us/Transportation/Medallion-Drivers/iux8-53rc 

This will return the dataset in the format:

{
  "meta": {
    "limit": 20,
    "next": null,
    "offset": 0,
    "previous": null,
    "total_count": 1
  },
  "objects": [
    {
      "resource_uri": "/api/v1/dataset_url/153/",
      "url": "https://data.cityofnewyork.us/Transportation/Medallion-Drivers/iux8-53rc"
    }
  ]
}

(if it exists) or an empty array of objects if it doesn't. Note that you can also query the full datasets:

http://cleandatahub.org/api/v1/dataset/?format=json&url=https://data.cityofnewyork.us/Transportation/Medallion-Drivers/iux8-53rc
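A rough sketch of that existence check, assuming the response shape shown above; `urllib.parse.urlencode` handles escaping the dataset URL in the query string:

```python
# Sketch: build the exact-match query and decide whether a dataset is
# already known from the API's total_count field.
import urllib.parse

API = "http://cleandatahub.org/api/v1/dataset_url/"

def exact_match_query(dataset_url):
    """Build the exact-match query URL for a dataset link."""
    return API + "?" + urllib.parse.urlencode({"format": "json", "url": dataset_url})

def dataset_exists(response):
    """True when the API reports at least one matching dataset_url record."""
    return response["meta"]["total_count"] > 0
```

The scraper can call `exact_match_query` before creating a dataset and skip it when `dataset_exists` is true for the parsed response.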