Open jonroberts opened 11 years ago
What this job should do:
what else..or what not...
On Thu, Sep 5, 2013 at 11:18 PM, jonroberts notifications@github.com wrote:
We need a crawler to continually check the validity of links in the site and highlight any broken links. This could be a native python function inside django, or (probably better) a cron job in whatever language.
Absolutely.
There are two approaches I see to this:
The first is the quickest. I'll set up an API endpoint that provides all the URLs. That's going to be useful for the dataset scraping too, as it provides a way to see whether a dataset exists or not already at creation time.
Okay, both the test server (cleandata.jrsandbox.com) and the production server (cleandatahub.org) have been updated to have a dataset url query in the API.
To get a list of the existing urls in the database:
http://cleandatahub.org/api/v1/dataset_url/?format=json&limit=100&offset=100
The arguments to the API call are:
limit - the number of records to return in a single call
offset - the number of records to skip - this provides pagination of results
The return value of the API call has a previous and a next field for moving through the paginated results.
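As a rough sketch of how a client might walk through the paginated results, the snippet below follows the next field until it comes back null. The helper names (fetch_json, iter_dataset_urls) are made up for illustration, and it assumes next holds a path relative to the site root (or a full URL), which I haven't confirmed against the live API.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

API_ROOT = "http://cleandatahub.org"
FIRST_PAGE = "/api/v1/dataset_url/?format=json&limit=100&offset=0"


def fetch_json(url):
    """Download one page of API results and decode it as JSON."""
    with urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_dataset_urls(fetch=fetch_json, first_page=FIRST_PAGE):
    """Yield every dataset URL, one page at a time."""
    page = first_page
    while page:
        # urljoin leaves an absolute URL untouched and resolves a relative path.
        data = fetch(urljoin(API_ROOT, page))
        for obj in data["objects"]:
            yield obj["url"]
        # Assumption: meta.next is None (JSON null) on the last page.
        page = data["meta"]["next"]
```

Passing the fetch function in as an argument keeps the paging logic testable without hitting the server.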
The API call can also be used to check for the existence of a particular dataset. To search for an exact match:
http://cleandatahub.org/api/v1/dataset_url/?format=json&url=https://data.cityofnewyork.us/Transportation/Medallion-Drivers/iux8-53rc
This will return the dataset in the format:
{
    meta: {
        limit: 20,
        next: null,
        offset: 0,
        previous: null,
        total_count: 1
    },
    objects: [
        {
            resource_uri: "/api/v1/dataset_url/153/",
            url: "https://data.cityofnewyork.us/Transportation/Medallion-Drivers/iux8-53rc"
        }
    ]
}
(if it exists) or an empty array of objects if it doesn't. Note that you can also query the full datasets:
http://cleandatahub.org/api/v1/dataset/?format=json&url=https://data.cityofnewyork.us/Transportation/Medallion-Drivers/iux8-53rc
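For the dataset scraping side, an existence check against the dataset_url endpoint could look something like this. It is only a sketch: build_lookup_url and dataset_exists are hypothetical helper names, and it assumes the server accepts a percent-encoded url parameter the same way it accepts the raw one shown above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ENDPOINT = "http://cleandatahub.org/api/v1/dataset_url/"


def fetch_json(url):
    """Download an API response and decode it as JSON."""
    with urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def build_lookup_url(dataset_url):
    """Build the exact-match query; urlencode percent-encodes the dataset URL."""
    return ENDPOINT + "?" + urlencode({"format": "json", "url": dataset_url})


def dataset_exists(dataset_url, fetch=fetch_json):
    """Return True if the API returns at least one matching object."""
    data = fetch(build_lookup_url(dataset_url))
    # An unknown URL comes back with an empty objects array.
    return len(data["objects"]) > 0
```

A scraper could call dataset_exists(url) before creating a record and skip anything that is already in the database.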
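For the link checking itself, here is a minimal sketch of what the crawler's core might look like using only the Python standard library. Nothing here is decided: check_link and find_broken_links are hypothetical names, and using HEAD requests is an assumption (some servers reject HEAD, so a fallback to GET may be needed).

```python
import urllib.error
import urllib.request


def check_link(url, opener=urllib.request.urlopen, timeout=10):
    """Return (ok, detail) for one URL, where detail is a status code or error text."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return response.status < 400, str(response.status)
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        return False, str(err.code)
    except (urllib.error.URLError, OSError, ValueError) as err:
        # DNS failures, timeouts, refused connections, malformed URLs.
        return False, str(err)


def find_broken_links(urls, check=check_link):
    """Return a list of (url, detail) pairs for every URL that fails the check."""
    broken = []
    for url in urls:
        ok, detail = check(url)
        if not ok:
            broken.append((url, detail))
    return broken
```

A script along these lines could be run from cron, fed with the URLs from the dataset_url API, with the resulting list used to highlight broken links on the site.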