ckan / datapusher

A standalone web service that pushes data files from a CKAN site's resources into its DataStore
GNU Affero General Public License v3.0

Explain how datapusher works and add API documentation #18

Open rufuspollock opened 10 years ago

rufuspollock commented 10 years ago

Need to add the following to the documentation:

nigelbabu commented 10 years ago

@rgrp The datapusher is built on ckanserviceprovider. It runs independently of CKAN and talks to it through the API. As for documentation of the datapusher's own API, the ckanserviceprovider docs should give you an idea; further pull requests are welcome. I've updated the issue slightly to read as a bug.

amercader commented 10 years ago

These are really good questions and really timely as we are working on the DataPusher docs before the release.

As @nigelbabu mentions, the DataPusher is a standalone application (although generally installed on the same server) and all communication with CKAN core and the DataStore is done via HTTP.

  1. CKAN talks to the DataPusher using the CKAN Service Provider protocol, telling it "Please, upload this resource to the DataStore". The request sent is something like the following, where dp.json holds the JSON body shown:

    http POST http://localhost:8800/job Content-Type:application/json < dp.json

    {
        "api_key": "XXXXXX",
        "job_type": "push_to_datastore",
        "result_url": "http://localhost:5000/api/3/action/datapusher_hook",
        "metadata": {
            "ckan_url": "http://localhost:5000",
            "resource_id": "08872bf2-c620-4555-97ed-18e9f874a314"
        }
    }

    You can of course send these requests from another client.

  2. Once the job is created, the DataPusher will request the remote file contents, process them and push them to the DataStore via the datastore_create action.
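To make step 2 more concrete, here is a minimal sketch of how a parsed CSV file could be turned into a request body for the datastore_create action. It is not the DataPusher's actual code: the function name `build_datastore_create_payload` is made up for illustration, and the sketch assumes every column is stored as text, with no type detection.

```python
import csv
import io
import json


def build_datastore_create_payload(resource_id, csv_text):
    """Build a datastore_create request body from CSV text (illustrative only)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Each CSV row becomes one record, keyed by the header names.
    records = [dict(row) for row in reader]
    # Assumption: declare every column as "text"; no type guessing is attempted.
    fields = [{"id": name, "type": "text"} for name in (reader.fieldnames or [])]
    return {
        "resource_id": resource_id,
        "fields": fields,
        "records": records,
    }


if __name__ == "__main__":
    sample = "name,age\nAnn,30\nBob,25\n"
    payload = build_datastore_create_payload(
        "08872bf2-c620-4555-97ed-18e9f874a314", sample
    )
    print(json.dumps(payload, indent=2))
```

The resulting dictionary would then be sent as JSON in a POST to `<ckan_url>/api/3/action/datastore_create`, authenticated with the API key that came with the job.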

Here's a simple diagram of the whole process in case it helps:

[Image: glasgow workshop]

We'll try and improve the docs with these details.

rufuspollock commented 10 years ago

Also, I now understand that this is an instance of CKAN Service Provider and follows its docs.

The actual job type is push_to_datastore. Example code grabbed from ckanext-datapusherext is:

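    # Imports this excerpt relies on (Python 2 era code); user, callback_url,
    # datapusher_url, res_id and data_dict are defined elsewhere in the extension.
    import json
    import urlparse

    import pylons
    import requests

    requests.post(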
        urlparse.urljoin(datapusher_url, 'job'),
        headers={
            'Content-Type': 'application/json'
        },
        data=json.dumps({
            'api_key': user['apikey'],
            'job_type': 'push_to_datastore',
            'result_url': callback_url,
            'metadata': {
                'ckan_url': pylons.config['ckan.site_url'],
                'resource_id': res_id,
                'set_url_type': data_dict.get('set_url_type', False)
            }
        }))
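
For anyone who wants to send the same job from a standalone client, the following is a rough Python 3 sketch of how the request could be assembled without Pylons. The helper name `build_job_request` and its parameters are hypothetical; only the URL path, the header and the payload keys come from the excerpt above.

```python
import json
from urllib.parse import urljoin


def build_job_request(datapusher_url, api_key, ckan_url, resource_id, result_url):
    """Return (url, headers, body) for a push_to_datastore job (illustrative only)."""
    # Normalise the base URL so urljoin appends "job" instead of replacing a path segment.
    url = urljoin(datapusher_url.rstrip("/") + "/", "job")
    headers = {"Content-Type": "application/json"}
    body = json.dumps({
        "api_key": api_key,
        "job_type": "push_to_datastore",
        "result_url": result_url,
        "metadata": {
            "ckan_url": ckan_url,
            "resource_id": resource_id,
        },
    })
    return url, headers, body


if __name__ == "__main__":
    url, headers, body = build_job_request(
        "http://localhost:8800",
        "XXXXXX",
        "http://localhost:5000",
        "08872bf2-c620-4555-97ed-18e9f874a314",
        "http://localhost:5000/api/3/action/datapusher_hook",
    )
    print(url)
    print(body)
```

The three return values can be passed to any HTTP client, for example `requests.post(url, headers=headers, data=body)` if the requests library is installed.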

florianm commented 10 years ago

For us non-core developers, it would be great to have some docs on the requests sent between datapusher and the CKAN API. This matters for deployments behind firewalls and proxies: datapusher sends HTTP requests to the ckan.site_url, and those requests must be able to pass through the firewall and proxy.

It's quite tricky to figure out why a perfectly fine datapusher gets a mysterious "could not post to result_url" from a perfectly fine CKAN API. Of course this is not a problem of datapusher per se, but it's in the nature of CKAN/datapusher that they will get installed for bigger audiences, often on cloud services with weird and wonderful proxy and firewall settings. I'm happy to contribute a section on using curl to debug failing HTTP requests between datapusher and the CKAN API if that's any good!
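
As a starting point for that kind of debugging, here is a small sketch that can be run on the machine hosting datapusher to see whether the CKAN API is reachable from there. The function names are made up for illustration, and the choice of endpoints (status_show plus the datapusher_hook callback from the example request earlier in this thread) is an assumption about what is worth probing, not an official checklist.

```python
import urllib.error
import urllib.request


def urls_datapusher_must_reach(ckan_url):
    """List CKAN API URLs worth probing from the datapusher host (illustrative)."""
    base = ckan_url.rstrip("/")
    return [
        base + "/api/3/action/status_show",
        base + "/api/3/action/datapusher_hook",
    ]


def probe(url, timeout=10):
    """Return (reachable, detail) for a plain GET request against url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return True, "HTTP %d" % response.status
    except urllib.error.HTTPError as exc:
        # The server answered, so the network path works even though the call failed.
        return True, "HTTP %d" % exc.code
    except OSError as exc:
        # URLError, timeouts and refused connections all end up here.
        return False, str(exc)


if __name__ == "__main__":
    # Replace with the value of ckan.site_url from your CKAN config.
    for url in urls_datapusher_must_reach("http://localhost:5000"):
        reachable, detail = probe(url)
        print("%-60s %s (%s)" % (url, "reachable" if reachable else "UNREACHABLE", detail))
```

An error status from datapusher_hook on a bare GET is expected and still counts as reachable here; the point is only to tell "CKAN rejected the call" apart from "the request never arrived". Note that urllib picks up proxy settings from environment variables such as http_proxy, so run the script in the same environment as the datapusher process.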

smrgeoinfo commented 9 years ago

+1 on documentation for this HTTP traffic-- we have been stuck for several weeks trying to figure out why datapusher and harvesting aren't working on our deployments. It's cost US A LOT of money. https://github.com/ngds/ckanext-ngds/issues/580

florianm commented 9 years ago

Update for those stuck between their firewall and a hard place: multi-tenant setup from source (should also work for single-tenant installs) and a diagram illustrating the HTTP traffic that crosses the boundaries of the installation's localhost.

Also worth reading is boxkite's setup.