ckan / ckanext-dcat

CKAN ♥ DCAT
https://docs.ckan.org/projects/ckanext-dcat

Add command to dump all datasets serialized as RDF #150

Open amercader opened 5 years ago

amercader commented 5 years ago

We are cleaning up the CKAN core CLI (https://github.com/ckan/ckan/issues/4639), which includes the rdf-export command that dumps all datasets in the site as RDF files into a folder.

The implementation isn't particularly great and will probably fail on large sites, but is this a useful command to have anyway?

I'm curious to hear people's thoughts.

import os
import sys

from six.moves.urllib.error import HTTPError
from six.moves.urllib.parse import urljoin
from six.moves.urllib.request import urlopen

# CkanCommand and error() are defined alongside this class in ckan/lib/cli.py


class RDFExport(CkanCommand):
    '''Export active datasets as RDF

    This command dumps out all currently active datasets as RDF into the
    specified folder.

    Usage:
      paster rdf-export /path/to/store/output
    '''
    summary = __doc__.split('\n')[0]
    usage = __doc__

    def command(self):
        self._load_config()

        if not self.args:
            # No output folder given: print the usage text
            print(RDFExport.__doc__)
        else:
            self.export_datasets(self.args[0])

    def export_datasets(self, out_folder):
        '''
        Export datasets as RDF to an output folder.
        '''
        from ckan.common import config
        import ckan.model as model
        import ckan.logic as logic
        import ckan.lib.helpers as h

        # Create the output folder if it does not exist yet
        if not os.path.isdir(out_folder):
            os.makedirs(out_folder)

        fetch_url = config['ckan.site_url']
        user = logic.get_action('get_site_user')(
            {'model': model, 'ignore_auth': True}, {})
        context = {'model': model, 'session': model.Session,
                   'user': user['name']}
        dataset_names = logic.get_action('package_list')(context, {})
        for dataset_name in dataset_names:
            dd = logic.get_action('package_show')(context, {'id': dataset_name})
            if dd['state'] != 'active':
                continue

            # Build the absolute URL of the dataset's RDF serialization
            url = h.url_for('dataset.read', id=dd['name'])
            url = urljoin(fetch_url, url[1:]) + '.rdf'
            try:
                fname = os.path.join(out_folder, dd['name']) + ".rdf"
                try:
                    r = urlopen(url).read()
                except HTTPError as e:
                    if e.code == 404:
                        # error() prints the message and exits
                        error('Please install ckanext-dcat and enable the ' +
                              '`dcat` plugin to use the RDF serializations')
                    # Any other HTTP error: report it and skip this dataset
                    # rather than writing an undefined response to disk
                    sys.stderr.write('%s: %s\n' % (url, e))
                    continue
                with open(fname, 'wb') as f:
                    f.write(r)
            except IOError as ioe:
                sys.stderr.write(str(ioe) + "\n")

metaodi commented 5 years ago

I think it's useful. If we include it in this extension, two parameters should be added: serialization format (xml, json-ld, turtle, ...) and profile.
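
To make the two parameters concrete, here is a minimal sketch of how the export URL could be built from them. It assumes the dataset endpoint accepts a format extension and a comma-separated `profiles` query parameter, as described in the ckanext-dcat documentation. `build_rdf_url` and `SUPPORTED_FORMATS` are hypothetical names, and the site URL and profile names in the usage lines are just examples.

```python
from urllib.parse import urlencode

# Extensions assumed to be served by the dataset RDF endpoint
SUPPORTED_FORMATS = ('rdf', 'xml', 'ttl', 'n3', 'jsonld')


def build_rdf_url(site_url, dataset_name, fmt='rdf', profiles=None):
    """Return the URL of one dataset's RDF serialization.

    `fmt` selects the serialization via the file extension and
    `profiles` is an optional list of profile names.
    """
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError('Unsupported format: %s' % fmt)
    url = '%s/dataset/%s.%s' % (site_url.rstrip('/'), dataset_name, fmt)
    if profiles:
        # Profiles are passed as a single comma-separated value
        url += '?' + urlencode({'profiles': ','.join(profiles)})
    return url


default_url = build_rdf_url('https://example.org/', 'my-dataset')
turtle_url = build_rdf_url('https://example.org', 'my-dataset',
                           fmt='ttl', profiles=['euro_dcat_ap', 'schemaorg'])
```

The command would then take the format and profiles as options and use the same extension for the output file name.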

jvanulde commented 5 years ago

Can it be added to the API?

GKStGovData commented 1 year ago

This function is useful. While it is missing from the API, @knudmoeller wrote their own download module: https://github.com/berlinonline/dcat-catalog-downloader