ckan / ckanext-dcat

CKAN ♥ DCAT
https://docs.ckan.org/projects/ckanext-dcat

Install only the DCAT Harvester #14

Closed montxo5 closed 10 years ago

montxo5 commented 10 years ago

I'm very interested in this extension, but especially in the DCAT harvester. Would it be possible to install only this function, and how can I do it? Thanks in advance.

amercader commented 10 years ago

You can't install individual plugins, but this extension is basically centred around the DCAT harvester so you want to install the whole lot anyway.

I've added some install instructions:

https://github.com/ckan/ckanext-dcat#install

montxo5 commented 10 years ago

Thank you. I've installed it and configured a harvester for an XML RDF source, but gather_consumer.log shows this error: ERROR [ckanext.harvest.queue] No harvester could be found for source type dcat_xml

It seems that the queue can't find the harvester for RDF_XML.

montxo5 commented 10 years ago

Sorry, my fault. I didn't restart supervisor... Now the JSON harvester is working, but the XML one always crashes with this error in the fetch_consumer: ValueError: The provided document does not seem to contain a dcat:Dataset element

I've also tried with the example files. Thanks in advance.

Full trace:

  File "/usr/lib/ckan/default/bin/paster", line 9, in <module>
    load_entry_point('PasteScript==1.7.5', 'console_scripts', 'paster')()
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 104, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 143, in invoke
    exit_code = runner.run(args)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 238, in run
    result = self.command()
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 127, in command
    fetch_callback(consumer, method, header, body)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/queue.py", line 294, in fetch_callback
    fetch_and_import_stages(harvester, obj)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/queue.py", line 311, in fetch_and_import_stages
    success_import = harvester.import_stage(obj)
  File "/usr/lib/ckan/default/src/ckanext-dcat/ckanext/dcat/harvesters.py", line 305, in import_stage
    package_dict, dcat_dict = self._get_package_dict(harvest_object)
  File "/usr/lib/ckan/default/src/ckanext-dcat/ckanext/dcat/harvesters.py", line 398, in _get_package_dict
    dcat_dict = dataset.read_values()
  File "/usr/lib/ckan/default/src/ckanext-dcat/ckanext/dcat/formats/xml.py", line 26, in read_values
    tree = self.get_xml_tree()
  File "/usr/lib/ckan/default/src/ckanext-dcat/ckanext/dcat/formats/xml.py", line 58, in get_xml_tree
    raise ValueError('The provided document does not seem to contain a {0} element'.format(self.base_class))
ValueError: The provided document does not seem to contain a dcat:Dataset element
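To rule out the document itself as the cause of this error, one option is to check independently of the harvester whether the file contains any dcat:Dataset element. The following is only a rough sketch using Python's standard library; it is not how ckanext-dcat parses documents. The function name `count_datasets` and the sample document are made up for illustration, and the sketch assumes the feed uses the standard DCAT namespace `http://www.w3.org/ns/dcat#`.

```python
import xml.etree.ElementTree as ET

# Standard DCAT namespace; a feed declaring a different URI would not match.
DCAT_NS = "http://www.w3.org/ns/dcat#"


def count_datasets(rdf_xml):
    """Return how many dcat:Dataset elements appear anywhere in the document."""
    root = ET.fromstring(rdf_xml)
    return sum(1 for _ in root.iter("{%s}Dataset" % DCAT_NS))


# Hypothetical minimal document, for illustration only.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcat="http://www.w3.org/ns/dcat#"
         xmlns:dct="http://purl.org/dc/terms/">
  <dcat:Dataset rdf:about="http://example.com/dataset/1">
    <dct:title>Example dataset</dct:title>
  </dcat:Dataset>
</rdf:RDF>"""

if __name__ == "__main__":
    print(count_datasets(SAMPLE))
```

If this reports at least one element for a file that the harvester still rejects, the problem is more likely in the parsing code than in the document.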

amercader commented 10 years ago

@montxo5 this looks like a bug in the XML parsing. I'll try and push the fix in the next couple of days

amercader commented 10 years ago

@montxo5 can you see if the latest changes in d289c5871 fix the issue?

montxo5 commented 10 years ago

Thank you very much! Now it's working perfectly with your example.

I'm trying it with another DCAT document, from Madrid's Open Data portal. The import works fine for the datasets, but the resources are ignored, so it only creates empty datasets.

The RDF is here: http://datos.madrid.es/egob/catalogo.rdf Maybe the RDF they publish is not correct, could that be it? Thanks.

amercader commented 10 years ago

Hi @montxo5.

In DCAT land, the distributions are defined using the dcat:Distribution class. So for example, if you are using XML/RDF:

<dcat:distribution>
  <dcat:Distribution>
    <dct:title xml:lang="es">Consultas ciudadanas (2004-2013)</dct:title>
    <!-- ... -->
  </dcat:Distribution>
</dcat:distribution>

Note that the Madrid portal is using the dcat:Download class, which AFAICT does not exist:

<dcat:distribution>
  <dcat:Download>
    <dct:title xml:lang="es">Consultas ciudadanas (2004-2013)</dct:title>
    <!-- ... -->
  </dcat:Download>
</dcat:distribution>
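A quick way to see which classes a feed nests inside dcat:distribution is to count the child element types. This is a minimal sketch with Python's standard library, not part of the extension; `distribution_classes` and the sample document are hypothetical, and the standard DCAT namespace `http://www.w3.org/ns/dcat#` is assumed.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard DCAT namespace; a feed declaring a different URI would not match.
DCAT_NS = "http://www.w3.org/ns/dcat#"


def distribution_classes(rdf_xml):
    """Count the element types nested directly inside dcat:distribution."""
    root = ET.fromstring(rdf_xml)
    counts = Counter()
    for prop in root.iter("{%s}distribution" % DCAT_NS):
        for child in prop:
            counts[child.tag] += 1
    return counts


# Hypothetical document mixing both forms, for illustration only.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcat="http://www.w3.org/ns/dcat#"
         xmlns:dct="http://purl.org/dc/terms/">
  <dcat:Dataset rdf:about="http://example.com/dataset/1">
    <dcat:distribution>
      <dcat:Distribution>
        <dct:title>Expected form</dct:title>
      </dcat:Distribution>
    </dcat:distribution>
    <dcat:distribution>
      <dcat:Download>
        <dct:title>Form used by the Madrid feed</dct:title>
      </dcat:Download>
    </dcat:distribution>
  </dcat:Dataset>
</rdf:RDF>"""

if __name__ == "__main__":
    for tag, count in distribution_classes(SAMPLE).items():
        print(tag, count)
```

Any tag other than `{http://www.w3.org/ns/dcat#}Distribution` in the result points at distributions that would presumably be skipped.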

We followed the recommendations of the DCAT Application Profile for Data Portals in Europe as the basis for our support for harvesting DCAT-based documents, in case you want a reference.

Also, check the examples folder of this extension to see the serializations supported.

Hope this helps.

montxo5 commented 10 years ago

Thank you very much! You're right. I will try to contact Madrid's Open Data portal to explain it. For your information, we're trying to use this extension for a BigOpenPlatform, to be used in a datathon event with Madrid's city hall called MADdata. If you are interested, or if you know someone who might be, please check this page: http://maddata.es/ If we finally use this extension, we will mention it in the presentation. Thanks.

amercader commented 10 years ago

That looks great @montxo5, hope it's a good one in Madrid!

Closing the issue now