datopian / ckanext-sweden

CKAN extension for Öppnadata.se, the Swedish data management platform
GNU Affero General Public License v3.0

harvester error #15

Open joetsoi opened 8 years ago

joetsoi commented 8 years ago
2015-07-20 19:18:41,307 DEBUG [ckanext.harvest.model] Harvest tables defined in memory
2015-07-20 19:18:41,310 DEBUG [ckanext.harvest.model] Harvest tables already exist
2015-07-20 19:18:41,403 DEBUG [ckanext.harvest.queue] Gather queue consumer registered
2015-07-20 19:18:41,404 DEBUG [ckanext.harvest.queue] Received harvest job id: 64887a99-18c8-436f-bd4b-e580d3ed632d
2015-07-20 19:18:41,410 DEBUG [ckanext.dcat.harvesters.rdf] In DCATRDFHarvester gather_stage
2015-07-20 19:18:41,410 DEBUG [ckanext.dcat.harvesters.base] Getting file http://www.arjang.se/datasets/dcat
2015-07-20 19:18:41,620 ERROR [ckanext.harvest.harvesters.base] Could not get content. Server responded with 404
2015-07-20 19:18:41,621 INFO  [ckanext.sweden.dcat.plugin] after download
2015-07-20 19:18:41,922 INFO  [ckanext.sweden.dcat.plugin] 400
2015-07-20 19:18:41,932 INFO  [ckanext.sweden.dcat.plugin] {"errors": true, "rdfError": "No RDF detected."}
2015-07-20 19:18:41,947 ERROR [ckanext.harvest.harvesters.base] The validation service returned an error: 400
2015-07-20 19:18:41,960 ERROR [ckanext.harvest.queue] Gather stage failed
2015-07-20 19:18:41,961 DEBUG [ckanext.harvest.queue] Received harvest job id: eed84020-529c-4bf5-a7f3-e2b1923ad122
2015-07-20 19:18:41,967 DEBUG [ckanext.dcat.harvesters.rdf] In DCATRDFHarvester gather_stage
2015-07-20 19:18:41,967 DEBUG [ckanext.dcat.harvesters.base] Getting file http://www.arjeplog.se/datasets/dcat
2015-07-20 19:18:45,088 ERROR [ckanext.harvest.harvesters.base] Could not get content. Server responded with 404
2015-07-20 19:18:45,089 INFO  [ckanext.sweden.dcat.plugin] after download
2015-07-20 19:18:45,255 INFO  [ckanext.sweden.dcat.plugin] 400
2015-07-20 19:18:45,258 INFO  [ckanext.sweden.dcat.plugin] {"errors": true, "rdfError": "No RDF detected."}
2015-07-20 19:18:45,273 ERROR [ckanext.harvest.harvesters.base] The validation service returned an error: 400
2015-07-20 19:18:45,278 ERROR [ckanext.harvest.queue] Gather stage failed
2015-07-20 19:18:45,279 DEBUG [ckanext.harvest.queue] Received harvest job id: 6ae38d4d-5f60-462e-abd6-e882cb5e7319
2015-07-20 19:18:45,286 DEBUG [ckanext.dcat.harvesters.rdf] In DCATRDFHarvester gather_stage
2015-07-20 19:18:45,287 DEBUG [ckanext.dcat.harvesters.base] Getting file http://www.arkitekturmuseet.se/datasets/dcat
2015-07-20 19:18:45,396 ERROR [ckanext.harvest.harvesters.base] Could not get content because a
                                connection error occurred. HTTPConnectionPool(host='www.arkitekturmuseet.se', port=80): Max retries exceeded with url: /datasets/dcat (Caused by <class 'socket.gaierror'>: [Errno -2] Name or service not known)
2015-07-20 19:18:45,397 INFO  [ckanext.sweden.dcat.plugin] after download
2015-07-20 19:18:45,570 INFO  [ckanext.sweden.dcat.plugin] 400
2015-07-20 19:18:45,572 INFO  [ckanext.sweden.dcat.plugin] {"errors": true, "rdfError": "No RDF detected."}
2015-07-20 19:18:45,580 ERROR [ckanext.harvest.harvesters.base] The validation service returned an error: 400
2015-07-20 19:18:45,588 ERROR [ckanext.harvest.queue] Gather stage failed
2015-07-20 19:18:45,590 DEBUG [ckanext.harvest.queue] Received harvest job id: 0b674a60-fccd-43a5-8c73-20b4cf687389
2015-07-20 19:18:45,599 DEBUG [ckanext.dcat.harvesters.rdf] In DCATRDFHarvester gather_stage
2015-07-20 19:18:45,600 DEBUG [ckanext.dcat.harvesters.base] Getting file http://www.arn.se/datasets/dcat
2015-07-20 19:18:45,811 INFO  [ckanext.sweden.dcat.plugin] after download
2015-07-20 19:18:46,056 INFO  [ckanext.sweden.dcat.plugin] 400
2015-07-20 19:18:46,058 INFO  [ckanext.sweden.dcat.plugin] {"errors": true, "rdfError": "No RDF detected."}
2015-07-20 19:18:46,063 ERROR [ckanext.harvest.harvesters.base] The validation service returned an error: 400

Traceback (most recent call last):
  File "/usr/lib/ckan/default/bin/paster", line 9, in <module>
    load_entry_point('PasteScript==1.7.5', 'console_scripts', 'paster')()
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 104, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 143, in invoke
    exit_code = runner.run(args)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 238, in run
    result = self.command()
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 135, in command
    gather_callback(consumer, method, header, body)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/queue.py", line 230, in gather_callback
    harvest_object_ids = harvester.gather_stage(job)
  File "/usr/lib/ckan/default/src/ckanext-dcat/ckanext/dcat/harvesters/rdf.py", line 167, in gather_stage
    parser.parse(content)
  File "/usr/lib/ckan/default/src/ckanext-dcat/ckanext/dcat/processors.py", line 129, in parse
    self.g.parse(data=data, format=_format)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rdflib/graph.py", line 1035, in parse
    parser.parse(source, self, **args)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rdflib/plugins/parsers/rdfxml.py", line 577, in parse
    self._parser.parse(source)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
  File "/usr/lib/python2.7/xml/sax/expatreader.py", line 349, in end_element_ns
    self._cont_handler.endElementNS(pair, None)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rdflib/plugins/parsers/rdfxml.py", line 160, in endElementNS
    self.current.end(name, qname)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/rdflib/plugins/parsers/rdfxml.py", line 331, in node_element_end
    self.error("Repeat node-elements inside property elements: %s"%"".join(name))
TypeError: sequence item 0: expected string, NoneType found

joetsoi commented 8 years ago

@amercader, it looks like we've triggered a bug in rdflib itself: name contains a None item, so building the error message raises a TypeError instead of an rdflib exception, and the harvester crashes. I suspect the underlying problem is the data.
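
To illustrate the failure mode, here is a minimal sketch, not the actual rdflib or ckanext-dcat code. It assumes name is a pair whose first item is None, which is what the "sequence item 0" message in the traceback suggests. The helpers format_repeat_node_error and safe_parse are hypothetical names made up for this example. The second helper shows one way a gather stage could report any parser exception instead of crashing the queue consumer.

```python
def format_repeat_node_error(name):
    # Mirrors the expression on the last frame of the traceback:
    # "".join(name) fails if any item of name is None.
    return "Repeat node-elements inside property elements: %s" % "".join(name)


def safe_parse(parse, content):
    """Call parse(content) and return (ok, error_message).

    Hypothetical wrapper: it catches every Exception, not only the
    parser's own exception types, so a bug inside the parser becomes
    a reportable gather error rather than an unhandled crash.
    """
    try:
        parse(content)
    except Exception as e:
        return False, "Error parsing the RDF content: %s: %s" % (
            type(e).__name__, e)
    return True, None


if __name__ == "__main__":
    # Assumed shape of name: a pair with a missing first item.
    bad_name = (None, "Dataset")

    def broken_parse(content):
        # Stand-in for a parser that hits the buggy error path.
        raise_message = format_repeat_node_error(bad_name)
        raise ValueError(raise_message)

    ok, message = safe_parse(broken_parse, "<rdf/>")
    print(ok)
    print(message)
```

Catching a broad Exception here is a deliberate trade-off: it hides nothing from the logs because the exception type is included in the message, but it keeps one bad source from stopping the remaining harvest jobs.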