dice-group / Squirrel

Squirrel searches and collects Linked Data

Enable usage of CKAN API #37

Open MichaelRoeder opened 6 years ago

MichaelRoeder commented 6 years ago

Goal

The crawler should be able to use the API of a CKAN instead of crawling all its HTML pages.
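As a rough illustration of what "using the API" could look like, here is a minimal sketch that pages through a portal's datasets with the CKAN Action API endpoint `package_search` and collects resource URLs for the crawler. It uses only the Python standard library; the helper names (`http_get_json`, `iter_datasets`, `resource_urls`) and the injectable `fetch` parameter are made up for this example and are not part of Squirrel or CKAN.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def http_get_json(url):
    """Fetch a URL and decode the JSON body (plain stdlib, no ckanapi)."""
    with urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_datasets(base_url, rows=100, fetch=http_get_json):
    """Yield every dataset dict of a CKAN portal by paging package_search."""
    start = 0
    while True:
        query = urlencode({"rows": rows, "start": start})
        url = "%s/api/3/action/package_search?%s" % (base_url.rstrip("/"), query)
        payload = fetch(url)
        if not payload.get("success"):
            raise RuntimeError("CKAN API call failed: %s" % url)
        results = payload["result"]["results"]
        if not results:
            return
        for dataset in results:
            yield dataset
        start += len(results)
        # "count" is the total number of matching datasets on the portal.
        if start >= payload["result"]["count"]:
            return


def resource_urls(dataset):
    """Collect the non-empty download URLs of a dataset's resources."""
    return [r["url"] for r in dataset.get("resources", []) if r.get("url")]
```

The `fetch` argument exists only so the paging logic can be exercised without a live portal; a real analyzer would presumably also need rate limiting and error handling.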

Solution

There should be a special CKAN analyzer which is able to

Aklakan commented 6 years ago

Hi Michael, I just noticed this issue. For the QROWD project we are working on this dcat tool, which already supports basic uploads to and downloads from CKAN (besides loading datasets into Virtuoso). I am currently extending the tool to cover all of DCAT-AP, including the bindings for CKAN (based on this mapping table: http://extensions.ckan.org/extension/dcat/#rdf-dcat-to-ckan-dataset-mapping).

earthquakesan commented 6 years ago

imo, using ckanapi is the best option.

For better integration with Squirrel, the ckanapi Python library should be wrapped so that it can be used directly from Java code.

Aklakan commented 6 years ago

I am using Jackan and overall it works (and with a little workaround I got file upload working). The main issue is that they model the CKAN fields as Java beans instead of using a generic JSON object and wrapping it in a Java view. I have the impression that their bean model supports most, but not all, fields.

What they eventually do is marshal the Java bean as JSON and send it to the CKAN Web API.
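To make the "view over a generic JSON object" idea concrete, here is a small sketch in Python (chosen for brevity; the same pattern would apply to a Java view). The class `DatasetView` and the custom field `spatial_coverage` are hypothetical and have nothing to do with Jackan's actual API. The point is only that typed accessors read from and write to the raw dict, so fields the view does not know about survive a round trip.

```python
import json


class DatasetView(object):
    """Typed accessors over a raw CKAN dataset dict.

    Fields without an accessor stay untouched in `raw`, unlike a fixed
    bean model that would drop anything it does not declare.
    """

    def __init__(self, raw=None):
        self.raw = raw if raw is not None else {}

    @property
    def name(self):
        return self.raw.get("name")

    @name.setter
    def name(self, value):
        self.raw["name"] = value

    @property
    def title(self):
        return self.raw.get("title")

    @title.setter
    def title(self, value):
        self.raw["title"] = value

    def to_json(self):
        """Serialize the full underlying object, known and unknown fields alike."""
        return json.dumps(self.raw, sort_keys=True)


# Hypothetical dataset carrying a field the view has no accessor for.
view = DatasetView({"name": "air-quality", "spatial_coverage": "Leipzig"})
view.title = "Air quality measurements"
payload = view.to_json()
```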

MichaelRoeder commented 6 years ago

Jackan seems to have had its last dev commit in Nov 2016 :confused:

varuneranki commented 6 years ago

Hi, I have tried integrating ckan-aggregator-py using Jython and ckanapi. Jython can only handle HTTP GET requests, while ckanapi needs HTTPS GET requests. HTTPS support was added in Jython 2.7.1, but it fails; I also tried on a Windows machine and still get an SSL error. urllib3 causes this SSL issue when an HTTPS call is made. I have also read a discussion saying that Jython might cause more security issues with Java and that support could be added in Jython 2.7.2. Imports sometimes do not work when Jython is installed via apt-get install jython; I also tried a non-standalone jar installation.

With sudo, ckanapi runs on Python; without sudo, it runs on Jython. Strangely, if Jython is made available under sudo, it seemingly randomly picks Python or Jython (mostly Jython, which then fails at the HTTPS request).

ckanapi with Python is able to dump datasets but fails to pack the data into a jsonl.xz, resulting in an empty JSON file. Since I have only been using ckanapi for the past two weeks, I do not yet understand how to make it work properly.

Also, data.gov has its own issue and cannot handle dump datasets --all. Solution: implementing ckanapi in Python and sending and receiving objects, or using a Python-Java bridge, might work. Since we want to dump all datasets, handling a specific dataset won't be a problem.

I'm testing dumping to a JSON file. Why does the process fail to stop unless a keyboard interrupt is given? why_it_fails_to_stop.png

earthquakesan commented 6 years ago

It seems like there is a problem with how the worker pool is handled in ckanapi:

Traceback (most recent call last):
  File "/home/ivan/opt/jython2.7.1/bin/ckanapi", line 10, in <module>
    sys.exit(
  File "/home/ivan/opt/jython2.7.1/Lib/site-packages/ckanapi-4.1-py2.7.egg/ckanapi/cli/main.py", line 129, in main
    return dump_things(ckan, thing[0], arguments)
  File "/home/ivan/opt/jython2.7.1/Lib/site-packages/ckanapi-4.1-py2.7.egg/ckanapi/cli/dump.py", line 83, in dump_things
    for job_ids, finished, result in pool:
  File "/home/ivan/opt/jython2.7.1/Lib/site-packages/ckanapi-4.1-py2.7.egg/ckanapi/cli/workers.py", line 93, in worker_pool
    readable, _, _ = select.select(worker_fds, [], [])
  File "/home/ivan/opt/jython2.7.1/Lib/_socket.py", line 1537, in select
    return _Select(rlist, wlist, xlist)(timeout)
  File "/home/ivan/opt/jython2.7.1/Lib/_socket.py", line 435, in __call__
    started = time.time()
  File "/home/ivan/opt/jython2.7.1/Lib/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/ivan/opt/jython2.7.1/Lib/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/ivan/opt/jython2.7.1/Lib/_socket.py", line 427, in _register_sockets
    socks = self._normalize_sockets(socks)
  File "/home/ivan/opt/jython2.7.1/Lib/_socket.py", line 415, in _normalize_sockets
    raise error(errno.EBADF, "Bad file descriptor: %s" % (sock,))
_socket.error: [Errno 9] Bad file descriptor: <open file '<fdopen>', mode 'rb' at 0x19f>

@varunmaitreya you need to try using ckanapi directly, without the CLI.
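A minimal sketch of what calling ckanapi directly might look like, using `RemoteCKAN` and its `action` shortcuts instead of the `ckanapi dump` command. It fetches datasets one at a time, so it never enters the `select`-based worker pool that appears in the traceback above; whether that also avoids the HTTPS problems under Jython is untested. The function names and the portal URL in the usage comment are placeholders, not something from this thread.

```python
import json


def make_client(url):
    """Create a ckanapi client (imported lazily; needs `pip install ckanapi`)."""
    from ckanapi import RemoteCKAN
    return RemoteCKAN(url)


def dump_datasets(ckan, out):
    """Write every dataset as one JSON line to `out`, sequentially.

    `ckan` is any object exposing ckanapi-style `action.package_list()` and
    `action.package_show(id=...)`. No worker processes are involved.
    Returns the number of datasets written.
    """
    count = 0
    for name in ckan.action.package_list():
        dataset = ckan.action.package_show(id=name)
        out.write(json.dumps(dataset, sort_keys=True) + "\n")
        count += 1
    return count


# Usage (placeholder URL, assumed to be a reachable CKAN portal):
#
#     with open("datasets.jsonl", "w") as out:
#         dump_datasets(make_client("https://demo.ckan.org"), out)
```

For large portals such as data.gov, `package_list` may be too big for a single call, so paging via `package_search` is probably the safer route there.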