geosolutions-it / ckan-docker-dcatapit


Initial setup #1

Closed etj closed 2 years ago

etj commented 2 years ago

We need a docker composition that will bring up all of the required services: CKAN itself, postgres, solr, redis, and datapusher.
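A minimal compose sketch (image names, tags, ports, and environment values here are illustrative assumptions, not this repo's actual files):

```yaml
# Sketch only: images, tags and env values are assumptions,
# not the actual composition in this repository.
version: "3"

services:
  db:
    image: postgres:12              # assumption: any recent postgres
    environment:
      POSTGRES_USER: ckan
      POSTGRES_PASSWORD: ckan
      POSTGRES_DB: ckan

  solr:
    image: ckan/ckan-solr:2.9       # assumption: a Solr image carrying the CKAN schema

  redis:
    image: redis:6

  ckan:
    image: ckan/ckan-base:2.9       # assumption: any CKAN 2.9 image
    depends_on: [db, solr, redis]
    environment:
      CKAN_SQLALCHEMY_URL: postgresql://ckan:ckan@db/ckan
      CKAN_SOLR_URL: http://solr:8983/solr/ckan
      CKAN_REDIS_URL: redis://redis:6379/0
      CKAN_DATAPUSHER_URL: http://datapusher:8800
    ports:
      - "5000:5000"

  datapusher:
    image: ckan/datapusher          # assumption: any datapusher image listening on 8800
```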

Some references:

lpasquali commented 2 years ago

hello @etj @randomorder the first implementation skeleton is done; as per the communication with @etj, the multilang and dcatapit plugins are yet to be installed and enabled by the composition, as they are not currently working with this CKAN version.

ckan, redis, solr and postgres seem to be fine. But I do not understand why datapusher is not working: even though it requires very little configuration, it fails to move data from its file/URL source to the datastore DB. I also tried a more recent docker image of it. Could you try to run the composition @etj and see if you spot what's the issue with datapusher?

here is the datapusher log after I try to upload this example CSV:

datapusher    | *** Starting uWSGI 2.0.19.1 (64bit) on [Mon Nov  8 08:35:12 2021] ***
datapusher    | compiled with version: 10.2.1 20201203 on 03 August 2021 04:54:30
datapusher    | os: Linux-5.11.0-38-generic #42~20.04.1-Ubuntu SMP Tue Sep 28 20:41:07 UTC 2021
datapusher    | nodename: 85f70f3f126d
datapusher    | machine: x86_64
datapusher    | clock source: unix
datapusher    | detected number of CPU cores: 8
datapusher    | current working directory: /srv/app
datapusher    | detected binary path: /usr/bin/uwsgi
datapusher    | !!! no internal routing support, rebuild with pcre support !!!
datapusher    | your memory page size is 4096 bytes
datapusher    | detected max file descriptor number: 1048576
datapusher    | - async cores set to 2000 - fd table size: 1048576
datapusher    | lock engine: pthread robust mutexes
datapusher    | thunder lock: disabled (you can enable it with --thunder-lock)
datapusher    | uWSGI http bound on :8800 fd 3
datapusher    | uwsgi socket 0 bound to UNIX address /tmp/uwsgi.sock fd 6
datapusher    | Python version: 3.8.10 (default, May  6 2021, 00:05:59)  [GCC 10.2.1 20201203]
datapusher    | Python main interpreter initialized at 0x7faf81765570
datapusher    | python threads support enabled
datapusher    | your server socket listen backlog is limited to 100 connections
datapusher    | your mercy for graceful operations on workers is 60 seconds
datapusher    | mapped 62923392 bytes (61448 KB) for 4000 cores
datapusher    | *** Operational MODE: preforking+async ***
datapusher    | WSGI app 0 (mountpoint='') ready in 1 seconds on interpreter 0x7faf81765570 pid: 1 (default app)
datapusher    | *** uWSGI is running in multiple interpreter mode ***
datapusher    | spawned uWSGI master process (pid: 1)
datapusher    | spawned uWSGI worker 1 (pid: 11, cores: 2000)
datapusher    | spawned uWSGI worker 2 (pid: 12, cores: 2000)
datapusher    | spawned uWSGI http 1 (pid: 13)
datapusher    | *** running gevent loop engine [addr:0x56456555fd30] ***
datapusher    | Fetching from: https://raw.githubusercontent.com/datopian/CKAN_Demo_Datasets/main/resources/org1_sample.csv
datapusher    | Error notifying listener
datapusher    | Traceback (most recent call last):
datapusher    |   File "/usr/lib/python3.8/site-packages/apscheduler/scheduler.py", line 512, in _run_job
datapusher    |     retval = job.func(*job.args, **job.kwargs)
datapusher    |   File "/usr/lib/python3.8/site-packages/datapusher/jobs.py", line 450, in push_to_datastore
datapusher    |     existing = datastore_resource_exists(resource_id, api_key, ckan_url)
datapusher    |   File "/usr/lib/python3.8/site-packages/datapusher/jobs.py", line 237, in datastore_resource_exists
datapusher    |     raise HTTPError(
datapusher    | datapusher.jobs.HTTPError: <unprintable HTTPError object>
datapusher    | 
datapusher    | During handling of the above exception, another exception occurred:
datapusher    | 
datapusher    | Traceback (most recent call last):
datapusher    |   File "/usr/lib/python3.8/site-packages/apscheduler/scheduler.py", line 239, in _notify_listeners
datapusher    |     cb(event)
datapusher    |   File "/usr/lib/python3.8/site-packages/ckanserviceprovider/web.py", line 189, in job_listener
datapusher    |     db.mark_job_as_errored(job_id, error_object)
datapusher    |   File "/usr/lib/python3.8/site-packages/ckanserviceprovider/db.py", line 413, in mark_job_as_errored
datapusher    |     _update_job(job_id, update_dict)
datapusher    |   File "/usr/lib/python3.8/site-packages/ckanserviceprovider/db.py", line 348, in _update_job
datapusher    |     job_dict["error"] = json.dumps(job_dict["error"])
datapusher    |   File "/usr/lib/python3.8/json/__init__.py", line 231, in dumps
datapusher    |     return _default_encoder.encode(obj)
datapusher    |   File "/usr/lib/python3.8/json/encoder.py", line 199, in encode
datapusher    |     chunks = self.iterencode(o, _one_shot=True)
datapusher    |   File "/usr/lib/python3.8/json/encoder.py", line 257, in iterencode
datapusher    |     return _iterencode(o, 0)
datapusher    |   File "/usr/lib/python3.8/json/encoder.py", line 179, in default
datapusher    |     raise TypeError(f'Object of type {o.__class__.__name__} '
datapusher    | TypeError: Object of type Response is not JSON serializable
datapusher    | Job "push_to_datastore (trigger: RunTriggerNow, run = True, next run at: None)" raised an exception
datapusher    | Traceback (most recent call last):
datapusher    |   File "/usr/lib/python3.8/site-packages/apscheduler/scheduler.py", line 512, in _run_job
datapusher    |     retval = job.func(*job.args, **job.kwargs)
datapusher    |   File "/usr/lib/python3.8/site-packages/datapusher/jobs.py", line 450, in push_to_datastore
datapusher    |     existing = datastore_resource_exists(resource_id, api_key, ckan_url)
datapusher    |   File "/usr/lib/python3.8/site-packages/datapusher/jobs.py", line 237, in datastore_resource_exists
datapusher    |     raise HTTPError(
datapusher    | datapusher.jobs.HTTPError: <unprintable HTTPError object>
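Looking at the traceback, the HTTPError is raised in datastore_resource_exists, i.e. while datapusher calls back into the CKAN API, so one thing to check is whether the URLs the two containers use to reach each other actually resolve inside the compose network. A hypothetical ckan.ini fragment using service names instead of localhost (values are assumptions for this composition):

```ini
# Sketch: these URLs must be reachable from inside the containers,
# so use compose service names rather than localhost.
ckan.site_url = http://ckan:5000
ckan.datapusher.url = http://datapusher:8800
# base URL that datapusher uses to call back into CKAN's API
ckan.datapusher.callback_url_base = http://ckan:5000
```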
randomorder commented 2 years ago

duplicate of https://github.com/geosolutions-it/ADBPO-C041/issues/82

etj commented 2 years ago

@lpasquali The setup is still missing the gather_consumer and the fetch_consumer processes: https://github.com/ckan/ckanext-harvest#setting-up-the-harvesters-on-a-production-server

They share the same code and extensions as CKAN, so they can either run inside the ckan container or have their own. They communicate with the central CKAN instance via redis, and need access to the DB, solr and the filesystem (config, storage); one possible arrangement is sketched below.
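A sketch of the consumers as extra compose services (image name, config path and the exact CLI spelling are assumptions and depend on the CKAN / ckanext-harvest versions; on CKAN <= 2.8 the equivalent paster commands from the linked README apply):

```yaml
# Fragment to be merged under the existing services: section.
# Reuse the ckan image so the consumers share code, extensions and config.
  gather_consumer:
    image: ckan/ckan-base:2.9       # assumption: same image as the ckan service
    # may be spelled gather_consumer depending on the ckanext-harvest version
    command: ckan -c /srv/app/ckan.ini harvester gather-consumer
    depends_on: [db, solr, redis]
    volumes:
      - ckan_storage:/var/lib/ckan  # shared storage with the ckan container

  fetch_consumer:
    image: ckan/ckan-base:2.9
    command: ckan -c /srv/app/ckan.ini harvester fetch-consumer
    depends_on: [db, solr, redis]
    volumes:
      - ckan_storage:/var/lib/ckan
```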

Furthermore, we need another harvester command run by cron every 5 minutes (docs about this at the same link above).
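For instance, a crontab entry along these lines (config and log paths are assumptions):

```
# Sketch: trigger the pending harvest jobs every 5 minutes
*/5 * * * * ckan -c /srv/app/ckan.ini harvester run >> /var/log/ckan/harvester_run.log 2>&1
```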

It would be good to have them log to file; even better if those log files resided on the host filesystem.
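That could be as simple as redirecting each consumer's output to a file under a bind-mounted directory, e.g. (paths are assumptions):

```yaml
# Sketch: ./logs on the host ends up holding the consumer log files
  gather_consumer:
    volumes:
      - ./logs:/var/log/ckan
    command: sh -c 'ckan -c /srv/app/ckan.ini harvester gather-consumer >> /var/log/ckan/gather_consumer.log 2>&1'
```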

etj commented 2 years ago

This issue is almost complete. There are still a couple of related tasks to implement, so I'm closing this one and moving the remaining work and related details into: