dathere / datapusher-plus

A standalone web service that pushes data into the CKAN Datastore fast & reliably. It pushes real good!
GNU Affero General Public License v3.0

Convert DP+ to a CKAN extension #98

Open jqnatividad opened 1 year ago

jqnatividad commented 1 year ago

DP+ was forked from DP which uses ckanserviceprovider.

ckanserviceprovider was meant to be a general library for CKAN for making web services, and was originally created in 2013.

Right now, DP and DP+ are really the only users of the library, and in the 10 years since, a lot has changed.

DP+ 0.x will continue to use ckanserviceprovider so older CKAN installations using DataPusher can still use it.

However, now that CKAN v2.9 uses Python 3.8+ (which DP+ requires anyway, since it calls qsv using subprocess options only available in Python 3.7+), v1.x of DP+ will work as a CKAN extension.

This has the following benefits:

This is already WIP in the https://github.com/dathere/datapusher-plus/tree/dev-v1.0 branch and we hope to make the first release in summer 2023.

jqnatividad commented 1 year ago

DP+ 1.x should also be more interactive - i.e. users should have the ability to configure the DP+ job on a per-resource basis.

99% of the time, the default configuration should work fine. But when the user wants to tweak the DP+ configuration for a particular job (e.g. ask for PII screening using a particular PII screening configuration, add summary stats, check for dupes, etc.), there should be some DP+ knobs exposed in the Datastore tab to change DP+ settings just for that resource.

sagargg commented 11 months ago

Hi @jqnatividad, any idea when it will be ready for release?

tino097 commented 11 months ago

@sagargg it's almost done; any testing on the dev-v1.0 branch would be helpful

jhbruhn commented 6 months ago

What is the current state of this migration? I am considering setting up datapusher-plus, but would like to avoid setting up an external service if future updates require it to be a ckanext instead. Could I already use the dev-v1.0 branch for this, and if so, how?

tino097 commented 6 months ago

@jhbruhn dev-v1.0 can be used to set up DP+ as an extension, and you are more than welcome to give it a try

To use it, just check out that branch and follow the same installation steps as for any other CKAN extension.
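
For reference, a rough sketch of what that could look like (the pip commands mirror the install pattern used later in this thread; the plugin entry-point name is whatever the branch's setup.py registers, which has changed between revisions, so check it there):

# install DP+ from the dev-v1.0 branch into CKAN's virtualenv
pip3 install -e "git+https://github.com/dathere/datapusher-plus.git@dev-v1.0#egg=datapusher-plus"
pip3 install -r https://raw.githubusercontent.com/dathere/datapusher-plus/dev-v1.0/requirements.txt

# then add the entry point registered in the branch's setup.py to ckan.plugins
# in your ckan.ini and restart CKAN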

jhbruhn commented 6 months ago

Thanks. I installed the extension in my CKAN 2.10 test instance and included it as the "datapusher" plugin (according to this https://github.com/dathere/datapusher-plus/blob/1cfba1c1373b6890f7338df1fa0f46a28297f0ef/setup.py#L85). I then set CKAN_DATAPUSHER_URL to the value I was using with the original datapusher (CKAN would not start without it defined), but stopped that service.

But when I upload a file and try to push it to the datastore (it also happened automatically), I get this error in the web interface: "Could not connect to DataPusher.", with no more specific error in the CKAN logs.

tino097 commented 6 months ago

You should start a job worker: ckan jobs worker
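
For example (the config path is only illustrative; point it at your own ckan.ini):

# run a CKAN background job worker alongside the web process
ckan -c /srv/app/ckan.ini jobs worker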

jhbruhn commented 6 months ago

I did (now), but the result is still the same. Could the original, ckan internal datapusher be interfering with datapusher-plus here? Should I open another issue?

jhbruhn commented 6 months ago

If I provide my original datapusher configuration and service alongside the activated datapusher-plus configuration, it uses the classic datapusher to push data to my resources. While a task in the worker queue for datapusher-plus seems to be created, it is not used, and nothing happens in my ckan jobs worker process.

tino097 commented 6 months ago

@jhbruhn it should be added to the plugins setting as datapusher_plus, i.e. ckan.plugins = stats ... datapusher_plus ...

jhbruhn commented 6 months ago

That leads to this exception at startup:

ckan   | Traceback (most recent call last):
ckan   |   File "/srv/app/wsgi.py", line 20, in <module>
ckan   |     application = make_app(config)
ckan   |   File "/srv/app/src/ckan/ckan/config/middleware/__init__.py", line 27, in make_app
ckan   |     load_environment(conf)
ckan   |   File "/srv/app/src/ckan/ckan/config/environment.py", line 69, in load_environment
ckan   |     p.load_all()
ckan   |   File "/srv/app/src/ckan/ckan/plugins/core.py", line 222, in load_all
ckan   |     load(*plugins)
ckan   |   File "/srv/app/src/ckan/ckan/plugins/core.py", line 238, in load
ckan   |     service = _get_service(plugin)
ckan   |   File "/srv/app/src/ckan/ckan/plugins/core.py", line 346, in _get_service
ckan   |     raise PluginNotFoundException(plugin_name)
ckan   | ckan.plugins.core.PluginNotFoundException: datapusher_plus

Perhaps due to this commit? https://github.com/dathere/datapusher-plus/commit/1cfba1c1373b6890f7338df1fa0f46a28297f0ef

The plugin is definitely installed, as some of it is run when including the datapusher plugin.

tino097 commented 6 months ago

@jhbruhn my bad, I forgot that I had changed that.

Are you using a dockerized environment for your setup?

jhbruhn commented 6 months ago

Yes, this is the CKAN Dockerfile I am using, based on the official image:

FROM ckan/ckan-base:2.10

### DCAT ###
ARG CKANEXT_DCAT_VERSION="1.5.1"
RUN pip3 install -e git+https://github.com/ckan/ckanext-dcat.git@v${CKANEXT_DCAT_VERSION}#egg=ckanext-dcat && \
    pip3 install -r https://raw.githubusercontent.com/ckan/ckanext-dcat/v${CKANEXT_DCAT_VERSION}/requirements.txt

### LDAP with ignore referrals patch ###
ARG CKANEXT_LDAP_VERSION="add_ignore_referrals"
RUN apk add --no-cache \
        openldap-dev
RUN pip3 install -e git+https://github.com/jhbruhn/ckanext-ldap.git@${CKANEXT_LDAP_VERSION}#egg=ckanext-ldap

### PDFView
ARG CKANEXT_PDFVIEW_VERSION="0.0.8"
RUN pip3 install -e git+https://github.com/ckan/ckanext-pdfview.git@${CKANEXT_PDFVIEW_VERSION}#egg=ckanext-pdfview

### Geoview
ARG CKANEXT_GEOVIEW_VERSION="0.1.0"
RUN pip3 install ckanext-geoview==${CKANEXT_GEOVIEW_VERSION}

### Spatial
ARG CKANEXT_SPATIAL_VERSION="2.1.1"
RUN apk add --no-cache \
        proj-util \
        proj-dev \
        geos-dev
RUN pip3 install -e git+https://github.com/ckan/ckanext-spatial.git@v${CKANEXT_SPATIAL_VERSION}#egg=ckanext-spatial && \
    pip3 install -r https://raw.githubusercontent.com/ckan/ckanext-spatial/v${CKANEXT_SPATIAL_VERSION}/requirements.txt

### Hierarchy
ARG CKANEXT_HIERARCHY_VERSION="1.2.1"
RUN pip3 install -e git+https://github.com/ckan/ckanext-hierarchy.git@v${CKANEXT_HIERARCHY_VERSION}#egg=ckanext-hierarchy && \
    pip3 install -r https://raw.githubusercontent.com/ckan/ckanext-hierarchy/v${CKANEXT_HIERARCHY_VERSION}/requirements.txt

### Pages
ARG CKANEXT_PAGES_VERSION="0.5.2"
RUN pip3 install -e git+https://github.com/ckan/ckanext-pages.git@v${CKANEXT_PAGES_VERSION}#egg=ckanext-pages && \
    pip3 install -r https://raw.githubusercontent.com/ckan/ckanext-pages/v${CKANEXT_PAGES_VERSION}/requirements.txt

### Datapusher Plus
ARG CKANEXT_DATAPUSHER_PLUS_VERSION="1cfba1c1373b6890f7338df1fa0f46a28297f0ef"
RUN pip3 install -e git+https://github.com/dathere/datapusher-plus.git@${CKANEXT_DATAPUSHER_PLUS_VERSION}#egg=datapusher-plus && \
    pip3 install -r https://raw.githubusercontent.com/dathere/datapusher-plus/${CKANEXT_DATAPUSHER_PLUS_VERSION}/requirements.txt

# Activate plugins
ENV CKAN__PLUGINS "stats text_view image_view datatables_view video_view audio_view ldap pdf_view dcat dcat_json_interface structured_data envvars datapusher_plus datastore resource_proxy geo_view geojson_view shp_view wmts_view spatial_metadata spatial_query hierarchy_display hierarchy_form hierarchy_group_form pages"

qsv is obviously still missing from that Dockerfile, but it doesn't even get far enough to call datapusher_plus; it only calls the classic datapusher functions as far as I can see.
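
(For later reference, something along these lines could add qsv to this Alpine-based image; the release version, repo location, and asset name below are assumptions, so verify them against the qsv releases page first:)

### qsv (required by Datapusher Plus)
# version, repo location and asset name are assumptions - check the qsv releases page
ARG QSV_VERSION="0.128.0"
RUN apk add --no-cache unzip && \
    wget -qO /tmp/qsv.zip "https://github.com/jqnatividad/qsv/releases/download/${QSV_VERSION}/qsv-${QSV_VERSION}-x86_64-unknown-linux-musl.zip" && \
    unzip /tmp/qsv.zip qsv -d /usr/local/bin && \
    chmod +x /usr/local/bin/qsv && \
    rm /tmp/qsv.zip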

tino097 commented 6 months ago

Thanks for the input, I will try to locate what the issue is with the dockerized setup. Starting DP+ from a source install was working, and it is being tested on one of our development environments. I hope to fix this soon and start using DP+ as an extension more often.

jhbruhn commented 3 months ago

I've just looked into this again, and it seems that I am getting further than before, yay!

I now get this error while executing a datapusher-plus job:

[SQL: INSERT INTO logs (job_id, timestamp, message, level, module, "funcName", lineno) VALUES (%(job_id)s, %(timestamp)s, %(message)s, %(level)s, %(module)s, %(funcName)s, %(lineno)s) RETURNING logs.id]
[parameters: {'job_id': '9de313e1-1969-45af-a07d-67031ff65fd0', 'timestamp': datetime.datetime(2024, 3, 28, 10, 51, 29, 238720), 'message': 'Fetching from: http://localhost:5000/dataset/a68c293b-d3ea-4318-9bac-c122ff4552b9/resource/1900fa9d-baf6-442a-9e58-d86e4e7d873c/download/kanton_zuerich_266.csv...', 'level': 'INFO', 'module': 'jobs', 'funcName': '_push_to_datastore', 'lineno': 441}]
(Background on this error at: https://sqlalche.me/e/14/f405) (Background on this error at: https://sqlalche.me/e/14/7s2a)
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
    self.dialect.do_execute(
  File "/usr/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.UndefinedColumn: column logs.id does not exist
LINE 1: ...'INFO', 'jobs', '_push_to_datastore', 441) RETURNING logs.id
                                                                ^

I have run the ckan datapusher init-db command, so the migrations should have been run. This is running on a (testing) original datapusher/datastore database. Should I move this to a new issue, or is it expected behaviour for DP+ dev-v1.0 right now?

tino097 commented 3 months ago

@jhbruhn thanks for testing the DP+

Yes, there is an issue with the migration script; I think I have a fix for it. It is currently on the fixing-issues branch, but not ready for merging into dev-v1.0 yet.

jhbruhn commented 3 months ago

Thanks for your reply, that branch is indeed working quite well with my limited tests.

pdelboca commented 2 months ago

@jqnatividad is there any roadmap for v1.0 of datapusher-plus? I'm looking forward to starting to use it, but currently the fact that it is based on ckanserviceprovider is a limitation, even more so knowing that it will soon be a CKAN extension.

I have been testing the dev-v1.0 (and fixing-issues) branches, and for normal workflows I didn't find any major issues.

Please let me know if there is anything else we can help with!

jqnatividad commented 2 months ago

@pdelboca, it's really on me. DP+ 0.x stalled a bit because the main implementation that was informing its build was paused for a few months.

We started up again a month ago and we're currently writing specs for that implementation (we actually just submitted the "DP+ - Data Resource Upload First (DRUF) edition" specs for review by the customer today).

The good thing is that @tino097 has been working diligently on DP+ 1.0 regardless, given that ckanserviceprovider has been deprecated and I don't intend to keep working on 0.x.

I'll generalize the spec and publish it here as it will functionally serve as the DP+ 1.0 roadmap.

We are breaking down DP+ DRUF development into three phases.


Phase 1 will be primarily adapted to the customer's requirements and is targeted for a 2024 summer release.

Phase 2 is still being scoped out, but its focus will be: applying DRUF metadata inferencing to spatial data formats; being able to specify formulas for computed metadata fields in the scheming YAML file; and using IDataDictionaryForm to implement an expanded data dictionary that stores summary statistics and frequency data for each field.

Phase 3 is even more aspirational, but it's looking to do "smarter analysis" using machine learning techniques (perhaps leveraging @amercader's work on related datasets, and maybe using qsv describegpt to do tag suggestions).

I'm also thinking of using qsv's schema and validate commands.

schema, to automatically create a JSONSchema validation file that can be used to enforce field data types and domain/range values when updating a resource. For each column in a CSV, it will enforce the following:

The generated JSONSchema file can be further fine-tuned by the Data Steward to apply fairly complex validation rules if required, as the crate I'm using fully implements JSONSchema validation and is widely deployed in production systems in the Rust ecosystem.

Currently, the qsv validate command is used by DP+ to ensure that CSVs are well-formed, conform to the RFC 4180 standard, and are UTF-8 encoded. In Phase 3, I want to use validate's ability to take a JSONSchema generated by qsv schema and check that the CONTENTS of the CSV conform to that schema definition.
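
Roughly, the flow would look like this on the command line (the output filename is what qsv schema produces by default, as far as I recall; options elided):

# infer a JSONSchema from the CSV (writes data.csv.schema.json alongside the input)
qsv schema data.csv

# validate the CSV's contents against that generated schema
qsv validate data.csv data.csv.schema.json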

And it's quite performant - on benchmarks using a 1 million row sample of NYC's 311 data with a non-trivial JSONschema, it validates the CSV in 1.3 seconds!

And yeah, we would love your help! Perhaps we can carve out some time at CSVConf next month?

cc @twdbben @wardi

pdelboca commented 2 months ago

That's an exciting roadmap @jqnatividad! Thanks for sharing, and let's definitely catch up at CSVConf :)

As a suggestion/thought: in terms of releases, it would be nice to split architectural changes and features into separate releases. It will be easier to develop, deploy, and maintain, and it will give the community time to adapt. Mixing behavioral changes with architectural ones can be messy because new errors can come from two fronts :). In that sense, an initial release of DP+ v1.0 as a CKAN extension would be great.

u10313335 commented 1 week ago

Since DP+ is now a CKAN extension and we no longer need the built-in datapusher plugin, I think we should include the assets from the datapusher plugin to prevent the following import errors:

2024-07-02 08:49:09,763 ERROR [ckan.lib.webassets_tools] Trying to include unknown asset:
<ckanext-datapusher/datapusher-css>
2024-07-02 08:49:09,823 ERROR [ckan.lib.webassets_tools] Trying to include unknown asset:
<ckanext-datapusher/datapusher>