dathere / datapusher-plus

A standalone web service that pushes data into the CKAN Datastore fast & reliably. It pushes real good!
GNU Affero General Public License v3.0

DataPusher+

DataPusher+ is a fork of Datapusher that combines the speed and robustness of ckanext-xloader with the data type guessing of Datapusher.

Datapusher+ is built using CKAN Service Provider, with Messytables replaced by qsv.

TNRIS/TWDB provided the use cases that informed and supported the development of Datapusher+, specifically, to support a Resource-first upload workflow.

For a more detailed overview, see the CKAN Monthly Live Jan 2023 presentation.

It features:

[^1]: Why use qsv instead of a "proper" python data analysis library like pandas?

[^2]: It takes 0.16 seconds with an index to run qsv stats against the qsv whirlwind tour sample file on a Ryzen 4800H (8 physical/16 logical cores) with 32 GB of memory and a 1 TB SSD. Without an index, it takes 1.3 seconds.

[^3]: Imagine you have a 1M row CSV, and the last row has an invalid value for a numeric column (e.g. "N/A" instead of a number). After spending hours pushing the data very slowly, legacy datapusher will abort on the last row and the ENTIRE job is invalid. Ok, that's bad, but what makes it worse is that the old table has been deleted already, and Datapusher doesn't tell you what caused the job to fail! YIKES!!!!

Development Installation

Datapusher+ is a drop-in replacement for Datapusher, so it's installed the same way.

  1. Install the required packages.

    sudo apt install python3-virtualenv python3-dev python3-pip python3-wheel build-essential libxslt1-dev libxml2-dev zlib1g-dev git libffi-dev libpq-dev file
  2. Create a virtual environment for Datapusher+ using at least python 3.8.

    cd /usr/lib/ckan
    sudo python3.8 -m venv dpplus_venv
    sudo chown -R $(whoami) dpplus_venv
    . dpplus_venv/bin/activate
    cd dpplus_venv

    ℹ️ NOTE: DP+ requires at least python 3.8 as it makes extensive use of new capabilities introduced in 3.7/3.8 to the subprocess module. If you're using Ubuntu 18.04 or earlier, follow the procedure below to install python 3.8:

    sudo add-apt-repository ppa:deadsnakes/ppa
    # we use 3.8 here, but you can get a higher version by changing the version suffix of the packages below
    sudo apt install python3.8 python3.8-venv python3.8-dev
    # install additional dependencies
    sudo apt install build-essential libxslt1-dev libxml2-dev zlib1g-dev git libffi-dev

    Note that DP+ still works with CKAN<=2.8, which uses older versions of python.

  3. Get the code.

    mkdir src
    cd src
    git clone --branch 0.11.0 https://github.com/datHere/datapusher-plus
    cd datapusher-plus
  4. Install the dependencies.

    pip install wheel
    pip install -r requirements-dev.txt
    pip install -e .
  5. Install qsv.

    Download the appropriate precompiled binaries for your platform and copy them to the appropriate directory, e.g. for Linux:

    wget https://github.com/jqnatividad/qsv/releases/download/0.108.0/qsv-0.108.0-x86_64-unknown-linux-gnu.zip
    unzip qsv-0.108.0-x86_64-unknown-linux-gnu.zip
    rm qsv-0.108.0-x86_64-unknown-linux-gnu.zip
    sudo mv qsv* /usr/local/bin

    Alternatively, if you want to install qsv from source, follow the instructions here. Note that when compiling from source, you may want to look into the Performance Tuning section to squeeze even more performance from qsv.

    Also, if you get glibc errors when starting qsv, your Linux distro may not have the required version of the GNU C Library (this will be the case when running Ubuntu 18.04 or older). If so, use the qsvdp_glibc-2.31 binary, as it's linked to an older version of glibc. If that still fails, then use the unknown-linux-musl.zip archive, as it is statically linked with the MUSL C Library.

    If you already have qsv, update it to the latest release by using the --update option.

    qsvdp --update

    ℹ️ NOTE: qsv is a general purpose CSV data-wrangling toolkit that gets regular updates. To update to the latest version, just run qsv with the --update option and it will check for the latest version and update as required.

  6. Configure the Datapusher+ database.

    Make sure to create the datapusher PostgreSQL user and the datapusher_jobs database (see DataPusher+ Database Setup).

  7. Copy the datapusher/dot-env.template to datapusher/.env and modify your configuration.

    cd /usr/lib/ckan/dpplus_venv/src/datapusher-plus/datapusher
    cp dot-env.template .env
    # configure your installation as required
    nano .env
  8. Run Datapusher+ in the dpplus_venv virtual environment.

    python main.py config.py

    By default, DP+ should be running at the following address:

    http://localhost:8800/
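The two prerequisites that most often trip up the steps above (a Python interpreter of at least 3.8 and a qsvdp binary on the PATH) can be checked up front. The following is a minimal sketch, not part of DP+ itself: the function names `version_ge` and `preflight` are made up for illustration, and it assumes bash plus GNU coreutils for `sort -V`.

```shell
#!/usr/bin/env bash
# Illustrative helpers only; neither function ships with DP+.

# version_ge A B: succeeds when dotted version A is >= B.
# Uses `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# preflight [PYTHON]: check the interpreter version and look for qsvdp on PATH.
preflight() {
    local py="${1:-python3}" ver
    if ! ver="$("$py" -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null)"; then
        echo "FAIL: cannot run $py"
        return 1
    fi
    if version_ge "$ver" 3.8; then
        echo "OK: $py is $ver"
    else
        echo "FAIL: $py is $ver, DP+ needs 3.8 or later"
        return 1
    fi
    if command -v qsvdp >/dev/null 2>&1; then
        echo "OK: qsvdp found at $(command -v qsvdp)"
    else
        echo "FAIL: qsvdp not found on PATH"
        return 1
    fi
}
```

After sourcing the file, `preflight python3.8` (or `preflight` for the default `python3`) prints one line per check and returns non-zero on the first failure.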

Production Deployment

There are two ways to deploy Datapusher+:

  1. Manual Deployment

    These instructions set up the DataPusher web service on uWSGI running on port 8800, but can be easily adapted to other WSGI servers like Gunicorn. You'll probably need to set up Nginx as a reverse proxy in front of it and something like Supervisor to keep the process up.

    # Install requirements for DataPusher+. Be sure to have at least Python 3.8
    sudo apt install python3-virtualenv python3-dev python3-pip python3-wheel build-essential libxslt1-dev libxml2-dev zlib1g-dev git libffi-dev libpq-dev file
    
    # Install qsv, if required
    wget https://github.com/jqnatividad/qsv/releases/download/0.108.0/qsv-0.108.0-x86_64-unknown-linux-gnu.zip -P /tmp
    unzip /tmp/qsv-0.108.0-x86_64-unknown-linux-gnu.zip -d /tmp
    rm /tmp/qsv-0.108.0-x86_64-unknown-linux-gnu.zip
    sudo mv /tmp/qsv* /usr/local/bin
    
    # if qsv is already installed, be sure to update it to the latest release
    sudo qsvdp --update
    
    # if you get a glibc error when running `qsvdp --update`
    # you're on an old distro (e.g. Ubuntu 18.04) without the required version of the glibc libraries.
    # If so, try running the qsvdp_glibc-2.31 binary instead. If it runs, you can use it instead of the default qsvdp binary.
    # If that still doesn't work, use the statically linked MUSL version instead
    # https://github.com/jqnatividad/qsv/releases/download/0.108.0/qsv-0.108.0-x86_64-unknown-linux-musl.zip
    
    # find out the locale settings
    locale
    
    # ONLY IF LANG is not "en_US.UTF-8", set locale
    export LC_ALL="en_US.UTF-8"
    export LC_CTYPE="en_US.UTF-8"
    sudo dpkg-reconfigure locales
    
    # Create a virtualenv for DataPusher+. DP+ requires at least python 3.8.
    sudo python3.8 -m venv /usr/lib/ckan/dpplus_venv
    sudo chown -R $(whoami) /usr/lib/ckan/dpplus_venv
    
    # install datapusher-plus in the virtual environment
    . /usr/lib/ckan/dpplus_venv/bin/activate
    pip install wheel
    pip install datapusher-plus
    
    # create an .env file and tune DP+ settings. Tune the uwsgi.ini file as well
    sudo mkdir -p /etc/ckan/datapusher-plus
    sudo curl https://raw.githubusercontent.com/dathere/datapusher-plus/master/datapusher/dot-env.template -o /etc/ckan/datapusher-plus/.env
    sudo curl https://raw.githubusercontent.com/dathere/datapusher-plus/master/deployment/datapusher-uwsgi.ini -o /etc/ckan/datapusher-plus/uwsgi.ini
    
    # Be sure to initialize the database if required. (See Database Setup section below)
    # Be sure to edit the .env file and set the right database connect strings!
    
    # Create a user to run the web service (if necessary)
    # www-data usually exists already on Debian/Ubuntu; create it only if missing
    sudo addgroup www-data
    sudo adduser --ingroup www-data www-data

    At this point you can run DataPusher+ with the following command:

    /usr/lib/ckan/dpplus_venv/bin/uwsgi --enable-threads -i /etc/ckan/datapusher-plus/uwsgi.ini

    You might need to change the uid and gid in the uwsgi.ini file when using a different user.

    To deploy it using supervisor:

    sudo curl https://raw.githubusercontent.com/dathere/datapusher-plus/master/deployment/datapusher-uwsgi.conf -o /etc/supervisor/conf.d/datapusher-uwsgi.conf
    sudo service supervisor restart
  2. Dockerized Deployment

    As the above procedure shows, installing Datapusher+ is quite involved. A containerized installation makes it far easier not only to deploy DP+ to production, but also to experiment with it.

    Instructions to set up the DP+ Docker instance can be found here.

    The DP+ Docker instance will also expose additional features and an administrative interface to manage not only Datapusher+ jobs, but also the CKAN Datastore.
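For the manual deployment route, the "ONLY IF LANG is not en_US.UTF-8" step can be made conditional in a script. This is a rough sketch with hypothetical function names (`is_en_us_utf8`, `ensure_locale`). Treating the spelling `en_US.utf8` as equivalent is an assumption added here for convenience, and bash is assumed for the `${v//-/}` expansion.

```shell
#!/usr/bin/env bash
# Illustrative helpers only; the names are not part of DP+.

# is_en_us_utf8 VALUE: succeeds when VALUE names the en_US UTF-8 locale.
# Assumption: "en_US.utf8" (as printed by `locale -a`) counts as the same locale.
is_en_us_utf8() {
    local v
    v="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
    v="${v//-/}"
    [ "$v" = "en_us.utf8" ]
}

# ensure_locale: export the locale variables only when LANG is something else.
ensure_locale() {
    if is_en_us_utf8 "${LANG:-}"; then
        echo "locale ok: $LANG"
    else
        echo "LANG is '${LANG:-unset}'; exporting en_US.UTF-8 for this shell"
        export LC_ALL="en_US.UTF-8" LC_CTYPE="en_US.UTF-8"
    fi
}
```

The export only affects the current shell. If the locale has not been generated on the host, `sudo dpkg-reconfigure locales` is still needed, as in the manual steps.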

Configuring

CKAN Configuration

Add datapusher to the plugins in your CKAN configuration file (generally located at /etc/ckan/default/ckan.ini):

ckan.plugins = <other plugins> datapusher

To tell CKAN where this web service is located, add the following to the [app:main] section of your CKAN configuration file:

ckan.datapusher.url = http://127.0.0.1:8800/

There are other CKAN configuration options that let you customize the CKAN - DataPusher integration. Please refer to the DataPusher Settings section in the CKAN documentation for more details.

ℹ️ NOTE: DP+ recognizes some additional TSV and spreadsheet subformats - xlsm and xlsb for Excel Spreadsheets, and tab for TSV files. To process these subformats, set ckan.datapusher.formats as follows in your ckan.ini file:

ckan.datapusher.formats = csv xls xlsx xlsm xlsb tsv tab application/csv application/vnd.ms-excel application/vnd.openxmlformats-officedocument.spreadsheetml.sheet ods application/vnd.oasis.opendocument.spreadsheet

and add this entry to your CKAN's resource_formats.json file.

["TAB", "Tab Separated Values File", "text/tab-separated-values", []],
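Once the settings above are in place, it can be handy to confirm what CKAN will actually read from the file. The sketch below is one way to do that with plain awk. `ini_get` and `check_ckan_ini` are hypothetical helper names, and the parser is deliberately simple: it handles single-line `key = value` entries only, so a `ckan.plugins` value wrapped over several lines would be reported incompletely.

```shell
#!/usr/bin/env bash
# Illustrative helpers only; CKAN does not provide these functions.

# ini_get FILE KEY: print the value of a single-line "KEY = value" entry.
# Lines starting with # or ; are skipped; if KEY appears twice, the last one wins.
ini_get() {
    awk -v key="$2" '
        /^[ \t]*[#;]/ { next }
        {
            pos = index($0, "=")
            if (pos == 0) next
            k = substr($0, 1, pos - 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)
            if (k == key) {
                v = substr($0, pos + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)
                found = v
            }
        }
        END { if (found != "") print found }
    ' "$1"
}

# check_ckan_ini FILE: report whether the datapusher plugin and URL are configured.
check_ckan_ini() {
    local ini="$1" url plugins
    url="$(ini_get "$ini" ckan.datapusher.url)"
    plugins="$(ini_get "$ini" ckan.plugins)"
    case " $plugins " in
        *" datapusher "*) echo "plugin: ok" ;;
        *) echo "plugin: datapusher missing from ckan.plugins" ;;
    esac
    echo "url: ${url:-<not set>}"
}
```

For example, `check_ckan_ini /etc/ckan/default/ckan.ini` prints the plugin status and the configured service URL.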

DataPusher+ Configuration

The DataPusher+ instance is configured in the .env file located in the working directory of DP+: /etc/ckan/datapusher-plus for a production deployment, or the datapusher-plus/datapusher source directory for a development installation.

See dot-env.template for a summary of configuration options available.

DataPusher+ Database Setup

DP+ requires a dedicated PostgreSQL account named datapusher to connect to the CKAN Datastore.

To create the datapusher user and give it the required privileges to the datastore_default database:

su - postgres
psql -d datastore_default
-- SUPERUSER is a role attribute, so it is set on the role rather than granted on the database
CREATE ROLE datapusher LOGIN SUPERUSER PASSWORD 'YOURPASSWORD';
GRANT CREATE, CONNECT, TEMPORARY ON DATABASE datastore_default TO datapusher;
GRANT SELECT, INSERT, UPDATE, DELETE, TRUNCATE ON ALL TABLES IN SCHEMA public TO datapusher;
\q

DP+ also requires its own job_store database to keep track of all the DP+ jobs. In the original Datapusher, this was a SQLite database by default. Though DP+ can still use a SQLite database, its use is discouraged.

To set up the datapusher_jobs database and its user:

sudo -u postgres createuser -S -D -R -P datapusher_jobs
sudo -u postgres createdb -O datapusher_jobs datapusher_jobs -E utf-8
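Both accounts end up as database connect strings in the DP+ .env file, so it is worth confirming that each one can log in before starting the service. A minimal sketch follows. `pg_uri` and `can_connect` are hypothetical helper names, the host `localhost` in the usage line is an assumption, and passwords containing characters such as `@`, `:` or `/` would need percent-encoding first.

```shell
#!/usr/bin/env bash
# Illustrative helpers only; the names are not part of DP+.

# pg_uri USER PASSWORD HOST DBNAME: print a PostgreSQL connection URI.
# The password is inserted as-is, so special characters must already be percent-encoded.
pg_uri() {
    printf 'postgresql://%s:%s@%s/%s\n' "$1" "$2" "$3" "$4"
}

# can_connect URI: succeeds when psql can log in and run a trivial query.
can_connect() {
    psql "$1" -tAc 'SELECT 1' >/dev/null 2>&1
}
```

For example, `can_connect "$(pg_uri datapusher YOURPASSWORD localhost datastore_default)" && echo ok` checks the Datastore account, and the same call with `datapusher_jobs` as both user and database checks the job store.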

Usage

DP+ attempts to load any file in one of the supported formats (defined in ckan.datapusher.formats) into the DataStore.

You can also manually trigger resources to be resubmitted. When editing a resource in CKAN (clicking the "Manage" button on a resource page), a new tab named "DataStore" will appear. This will contain a log of the last attempted upload and a button to retry the upload. Once a resource has been "pushed" into the Datastore, a "Data Dictionary" tab will also be available where the data publisher can fine-tune the inferred data dictionary.


Command line

Run the following command to submit all resources to Datapusher+. Resources whose data file hash has not changed are skipped:

ckan -c /etc/ckan/default/ckan.ini datapusher resubmit

On CKAN<=2.8:

paster --plugin=ckan datapusher resubmit -c /etc/ckan/default/ckan.ini

To resubmit the resources of a specific dataset, whether or not the hash of the data file has changed:

ckan -c /etc/ckan/default/ckan.ini datapusher submit {dataset_id}

On CKAN<=2.8:

paster --plugin=ckan datapusher submit <pkgname> -c /etc/ckan/default/ckan.ini
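When several datasets need to be resubmitted, the `ckan ... datapusher submit` form above can be wrapped in a loop. The sketch below uses a hypothetical function name (`submit_datasets`) and a hypothetical `DRY_RUN` variable that prints the commands instead of running them. It assumes the newer `ckan` CLI rather than `paster`.

```shell
#!/usr/bin/env bash
# Illustrative helper only; the function and DRY_RUN are not part of CKAN or DP+.

# submit_datasets INI DATASET...: resubmit each dataset in turn.
# Set DRY_RUN=1 to print the commands without executing them.
submit_datasets() {
    local ini="$1" id failed=0
    shift
    for id in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ckan -c $ini datapusher submit $id"
        elif ! ckan -c "$ini" datapusher submit "$id"; then
            echo "submit failed for $id" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

Running `DRY_RUN=1 submit_datasets /etc/ckan/default/ckan.ini my-dataset other-dataset` (placeholder dataset names) shows what would be executed. Depending on the CKAN version, each real invocation may ask for confirmation; `ckan datapusher submit --help` lists the options available on your installation.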

Testing

To test Datapusher+, you can use the test script available on GitHub.

Uninstalling Datapusher+

Should you need to remove Datapusher+, and you followed either the Development or Production Installation procedures above:

# if you're running inside the dpplus_venv virtual environment, deactivate it first
deactivate

# remove the DP+ python virtual environment
sudo rm -rf /usr/lib/ckan/dpplus_venv

# remove the supervisor DP+ configuration
sudo rm -rf /etc/supervisor/conf.d/datapusher-uwsgi.conf

# remove the DP+ production deployment directory
sudo rm -rf /etc/ckan/datapusher-plus

# remove qsv binary variants
sudo rm /usr/local/bin/qsv /usr/local/bin/qsvdp /usr/local/bin/qsvlite /usr/local/bin/qsv_nightly /usr/local/bin/qsvdp_nightly /usr/local/bin/qsvlite_nightly

# restart the supervisor, without the Datapusher+ service
sudo service supervisor reload

# ========= DATABASE objects ============
# OPTIONAL: backup the datapusher_jobs database first if 
# you want to retain the DP+ job history
sudo -u postgres pg_dump --format=custom -d datapusher_jobs > datapusher_jobs.dump

# to remove the Datapusher+ job database and the datapusher_jobs user/role
sudo -u postgres dropdb datapusher_jobs
sudo -u postgres dropuser datapusher_jobs

# to drop the datapusher user which DP+ uses to write to the CKAN Datastore
sudo -u postgres dropuser datapusher

To ensure the Datapusher+ service is not automatically invoked when tabular resources are uploaded, remove datapusher from ckan.plugins in your ckan.ini file.

Also remove or comment out the ckan.datapusher entries in your ckan.ini, such as ckan.datapusher.url and ckan.datapusher.formats.

Note that resources that have been pushed previously will still be available on the CKAN Datastore. You will have to delete these resources separately using the UI or the CKAN resource_delete API.

If you're no longer using the CKAN Datastore:

To confirm the uninstallation is successful, upload a new tabular resource and check if:

License

This material is copyright (c) 2020 Open Knowledge Foundation and other contributors

It is open and licensed under the GNU Affero General Public License (AGPL) v3.0 whose full text may be found at:

http://www.fsf.org/licensing/licenses/agpl-3.0.html