"Runnable" data packages

psychemedia commented 7 years ago

I was wondering if anyone has looked at toolchains that support the creation of ad hoc working environments around a particular environment.

For example, I have a recipe for creating an analysis environment that links and launches RStudio connected to MySQL seeded with a particular dataset using Docker compose and a Dockerfile that created the seeded database:

#Build this container based on the pre-existing offical MySQL Docker image
FROM mysql

#The offical image listens to requests on localost, IP address 127.0.0.1
#We're going to want it to be a bit more promiscuous
#and listen out for requests coming from *any* IP address (0.0.0.0)
#sed is a Linux commandline text editor - it updates the IP address in the MySQL config file
RUN sed -i -e"s/^bind-address\s*=\s*127.0.0.1/bind-address=0.0.0.0/" /etc/mysql/my.cnf

#Install a command line utility called wget
RUN apt-get update && apt-get install -y wget && apt-get clean

#Use wget to download a copy of the MySQL database dump from ergast.com
RUN wget http://ergast.com/downloads/f1db.sql.gz -P /docker-entrypoint-initdb.d

#Unzip the download
#The original MySQL container is configured to install items
#in the /docker-entrypoint-initdb.d/ directory when it starts up
RUN gunzip /docker-entrypoint-initdb.d/f1db.sql.gz

The seeded database is linked to RStudio:

ergastdb:
  container_name: ergastdb
  build: ergastdb/
  environment:
    MYSQL_ROOT_PASSWORD: f1
    MYSQL_DATABASE: ergastdb
  expose:
    - "3306"

f1dj_rstudio:
  image: rocker/tidyverse 
  ports:
    - "8787:8787"
  links:
    - ergastdb:ergastdb
  volumes:
    - ./rstudio:/home/rstudio

So in the extreme case, I imagine running a command something like the following as an equivalent to the above:

reallyfrictionless -app rocker/tidyverse -app jupyter/datascience-notebook -dbms myqsl -dbalias ergastdb -db ergastdb -dbpwd f1 -datasrc http://ergast.com/downloads/f1db.sql.gz -datatyp sql

and consequently being presented with an environment (Jupyter notebooks and RStudio) opened in my browser that let me link to the populated DB and access the data from it directly.

(Actually it'd have to be a tad more complicated to support eg mounted directories/volumes? Hmm... Maybe easier if we just allow a single -app then -guestdir and -hostdir switches, or a single -appvolume switch?)

-app: name of application container(s) to install -dbms: DBMS container -dbalias: alias by which DBMS is accessed from app containers -dbpwd: DBMS root password -db: database created and used inside DBMs -datasrc: data file -datatyp: datafile type

For a datapackage, use -datatyp datapackage and the URL to the data package, and configure the database from that?

(To simplify things on the DBMS side, may be sensible to define images reallyfrictionless/postgres, reallyfrictionless/mysql etc that have standardised config variables?)

In the interim, how about some simple Dockerfile recipes that allow a datapackage to be identified and a runnable DBMS docker image created from it?

pwalsh commented 7 years ago

Hi @psychemedia

have you looked at https://github.com/frictionlessdata/datapackage-pipelines ?

If I understand your use case correctly, DPP could be used to do exactly what you want, with the following additions on top of the base library:

A custom processor to open the current data package being streamed in R Studio
A custom processor to open the current data package being streamed in a Jupiter Notebook
Have a docker file that calls dpp with a pipeline spec, and takes a few env vars (path to the DP to open, open in R Studio or Jupyter, etc.)

We've love to see such a use case and even contributions of processors to do these things!

psychemedia commented 7 years ago

@pwalsh Not played with that, no... will take a look...

psychemedia commented 7 years ago

@pwalsh Actually, the datapackage isn't opened into Jupyter or RStudio environment directly. It's loaded into the database container, which is then linked to, and accessible from, the application environment(s).

That's why I posted to this repo...;-)

So the question more succinctly is: is there a simple, automated recipe for populating a MySQL or Postgres container with a datapackage so that someone who doesn't know how to set up a database can create a running containerised instance of a database containing the datapackage contents from a simple command?

roll commented 7 years ago

Thanks! It's MOVED to https://github.com/frictionlessdata/implementations/issues/18

frictionlessdata / tableschema-sql-py

"Runnable" data packages #62