ajschumacher / ajschumacher.github.io

blog
http://planspace.org/

something about all these pipelining tools #86

Open ajschumacher opened 9 years ago

ajschumacher commented 9 years ago

Collecting items from a Slack conversation with @lauralorenz, etc.:

Framing question(s):

Are there common frameworks out there that people use to manage larger data science software projects? Something like Django, but with a data spin? Or are there any best practices for managing a large data science software project?

I guess another way to structure my question is: are there frameworks or best practices out there that enforce any sort of convention on the data pipeline?

Follow-up question:

Why are there so many of these??? And yet I don't know/like any of them very much? (Maybe that's part of the answer to the first part...)

A tour through several different angles on this:


sklearn has its pipeline stuff, but it's just for sklearn-compliant parts, and naturally just in Python: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
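For reference, roughly what that looks like: every step but the last must implement fit/transform, and the whole chain then behaves like a single estimator. A minimal sketch with toy data:

```python
# Minimal sklearn Pipeline sketch: chained transformer + estimator.
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),      # any step with fit/transform
    ("model", LogisticRegression()),  # final step only needs fit/predict
])
pipe.fit(X, y)
print(pipe.predict(X[:5]))
```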


make: the classic (as espoused by kaggle, even!) http://blog.kaggle.com/2012/10/15/make-for-data-scientists/

https://www.gnu.org/software/make/


drake

Data workflow tool, like a "Make for data"

Kind of neat; and people at Factual get to play with Clojure!


http://bubbles.databrewery.org/ Bubbles is a Python framework for data processing and data quality measurement. Its basic concepts are abstract data objects, operations, and dynamic operation dispatch.


http://keystone-ml.org/ "KeystoneML is software from the UC Berkeley AMPLab designed to simplify the construction of large-scale end-to-end machine learning pipelines."


https://github.com/spotify/luigi "Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in."
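A rough sketch of what a Luigi task chain looks like (file paths and task names here are made up): each task declares its dependencies via requires() and its output target, and Luigi only runs what's missing.

```python
import luigi

class Extract(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("raw/%s.csv" % self.date)

    def run(self):
        with self.output().open("w") as f:
            f.write("...pretend this pulls from some API...\n")

class Transform(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return Extract(date=self.date)

    def output(self):
        return luigi.LocalTarget("clean/%s.csv" % self.date)

    def run(self):
        with self.input().open() as raw, self.output().open("w") as out:
            out.write(raw.read().upper())  # stand-in for real cleaning

if __name__ == "__main__":
    luigi.run()  # e.g. python pipeline.py Transform --date 2015-09-01 --local-scheduler
```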


https://github.com/airbnb/airflow Airflow is a system to programmatically author, schedule and monitor data pipelines.
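Roughly, an Airflow pipeline is a Python file defining a DAG of operators; a minimal sketch (import paths vary across Airflow versions, and the task commands here are made up):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # 1.x-era import path

dag = DAG("nightly_etl",
          start_date=datetime(2015, 9, 1),
          schedule_interval="@daily")

extract = BashOperator(task_id="extract",
                       bash_command="python extract.py",
                       dag=dag)
transform = BashOperator(task_id="transform",
                         bash_command="python transform.py",
                         dag=dag)

transform.set_upstream(extract)  # transform runs only after extract succeeds
```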


https://github.com/pinterest/pinball Pinball is a scalable workflow manager


http://oozie.apache.org/ Oozie; you have to specify stuff with a bunch of XML (Steph presented on this)

Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.

Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.

Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).

Oozie is a scalable, reliable and extensible system.


fluentd / logstash / kafka are sort of related in that they are about routing messages around and making sure those "pipelines" work without fail. Add on elasticsearch/kibana for even more data do-stuffery.


http://www.treasuredata.com/ PAAS: Probably As A Service - you can use this thing to set up fluentd stuff?


https://civisanalytics.com/products/platform/ the Civis Data Science Platform https://www.youtube.com/watch?v=nMalICUv1UM a layer of fancy on top of redshift?


http://www.blendo.co/ "Create one Single Source of Truth for your data." and host your data in el cloudo


something on versioning databases http://enterprisecraftsmanship.com/2015/08/10/database-versioning-best-practices/
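Alembic (mentioned below) is one concrete take on this: each schema change is a migration script with paired upgrade/downgrade functions. A rough sketch, with made-up table names and revision ids:

```python
# Sketch of an alembic migration script.
from alembic import op
import sqlalchemy as sa

revision = "a1b2c3d4"   # this migration's id (hypothetical)
down_revision = None    # the migration it builds on (None = first)

def upgrade():
    op.add_column("events", sa.Column("source", sa.String(length=50)))

def downgrade():
    op.drop_column("events", "source")
```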


What am I missing? What are the best and coolest tools?

ajschumacher commented 9 years ago

Harlan has been working on a thing: https://twitter.com/HarlanH/status/641026432349675521

tlevine commented 9 years ago

I manage to fit a lot into ordinary Python. I have been working on documentation of this approach, but the only decent documentation I have so far is this.

ajschumacher commented 9 years ago

Thanks @tlevine!

karlhigley commented 9 years ago

Spark's MLlib now supports pipelines.
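(For reference, a spark.ml pipeline looks a lot like sklearn's: an ordered list of stages fit as one unit. A rough sketch in PySpark with toy data; API details vary by Spark version.)

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
training = spark.createDataFrame([("a b c", 1.0), ("d e f", 0.0)],
                                 ["text", "label"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(maxIter=10),
])
model = pipeline.fit(training)
```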

ajschumacher commented 9 years ago

Thanks @karlhigley!

lauralorenz commented 9 years ago

More stuff I found:

  • Joblib, which I've used before for parallel processing (a nice wrapper of the standard multiprocessing library), but which can also manage checkpointing large computational pipelines with an easy decorator (rough sketch after this list).
  • Spyre, a nascent data application framework built on cherrypy and jinja2, with convenience classes for data wrangling and data visualization.
  • CubicWeb, a "semantic web framework" that builds from the data model upwards. Useful tools for building, observing, and updating RDBMS schemas out of the box.
  • Cubes, a framework to describe data models and auto-build APIs into them.
  • Dispel4py, a framework for abstractly defining distributed data workflows, with supported backends such as Apache Storm. Also, the release paper.
  • Luigi, a Python library for job pipelining that comes with a web management console to track tasks.
  • alembic and south for relational database versioning
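The Joblib checkpointing mentioned above is roughly this (the cache directory and function are made up): wrap an expensive step with Memory.cache, and repeat runs with the same arguments read the result from disk instead of recomputing.

```python
from joblib import Memory

memory = Memory("./pipeline_cache", verbose=0)

@memory.cache  # result is pickled to disk, keyed on the function's arguments
def expensive_step(input_path):
    with open(input_path) as f:        # stand-in for a slow transformation
        return sorted(f.read().split())
```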

ajschumacher commented 9 years ago

Awesome! Thanks @lauralorenz!

ajschumacher commented 9 years ago

More thoughtful thoughts:

I've separated out our list of concerns, with our current solution in parentheses. I'd love 1) suggestions for some mega framework this all fits into nicely, 2) suggestions at a more modular level, e.g. ETL, ML, visualization, or 3) suggestions of better libraries/tools to use in the pipeline we've grown with to date. And to clarify what I mean by "large": I mean something that was once a collection of scripts, grew organically/without structure, and now is too big to handle. So not large in the sense of needing distributed infrastructure, as so far we've been able to deal by just going up hardware classes.

ETL

  • Performs data ingestion and wrangling from diverse APIs into a data store (pandas, regular ol' python)
  • Supports smart rollback and reporting when data is corrupted (some function-bound commit/rollback with psycopg and try/excepts, as sketched after this list; 'logging' with print statements, not that wide-reaching or easy to trace back)
  • Supports distributing the ETL tasks nightly onto on-demand large instances (cron/boto)

ML

  • Supports distributing ML tasks (e.g. train, predict) on weekly/nightly schedules onto on-demand large instances (cron/boto)

Visualization/Output

  • Interactive reporting of data from the ETL for use by business users (Shiny)
  • Construction of a data API with endpoints for raw data, filtered/queried data, ML results, or to trigger a prediction ()

Overall

  • Supports versioning of database schemas, preferably with upgrade/downgrade capabilities a la Django/south/alembic, for relational and graph databases ()
  • Preserves secret keys safely (localsettings anti-pattern)
  • Supports seamless relational imports even when running convoluted/dependent file hierarchies as scripts, i.e. however Django does that (a lot of mumbo jumbo a la PEP 328)
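A rough sketch of the function-bound commit/rollback pattern mentioned above, using psycopg2 (connection parameters and SQL are made up):

```python
import psycopg2

def load_batch(rows):
    conn = psycopg2.connect(dbname="warehouse", user="etl")
    cur = conn.cursor()
    try:
        for row in rows:
            cur.execute("INSERT INTO events (name, value) VALUES (%s, %s)", row)
        conn.commit()
    except Exception as exc:
        conn.rollback()  # leave the table untouched if anything in the batch fails
        print("batch failed, rolled back: %s" % exc)  # stand-in for real logging
        raise
    finally:
        cur.close()
        conn.close()
```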
ajschumacher commented 8 years ago

I had another open issue called "something about Luigi" (#78) which just contained a note that an "old email thread with Caroline Alexiou is relevant".

ajschumacher commented 3 years ago

also Bazel is a DAG-based thing, specifically for builds