ajschumacher opened 9 years ago
Harlan has been working on a thing: https://twitter.com/HarlanH/status/641026432349675521
I manage to fit a lot into ordinary Python. I have been working on documentation of this approach, but the only decent documentation I have so far is this.
Thanks @tlevine!
Spark's MLlib now supports pipelines.
Thanks @karlhigley!
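For reference, a minimal sketch of what an MLlib pipeline looks like in PySpark, using the modern SparkSession entry point; the toy DataFrame and its column names are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName('pipeline-demo').getOrCreate()

# A toy labeled-text DataFrame; the columns are assumptions for the demo
training = spark.createDataFrame(
    [('spark is great', 1.0), ('hadoop is slow', 0.0)],
    ['text', 'label'])

tokenizer = Tokenizer(inputCol='text', outputCol='words')
hashing_tf = HashingTF(inputCol='words', outputCol='features')
lr = LogisticRegression(maxIter=10)

# fit() runs the stages in order and returns a reusable PipelineModel
model = Pipeline(stages=[tokenizer, hashing_tf, lr]).fit(training)
model.transform(training).select('text', 'prediction').show()
```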
More stuff I found:
- Joblib, which I've used before for parallel processing (a nice wrapper of the standard multiprocessing library), but which can actually also manage checkpointing of large computational pipelines with an easy decorator (see the sketch just after this list).
- Spyre, a nascent data application framework built on CherryPy and Jinja2, with convenience classes for data wrangling and data visualization.
- CubicWeb, a "semantic web framework" that builds from the data model upwards. Useful tools for building, observing, and updating RDBMS schemas out of the box.
- Cubes, a framework to describe data models and auto-build APIs onto them.
- Dispel4py, a framework for abstractly defining distributed data workflows, with supported backends such as Apache Storm. Also, the release paper.
- Luigi, a Python library for job pipelining that comes with a web management console to track tasks.
- alembic and south, for relational database versioning.
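As promised above, a minimal sketch of that Joblib checkpointing decorator; the cache directory and the toy function are made up for illustration:

```python
from joblib import Memory

# Results are cached on disk, so re-running a pipeline skips
# any step whose inputs haven't changed.
memory = Memory('/tmp/pipeline_cache', verbose=0)

@memory.cache
def expensive_transform(raw):
    # stand-in for a slow computation
    return [x * 2 for x in raw]

result = expensive_transform([1, 2, 3])  # computed and written to cache
result = expensive_transform([1, 2, 3])  # loaded straight from the cache
```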
Awesome! Thanks @lauralorenz!
More thoughtful thoughts:
I've separated out our list of concerns, with our current solution in parentheses. I'd love suggestions for 1) some mega-framework this all fits into nicely, 2) suggestions at a more modular level, e.g. ETL, ML, visualization, or 3) suggestions of better libraries/tools to use in the pipeline we've grown with to date. And to clarify my meaning of "large": I mean something that was once a collection of scripts, grew organically/without structure, and is now too big to handle. So not large in the sense of needing distributed infrastructure; so far we've been able to cope by just going up hardware classes.

ETL
- Performs data ingestion and wrangling from diverse APIs into a data store (pandas, regular ol' python)
- Supports smart rollback and reporting when data is corrupted (some function-bound commit/rollback with psycopg and try/excepts, 'logging' with print statements, not that wide-reaching or easy to trace back; see the sketch just after this list)
- Supports distributing the ETL tasks nightly onto on-demand large instances (cron/boto)

ML
- Supports distributing ML tasks (e.g. train, predict) on weekly/nightly schedules onto on-demand large instances (cron/boto)

Visualization/Output
- Interactive reporting of data from the ETL for use by business users (Shiny)
- Construction of a data API with endpoints for raw data, filtered/queried data, ML results, or to trigger a prediction ()

Overall
- Supports versioning of database schemas, preferably with upgrade/downgrade capabilities a la Django/south/alembic, for relational and graph databases ()
- Preserves secret keys safely (localsettings anti-pattern)
- Supports seamless relative imports even when running convoluted/dependent file hierarchies as scripts, i.e. however Django does that (a lot of mumbo jumbo a la PEP 328)
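As referenced in the rollback bullet above, a minimal sketch of the function-bound commit/rollback pattern with psycopg2; the DSN, table, and column names are hypothetical:

```python
import psycopg2

def load_batch(rows, dsn='dbname=etl'):
    # The whole batch commits atomically; any corrupt row rolls
    # everything back so the store is never left half-written.
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            for row in rows:
                cur.execute(
                    'INSERT INTO observations (source, value) VALUES (%s, %s)',
                    (row['source'], row['value']))
        conn.commit()
    except Exception:
        conn.rollback()
        raise  # let real logging, rather than print statements, capture it
    finally:
        conn.close()
```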
I had another open issue called "something about Luigi" (#78) which just contained a note that an "old email thread with Caroline Alexiou is relevant".
also Bazel is a DAG-based thing, specifically for builds
Collecting items from a slack with @lauralorenz, etc.:
Framing question(s):
Follow-up question:
A tour through several different angles on this:
sklearn has its pipeline stuff, but it's just for sklearn-compliant parts, and naturally just in Python: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
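As a quick illustration of chaining sklearn-compliant parts, a minimal sketch on the bundled iris data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is an sklearn-compliant transformer/estimator; the
# chained object exposes the same fit/predict interface itself.
iris = load_iris()
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression()),
])
pipe.fit(iris.data, iris.target)
print(pipe.predict(iris.data[:5]))
```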
make: the classic (as espoused by Kaggle, even!) http://blog.kaggle.com/2012/10/15/make-for-data-scientists/ https://www.gnu.org/software/make/
drake: kind of neat, and people at Factual get to play with Clojure!
http://bubbles.databrewery.org/ Bubbles is a Python framework for data processing and data quality measurement. The basic concepts are abstract data objects, operations, and dynamic operation dispatch.
http://keystone-ml.org/ "KeystoneML is software from the UC Berkeley AMPLab designed to simplify the construction of large-scale end-to-end machine learning pipelines."
https://github.com/spotify/luigi "Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in."
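For a flavor of Luigi's dependency resolution, a minimal sketch with two hypothetical tasks; the output targets act as checkpoints that Luigi uses to decide what still needs to run:

```python
import luigi

class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget('raw.txt')

    def run(self):
        with self.output().open('w') as f:
            f.write('1\n2\n3\n')

class Transform(luigi.Task):
    def requires(self):
        return Extract()  # Luigi runs this dependency first if needed

    def output(self):
        return luigi.LocalTarget('doubled.txt')

    def run(self):
        with self.input().open() as raw, self.output().open('w') as out:
            for line in raw:
                out.write(str(int(line) * 2) + '\n')

if __name__ == '__main__':
    luigi.run()  # e.g. python pipeline.py Transform --local-scheduler
```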
https://github.com/airbnb/airflow Airflow is a system to programmatically author, schedule, and monitor data pipelines.
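A minimal sketch of an Airflow DAG, assuming the Airflow 1.x import paths; the DAG id, task ids, and commands are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Tasks are declared in Python and wired into a DAG; the scheduler
# runs them on the given interval and the web UI tracks task state.
dag = DAG('nightly_etl', start_date=datetime(2015, 1, 1),
          schedule_interval='@daily')

extract = BashOperator(task_id='extract', bash_command='echo extracting',
                       dag=dag)
load = BashOperator(task_id='load', bash_command='echo loading', dag=dag)

load.set_upstream(extract)  # load runs only after extract succeeds
```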
https://github.com/pinterest/pinball Pinball is a scalable workflow manager
http://oozie.apache.org/ Oozie; you have to specify stuff with a bunch of XML (Steph presented on this)
fluentd / logstash / kafka are sort of related in that they are about routing messages around and making sure those "pipelines" work without fail. Add on elasticsearch/kibana for even more data do-stuffery.
http://www.treasuredata.com/ PAAS: Probably As A Service - you can use this thing to set up fluentd stuff?
https://civisanalytics.com/products/platform/ the Civis Data Science Platform https://www.youtube.com/watch?v=nMalICUv1UM a layer of fancy on top of redshift?
http://www.blendo.co/ "Create one Single Source of Truth for your data." and host your data in el cloudo
something on versioning databases http://enterprisecraftsmanship.com/2015/08/10/database-versioning-best-practices/
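On that note, a minimal Alembic migration sketch; the revision ids, table, and column are hypothetical, but the paired upgrade()/downgrade() functions are what provide the rollback capability mentioned in the concerns above:

```python
# versions/1a2b3c4d5e6f_add_score.py (a hypothetical migration file)
from alembic import op
import sqlalchemy as sa

revision = '1a2b3c4d5e6f'  # hypothetical revision ids
down_revision = None

def upgrade():
    # applied by `alembic upgrade head`
    op.add_column('predictions',
                  sa.Column('score', sa.Float(), nullable=True))

def downgrade():
    # applied by `alembic downgrade -1`
    op.drop_column('predictions', 'score')
```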
What am I missing? What are the best and coolest tools?