criteo / cluster-pack

A library on top of either pex or conda-pack to make your Python code easily available on a cluster
Apache License 2.0

cluster-pack

cluster-pack is a library on top of either pex or conda-pack to make your Python code easily available on a cluster.

Its goal is to make your prod/dev Python code and libraries easily available on any cluster. cluster-pack supports HDFS and S3 as distributed storage.

The first examples use Skein (a simple library for deploying applications on Apache YARN) and PySpark with HDFS storage. We intend to add more examples for other applications (like Dask, Ray) and S3 storage.
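As a rough illustration of the workflow described above, the sketch below packages the active Python environment and uploads it to distributed storage. It assumes the `cluster_pack.upload_env` function as used in the project's examples, and assumes it returns a pair whose first element is the uploaded package path; the wrapper name `ship_current_env` is hypothetical and only serves this illustration.

```python
from typing import Optional


def ship_current_env(package_path: Optional[str] = None) -> str:
    """Package the active Python environment and upload it to distributed storage.

    Assumption: cluster_pack.upload_env accepts an optional `package_path`
    and returns a pair whose first element is the uploaded path.
    """
    # Imported lazily so this module can be loaded on machines
    # where cluster-pack is not installed.
    import cluster_pack

    if package_path is None:
        # Let cluster-pack choose the destination.
        uploaded_path, _ = cluster_pack.upload_env()
    else:
        # Upload to an explicit location, for example on HDFS or S3.
        uploaded_path, _ = cluster_pack.upload_env(package_path=package_path)
    return uploaded_path
```

Calling `ship_current_env()` from a virtual environment or conda environment would then return the storage path that a Skein or PySpark job can reference.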

An introductory blog post can be found here.

Installation

Install with pip

$ pip install cluster-pack

Install from source

$ git clone https://github.com/criteo/cluster-pack
$ cd cluster-pack
$ pip install .

Prerequisites

cluster-pack supports Python ≥3.7.
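A quick way to confirm that an interpreter meets this requirement before installing is sketched below. The helper name `python_is_supported` is hypothetical and not part of cluster-pack.

```python
import sys

MIN_PYTHON = (3, 7)  # minimum version stated in the prerequisites


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the given version meets the stated minimum."""
    # Compare only (major, minor), which is all the prerequisite specifies.
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    if python_is_supported():
        print("This Python version is supported.")
    else:
        print("cluster-pack requires Python 3.7 or newer.")
```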

Features

Basic examples with Skein

1) Interactive mode

2) Self shipping project

Basic examples with PySpark

1) PySpark with HDFS on Yarn

2) Docker with PySpark on S3