caldempsey / docker-notebook-spark-s3

A CI-friendly template for a local development environment featuring Spark clusters + blob storage + a notebook for prototyping data feature delivery.

DockerNotebookSparkS3

This repository provides a local experimental environment for data lakes and mock blob storage, built on PySpark and a Spark cluster. It lets you mimic blob storage locally and work with it from a Jupyter Notebook connected to a Spark cluster, closely emulating a real (if simple) production environment.
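
For example, a notebook cell along these lines connects Spark to the mock blob store over the s3a:// protocol. This is a minimal sketch, not the template's exact configuration: the endpoint, bucket name, and credentials below are illustrative placeholders, so substitute whatever your compose stack actually exposes.

```python
# Minimal sketch: point the s3a:// client at a local mock blob store.
# Endpoint, bucket, and credentials are placeholders, not this template's values.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mock-blob-smoke-test")
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")  # assumed mock S3 endpoint
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")           # placeholder credential
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")           # placeholder credential
    .config("spark.hadoop.fs.s3a.path.style.access", "true")          # most mock stores need path-style access
    .getOrCreate()
)

# Round-trip a tiny DataFrame through the mock bucket to confirm it works.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").parquet("s3a://example-bucket/smoke-test")
print(spark.read.parquet("s3a://example-bucket/smoke-test").count())
```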

This setup uses mvn to pull Spark artefacts and their transitive dependencies (Databricks Delta Lake is used as the example in this template) directly into Spark's jars directory, so Spark makes no network requests at runtime. This makes the template well suited to CI deployment of data processing pipelines and analytics in a secure or controlled setting.
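
Because the Delta Lake jars are already on Spark's classpath, a session only needs the Delta SQL extensions enabled; there is no spark.jars.packages entry and therefore no dependency download at session start. A minimal sketch, assuming a standard Delta Lake setup, with an illustrative output path:

```python
# Sketch: use Delta Lake via jars already baked into the image (no network fetch).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-offline-example")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write and read back a small Delta table to verify the jars are present.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-example")
spark.read.format("delta").load("/tmp/delta-example").show()
```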

Effortlessly dive in and unleash your data's potential, today!

Features

  - Local Spark cluster orchestrated with Docker
  - Mock blob storage emulating S3, managed from a notebook
  - Jupyter Notebook pre-wired to the Spark cluster
  - Spark dependencies (e.g. Delta Lake) baked into the jars directory via mvn, so no network access is needed at runtime
  - CI-friendly: the whole environment runs locally

Getting Started

Use make, or follow these steps to set up the environment with just:

  1. Clone this repository.
  2. Ensure Docker is installed.
  3. Install just.
  4. Run just deploy.
  5. Access Jupyter at http://localhost:8890 with token canttouchthis.
  6. Start experimenting with data lakes, mock blob storage, and PySpark notebooks (a first-cell sketch follows below).
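
As a hypothetical first cell for step 6, the following attaches the notebook to the cluster and confirms the executors respond. The master URL is an assumption; check the compose file for the actual service name and port.

```python
# Assumed first cell: attach the notebook session to the compose cluster.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")  # assumed service name and port
    .appName("getting-started")
    .getOrCreate()
)

print(spark.version)
# Push a small job through the executors to confirm the cluster is alive.
print(spark.range(1_000_000).selectExpr("sum(id)").collect())
```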

Repository Structure

Commands should be run from the root of the repository, or invoked via just.

Configuration

Customize the template for your specific requirements and use cases. Since everything is hard-coded for the moment, you will probably want to find and replace the term orgname to suit your organisation.
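
As a sketch of that find-and-replace (assuming you run it from the repository root, and with myorg standing in for your own name), a small Python script could be:

```python
# Illustrative one-off rename: rewrite the hard-coded "orgname" placeholder
# across the repository's text files. Run from the repo root; "myorg" is
# a stand-in for your own organisation name.
from pathlib import Path

OLD, NEW = "orgname", "myorg"
for path in Path(".").rglob("*"):
    if path.is_file() and ".git" not in path.parts:
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, PermissionError):
            continue  # skip binary or unreadable files
        if OLD in text:
            path.write_text(text.replace(OLD, NEW), encoding="utf-8")
            print(f"updated {path}")
```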

Happy Coding! ✨