This repository provides a local experimental environment for data lakes and mock blob storage, built on PySpark and Spark clusters. It lets you mimic Blob Storage locally and work with it from a Jupyter Notebook connected to a Spark cluster, closely emulating a real, if simple, environment.
This setup uses `mvn` to pull artefacts and transitive dependencies for Spark (e.g. Databricks Delta Lake, used as the example in this template) directly into Spark's jars, with no network requests required from Spark itself. This makes it an effective template for CI deployment of data processing pipelines and analytics in a secure or controlled setting.
Effortlessly dive in and unleash your data's potential today!
Spark dependencies are declared in the `infra-data-lake` pom file and pulled onto the repository via the `mvn`-based `bash get_spark_deps.sh`.
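A fetch step along these lines is a minimal sketch of what such a script can do; the pom path and output directory are assumptions for illustration, not necessarily the repository's actual values:

```bash
#!/usr/bin/env bash
# Minimal sketch of an mvn-based dependency pull (paths are assumptions):
# resolve everything declared in the pom, transitive dependencies included,
# and copy the resulting jars into a local directory for Spark to load.
set -euo pipefail

mvn -f infra-data-lake/pom.xml dependency:copy-dependencies \
    -DoutputDirectory="$(pwd)/jars"
```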
Use `make`, or follow these steps to set up the environment via Just:

1. Deploy the stack: `just deploy`.
2. Open http://localhost:8890 and log in with the token `canttouchthis`.
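As a quick sanity check once the stack is up (a sketch, not part of the repository's tooling), you can probe the Jupyter endpoint from a shell using the URL and token above:

```bash
# Probe the local Jupyter server; -f makes curl fail on HTTP errors.
curl -fsS "http://localhost:8890/?token=canttouchthis" > /dev/null \
  && echo "Jupyter is up" \
  || echo "Jupyter is not reachable yet"
```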
The repository is organized as follows:

- `infra-data-lake/localhost`: Delta Lake and notebooks for local connectivity.
- `infra-mock-blob-storage`: Local mock for Blob Storage.
- `notebook-data-lake`: Contains notebooks for data exploration and analysis.

Commands should be run from the root of the repository or using Just.
Customize the template for your specific requirements and use cases. Since everything is hard-coded for the moment, you will probably want to find and replace the term `orgname` to suit your organization, as sketched below.
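One possible way to do that from the repository root (a sketch: `yourorg` is a placeholder, and BSD/macOS `sed` needs `-i ''` in place of `-i`):

```bash
# Replace every occurrence of "orgname" across the repo (GNU sed shown),
# skipping the .git directory.
grep -rl --exclude-dir=.git 'orgname' . | xargs sed -i 's/orgname/yourorg/g'
```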
Happy Coding! ✨