This project provides an ETL (Extract, Transform, Load) pipeline built with Apache Spark that processes data files, extracts the relevant information, and loads it into a MongoDB database. The pipeline runs in a Dockerized environment composed of multiple services, including MongoDB and Spark.
.
├── spark
│ ├── __init__.py
│ ├── Dockerfile
│ ├── log4j.properties
│ ├── spark_app.py
├── utils
│ ├── __init__.py
│ ├── constants.py
│ ├── requirements.txt
│ └── utils.py
├── .gitignore
├── docker-compose.yml
├── format-lint
├── pytype.cfg
├── README.md
└── requirements.dev.txt
Clone this repository:
git clone https://github.com/EBISPOT/eqtl-sumstats-service.git
cd eqtl-sumstats-service
Build and start the Docker containers:
docker-compose build
docker-compose up
This will pull the necessary Docker images, build the custom Spark application image, and start the services (MongoDB, Spark Master, Spark Worker, Spark Application).
The ETL pipeline is automatically triggered when the Spark application container starts. The spark_app.py script reads the input data files, extracts and transforms the relevant information, and loads the results into the MongoDB database. The constants the script relies on are defined in constants.py.
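For orientation, the sketch below shows the general shape of such an Extract-Transform-Load flow in PySpark. It is not the project's actual implementation: the file path, separator, transformation step, database and collection names, and the use of the MongoDB Spark Connector (format "mongodb") are assumptions for illustration; the real values live in constants.py and spark_app.py.

```python
from pyspark.sql import SparkSession

# Build a session; the connection URI is a placeholder for the MongoDB
# service reachable on the Docker network.
spark = (
    SparkSession.builder
    .appName("eqtl-sumstats-etl")
    .config("spark.mongodb.write.connection.uri", "mongodb://mongodb:27017")
    .getOrCreate()
)

# Extract: read raw input files into a DataFrame
# (path, separator, and schema inference are illustrative).
df = spark.read.csv("/data/input/*.tsv", sep="\t", header=True, inferSchema=True)

# Transform: placeholder clean-up step standing in for the real logic.
df = df.dropna()

# Load: append into a MongoDB collection (database/collection names are assumed).
(
    df.write.format("mongodb")
    .mode("append")
    .option("database", "eqtl")
    .option("collection", "sumstats")
    .save()
)
```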
For local development, it might be a good idea to limit DataFrames to 10 rows in Spark; otherwise, processing the full data can be a problem on a local machine. Search the code for DEV to find the points where this applies, as illustrated below.
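A minimal sketch of that pattern follows; the DEV flag, the environment variable it reads, and the helper name are assumptions for illustration, not the project's actual mechanism.

```python
import os

from pyspark.sql import DataFrame

# Hypothetical development flag; the project marks such points with "DEV".
DEV = os.environ.get("DEV", "false").lower() == "true"


def maybe_limit(df: DataFrame) -> DataFrame:
    """Return a 10-row sample during local development, the full DataFrame otherwise."""
    return df.limit(10) if DEV else df
```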
Create a virtual environment for format & lint and install the required Python packages into it:
pip install -r requirements.dev.txt
Then run the script as follows:
./format-lint