eQTL Summary Statistics Service

Overview

This project provides an ETL (Extract, Transform, Load) pipeline built with Apache Spark that processes data files, extracts the relevant information, and loads it into a MongoDB database. The pipeline runs in a Dockerized environment made up of several services, including MongoDB and Spark.

Project Structure

.
├── spark
│   ├── __init__.py
│   ├── Dockerfile
│   ├── log4j.properties
│   └── spark_app.py
├── utils
│   ├── __init__.py
│   ├── constants.py
│   ├── requirements.txt
│   └── utils.py
├── .gitignore
├── docker-compose.yml
├── format-lint
├── pytype.cfg
├── README.md
└── requirements.dev.txt

Key Files and Directories

  spark/spark_app.py: the Spark application that runs the ETL pipeline.
  spark/Dockerfile: builds the custom Spark application image.
  utils/: shared helpers and constants used by the Spark application.
  docker-compose.yml: defines the MongoDB, Spark Master, Spark Worker, and Spark Application services.
  format-lint: script that runs the formatting and linting checks.
  requirements.dev.txt: Python packages needed for development (formatting and linting).

Getting Started

Prerequisites

  Docker and Docker Compose installed on your machine.

Setup

  1. Clone this repository:

    git clone https://github.com/EBISPOT/eqtl-sumstats-service.git
    cd eqtl-sumstats-service
  2. Build and start the Docker containers:

    docker-compose build
    docker-compose up

    This will pull the necessary Docker images, build the custom Spark application image, and start the services (MongoDB, Spark Master, Spark Worker, Spark Application).

Running the ETL Pipeline

The ETL pipeline is triggered automatically when the Spark application container starts. The spark_app.py script performs the following tasks (a rough sketch follows the list):

  1. Download: Fetches data files from a remote FTP server.
  2. Process: Parses and transforms the data using Spark.
  3. Load: Writes the processed data into a MongoDB collection.
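
For illustration only, here is a minimal PySpark sketch of such a pipeline. The FTP host, file paths, transformation, database, and collection names are hypothetical, not the values actually used by spark_app.py, and the write step assumes the MongoDB Spark Connector (v10+) is available to Spark:

    import ftplib

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hypothetical endpoints -- the real values live in spark_app.py and utils/constants.py.
    FTP_HOST = "ftp.example.org"
    REMOTE_FILE = "/pub/databases/example/sumstats.tsv"
    LOCAL_FILE = "/tmp/sumstats.tsv"

    # 1. Download: fetch the data file from the remote FTP server.
    with ftplib.FTP(FTP_HOST) as ftp:
        ftp.login()  # anonymous login
        with open(LOCAL_FILE, "wb") as fh:
            ftp.retrbinary(f"RETR {REMOTE_FILE}", fh.write)

    # 2. Process: parse and transform the data with Spark.
    spark = (
        SparkSession.builder.appName("eqtl-sumstats-etl")
        .config("spark.mongodb.write.connection.uri", "mongodb://mongodb:27017")
        .config("spark.mongodb.write.database", "eqtl")        # hypothetical database
        .config("spark.mongodb.write.collection", "sumstats")  # hypothetical collection
        .getOrCreate()
    )
    df = spark.read.csv(LOCAL_FILE, sep="\t", header=True, inferSchema=True)
    df = df.withColumn("ingested_at", F.current_timestamp())  # example transformation

    # 3. Load: write the processed data into a MongoDB collection.
    df.write.format("mongodb").mode("append").save()

Using mongodb as the hostname assumes the MongoDB container is reachable under its Docker Compose service name; adjust it to whatever docker-compose.yml defines.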

Configuration

Development

For local development it is a good idea to limit DataFrames to 10 rows in Spark; otherwise processing the full data can be a problem on your machine. Search the code for DEV to find these points.
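
As a rough sketch (the flag name is hypothetical), such a DEV point can look like this:

    # DEV: cap the DataFrame at 10 rows so local runs stay small and fast.
    if dev_mode:  # hypothetical flag; search the code for DEV markers
        df = df.limit(10)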

Linting and Formatting

Create a virtual environment for formatting and linting, and install the required Python packages into it:

pip install -r requirements.dev.txt

Then run the script:

./format-lint