Develop a new data ingest / ETL pipeline for indexing eQTL data into the new mongo database

karatugo commented 1 month ago

We need a robust, scalable data ingest/ETL (Extract, Transform, Load) pipeline that reads eQTL (expression Quantitative Trait Loci) data from FTP sources, indexes it into a MongoDB database, and serves it via an API. The pipeline should make extraction, transformation, and retrieval efficient enough to support downstream analysis and querying through a web service.

[x] Scalable data ingest/ETL pipeline
[x] Read from FTP sources
[x] Ingest with the correct schema
[x] Save to MongoDB
[ ] Index MongoDB (is index creation automatic? Discuss with the DBA team)
[ ] Deploy to Sandbox
[ ] Deploy to Prod
[ ] Implement API

karatugo commented 1 month ago

Files to Index

QTD0000*.all.tsv.gz: Contains comprehensive eQTL data. This should be the primary source for indexing.
QTD0000*.cc.tsv.gz: Contains specific eQTL data (likely condition-specific or subset). Also useful for indexing.
QTD0000*.permuted.tsv.gz: Contains permuted eQTL data for significance testing. Useful for specific analyses but not primary indexing.
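
A minimal sketch of how the pipeline could tell these three file types apart by name. The `classify_file` helper and its regular expression are illustrative assumptions, not existing code in this service; the study ID is assumed to be `QTD` followed by digits.

```python
import re

# Hypothetical pattern: study ID, then one of the three file kinds listed above.
_FILE_PATTERN = re.compile(r"^(QTD\d+)\.(all|cc|permuted)\.tsv\.gz$")


def classify_file(filename):
    """Return (study_id, kind) for a recognised file name, else None."""
    match = _FILE_PATTERN.match(filename)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

The `kind` value could then decide whether a file goes through the primary indexing path (`all`, `cc`) or is only merged in as permutation results (`permuted`).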

Suggested MongoDB Schema

Here's a refined schema to capture the necessary details from these files:

Study Information:
- study_id: QTD000021
- study_name: "Sample eQTL Study"
Sample Information:
- sample_id: Auto-generated or derived from context if available?
eQTL Information:
- molecular_trait_id: Corresponding trait ID.
- molecular_trait_object_id: Object ID for the molecular trait.
- chromosome: Chromosome number.
- position: Position on the chromosome.
- ref: Reference allele.
- alt: Alternative allele.
- variant: Variant identifier.
- ma_samples: Minor allele sample count.
- maf: Minor allele frequency.
- pvalue: P-value of the association.
- beta: Effect size.
- se: Standard error.
- type: Variant type (e.g., SNP).
- aan: Additional annotation number.
- r2: R-squared value.
- gene_id: Gene identifier.
- median_tpm: Median TPM (Transcripts Per Million).
- rsid: Reference SNP ID.
Permuted eQTL Information:
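- p_perm: Empirical p-value from the permutation test.
- p_beta: Permutation p-value from the beta-distribution approximation (a p-value, not an effect size).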

Example MongoDB Document Structure

{
  "study_id": "QTD000021",
  "study_name": "Sample eQTL Study",
  "samples": [
    {
      "sample_id": "sample001",
      "eqtls": [
        {
          "molecular_trait_id": "ENSG00000187583",
          "molecular_trait_object_id": "ENSG00000187583",
          "chromosome": "1",
          "position": 14464,
          "ref": "A",
          "alt": "T",
          "variant": "chr1_14464_A_T",
          "ma_samples": 41,
          "maf": 0.109948,
          "pvalue": 0.15144,
          "beta": 0.25567,
          "se": 0.17746,
          "type": "SNP",
          "aan": 42,
          "r2": 382,
          "gene_id": "ENSG00000187583",
          "median_tpm": 0.985,
          "rsid": "rs546169444",
          "permuted": {
            "p_perm": 0.000999001,
            "p_beta": 3.3243e-12
          }
        }
      ]
    }
  ]
}
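
To get typed values like the ones in the example document, each TSV row (all strings) needs converting before it is stored. The sketch below is one way to do that. The field-to-type mapping is inferred from the example above, treating `NA` and empty strings as missing values is an assumption about the input files, and `normalize_row` is a hypothetical helper name.

```python
# Types inferred from the example document; adjust if the real files differ.
INT_FIELDS = {"position", "ma_samples", "aan"}
FLOAT_FIELDS = {"maf", "pvalue", "beta", "se", "r2", "median_tpm"}
MISSING_VALUES = {"", "NA"}  # assumption: how the TSV files mark missing data


def normalize_row(row):
    """Convert a row of strings into a dict with ints, floats, and None."""
    document = {}
    for key, value in row.items():
        if value in MISSING_VALUES:
            document[key] = None
        elif key in INT_FIELDS:
            document[key] = int(value)
        elif key in FLOAT_FIELDS:
            document[key] = float(value)
        else:
            document[key] = value  # identifiers, alleles, chromosome stay strings
    return document
```

Keeping `chromosome` as a string matches the example document and avoids special-casing values such as `X`.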

Steps to Implement

Extract Data:
- Parse QTD0000*.all.tsv.gz and QTD0000*.cc.tsv.gz to extract eQTL data.
- Parse QTD0000*.permuted.tsv.gz to extract permuted data and merge with the main eQTL data.
Transform Data:
- Normalize data fields and structure according to the MongoDB schema.
Load Data:
- Insert the structured documents into MongoDB.
- Ensure appropriate indexes on fields such as gene_id, chromosome, position, and variant for efficient querying.
API Development:
- Develop endpoints for querying the eQTL data based on different parameters.
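
The extract, merge, and load steps above could look roughly like the sketch below. It is a sketch under stated assumptions: the permuted rows are joined on `molecular_trait_id` (the real join key may need to include the variant), the helper names are hypothetical, and `collection` is expected to be a PyMongo collection (only its `insert_many` method is used).

```python
import csv
import gzip
from itertools import islice


def read_tsv_gz(path):
    """Yield each row of a gzipped, tab-separated file as a dict keyed by header."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def merge_permuted(eqtl_rows, permuted_rows, key="molecular_trait_id"):
    """Attach p_perm and p_beta under a 'permuted' key (join key is an assumption)."""
    lookup = {
        row[key]: {"p_perm": row["p_perm"], "p_beta": row["p_beta"]}
        for row in permuted_rows
    }
    for row in eqtl_rows:
        merged = dict(row)
        if row[key] in lookup:
            merged["permuted"] = lookup[row[key]]
        yield merged


def insert_in_batches(collection, documents, batch_size=1000):
    """Insert documents in fixed-size batches and return how many were sent."""
    iterator = iter(documents)
    total = 0
    while batch := list(islice(iterator, batch_size)):
        collection.insert_many(batch, ordered=False)
        total += len(batch)
    return total
```

Streaming rows through generators keeps memory use flat for large files; only the permuted lookup table is held in memory.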

Indexing Strategy

Create indexes on key fields for efficient retrieval:
- gene_id
- chromosome
- position
- variant
- rsid
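
On the open question of whether indexing is automatic: MongoDB only creates the `_id` index by default, so the fields above need explicit indexes. Below is a minimal sketch assuming each association is stored as its own flat document; with the nested structure from the example, the keys would instead be dotted paths such as `samples.eqtls.gene_id`. `ensure_indexes` and `build_filter` are hypothetical helpers, and `collection` is expected to be a PyMongo collection.

```python
# Fields from the list above; each gets a single-field ascending index.
INDEX_FIELDS = ("gene_id", "chromosome", "position", "variant", "rsid")


def ensure_indexes(collection, fields=INDEX_FIELDS):
    """Create one ascending index per field and return the index names."""
    # In PyMongo, create_index is a no-op if an identical index already exists.
    return [collection.create_index([(field, 1)]) for field in fields]


def build_filter(params):
    """Turn API query parameters into a MongoDB filter on indexed fields only."""
    unknown = set(params) - set(INDEX_FIELDS)
    if unknown:
        raise ValueError(f"unsupported query fields: {sorted(unknown)}")
    query = dict(params)
    if "position" in query:
        query["position"] = int(query["position"])  # stored as an integer
    return query
```

Restricting API filters to the indexed fields is one way to keep endpoint queries from falling back to full collection scans; the resulting dict could be passed straight to `collection.find(...)`.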

karatugo commented 1 week ago

@karatugo Focus on Mongo indexing, deployment and API development

EBISPOT / eqtl-sumstats-service