EBISPOT / eqtl-sumstats-service

eQTL Summary Statistics Service

Develop a new data ingest / ETL pipeline for indexing eQTL data into the new mongo database #3

Open karatugo opened 1 month ago

karatugo commented 1 month ago

We need to develop a robust and scalable data ingest/ETL (Extract, Transform, Load) pipeline that reads eQTL (expression Quantitative Trait Loci) data from FTP sources, indexes it into a MongoDB database, and serves it via an API. The pipeline should extract, transform, and retrieve data efficiently to support downstream analysis and querying through a web service.

karatugo commented 1 month ago

Files to Index

Suggested MongoDB Schema

Here's a refined schema to capture the necessary details from these files:

  1. Study Information:

    • study_id: QTD000021
    • study_name: "Sample eQTL Study"
  2. Sample Information:

    • sample_id: Auto-generated or derived from context if available?
  3. eQTL Information:

    • molecular_trait_id: Corresponding trait ID.
    • molecular_trait_object_id: Object ID for the molecular trait.
    • chromosome: Chromosome number.
    • position: Position on the chromosome.
    • ref: Reference allele.
    • alt: Alternative allele.
    • variant: Variant identifier.
    • ma_samples: Minor allele sample count.
    • maf: Minor allele frequency.
    • pvalue: P-value of the association.
    • beta: Effect size.
    • se: Standard error.
    • type: Variant type (e.g., SNP).
    • aan: Additional annotation number.
    • r2: R-squared value.
    • gene_id: Gene identifier.
    • median_tpm: Median TPM (Transcripts Per Million).
    • rsid: Reference SNP ID.
  4. Permuted eQTL Information:

    • p_perm: Empirical p-value from permutations.
    • p_beta: Permutation p-value estimated from a beta-distribution approximation.
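
As a rough illustration of how the fields above might be typed when read from a TSV file, here is a minimal sketch. The caster table is an assumption inferred from the example values in this issue, and treating "" and "NA" as missing values is also an assumption; `FIELD_CASTERS`, `MISSING`, and `coerce_row` are hypothetical names.

```python
from typing import Any, Callable, Dict, Optional

# Assumed caster per column, inferred from the example values in this issue.
# Columns not listed here (e.g. chromosome, variant, rsid) stay as strings.
FIELD_CASTERS: Dict[str, Callable[[str], Any]] = {
    "position": int,
    "ma_samples": int,
    "aan": int,
    "maf": float,
    "pvalue": float,
    "beta": float,
    "se": float,
    "r2": float,
    "median_tpm": float,
    "p_perm": float,
    "p_beta": float,
}

# Assumption: these strings mark a missing value in the TSV files.
MISSING = {"", "NA"}


def coerce_row(row: Dict[str, str]) -> Dict[str, Optional[Any]]:
    """Convert raw TSV strings to typed values; missing values become None."""
    typed: Dict[str, Optional[Any]] = {}
    for name, raw in row.items():
        if raw in MISSING:
            typed[name] = None
        else:
            typed[name] = FIELD_CASTERS.get(name, str)(raw)
    return typed
```

Keeping chromosome as a string matches the example document and also accommodates non-numeric chromosomes such as X or Y.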

Example MongoDB Document Structure

{
  "study_id": "QTD000021",
  "study_name": "Sample eQTL Study",
  "samples": [
    {
      "sample_id": "sample001",
      "eqtls": [
        {
          "molecular_trait_id": "ENSG00000187583",
          "molecular_trait_object_id": "ENSG00000187583",
          "chromosome": "1",
          "position": 14464,
          "ref": "A",
          "alt": "T",
          "variant": "chr1_14464_A_T",
          "ma_samples": 41,
          "maf": 0.109948,
          "pvalue": 0.15144,
          "beta": 0.25567,
          "se": 0.17746,
          "type": "SNP",
          "aan": 42,
          "r2": 382,
          "gene_id": "ENSG00000187583",
          "median_tpm": 0.985,
          "rsid": "rs546169444",
          "permuted": {
            "p_perm": 0.000999001,
            "p_beta": 3.3243e-12
          }
        }
      ]
    }
  ]
}
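
One possible way to assemble a document of this shape from flat, already-typed rows is sketched below. It assumes each row is a dict that may carry `p_perm` and `p_beta` keys after merging with the permuted file; `build_study_document` and `PERMUTED_FIELDS` are hypothetical names, not part of any existing code in this repository.

```python
from typing import Any, Dict, Iterable, List

# Fields that are nested under "permuted" in the example document.
PERMUTED_FIELDS = ("p_perm", "p_beta")


def build_study_document(
    study_id: str,
    study_name: str,
    sample_id: str,
    eqtl_rows: Iterable[Dict[str, Any]],
) -> Dict[str, Any]:
    """Nest flat eQTL rows under a single sample of a single study."""
    eqtls: List[Dict[str, Any]] = []
    for row in eqtl_rows:
        # Copy every non-permuted field as-is.
        eqtl = {k: v for k, v in row.items() if k not in PERMUTED_FIELDS}
        # Attach the "permuted" sub-document only when values are present.
        permuted = {k: row[k] for k in PERMUTED_FIELDS if row.get(k) is not None}
        if permuted:
            eqtl["permuted"] = permuted
        eqtls.append(eqtl)
    return {
        "study_id": study_id,
        "study_name": study_name,
        "samples": [{"sample_id": sample_id, "eqtls": eqtls}],
    }
```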

Steps to Implement

  1. Extract Data:

    • Parse QTD0000*.all.tsv.gz and QTD0000*.cc.tsv.gz to extract eQTL data.
    • Parse QTD0000*.permuted.tsv.gz to extract permuted data and merge with the main eQTL data.
  2. Transform Data:

    • Normalize data fields and structure according to the MongoDB schema.
  3. Load Data:

    • Insert the structured documents into MongoDB.
    • Ensure appropriate indexes on fields such as gene_id, chromosome, position, and variant for efficient querying.
  4. API Development:

    • Develop endpoints for querying the eQTL data based on different parameters.
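
The extract and merge steps above could be sketched roughly as follows, using only the Python standard library. This is a sketch under assumptions: it assumes the files are tab-separated with a header row, and that permuted rows can be matched to eQTL rows on (`molecular_trait_id`, `variant`). The join key and the names `read_tsv_gz` and `merge_permuted` are hypothetical.

```python
import csv
import gzip
from typing import Dict, Iterable, Iterator, Tuple


def read_tsv_gz(path: str) -> Iterator[Dict[str, str]]:
    """Stream rows from a gzipped TSV file as dicts keyed by the header."""
    with gzip.open(path, mode="rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def merge_permuted(
    eqtl_rows: Iterable[Dict[str, str]],
    permuted_rows: Iterable[Dict[str, str]],
    key_fields: Tuple[str, ...] = ("molecular_trait_id", "variant"),  # assumed join key
) -> Iterator[Dict[str, str]]:
    """Attach p_perm and p_beta to each eQTL row that has a permuted match."""
    # The permuted rows are held in memory; the eQTL rows are streamed.
    lookup = {tuple(row[f] for f in key_fields): row for row in permuted_rows}
    for row in eqtl_rows:
        merged = dict(row)
        match = lookup.get(tuple(row[f] for f in key_fields))
        if match is not None:
            merged["p_perm"] = match["p_perm"]
            merged["p_beta"] = match["p_beta"]
        yield merged
```

Streaming the main file row by row avoids loading a full summary-statistics file into memory; only the permuted lookup table is kept in full.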

Indexing Strategy
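
As a starting point only, the indexes named in the Load Data step (gene_id, chromosome, position, variant) could be created as in the sketch below. It assumes the nested layout of the example document, hence the `samples.eqtls.` prefix, and accepts any object exposing a `create_index` method, such as a pymongo `Collection`. `ensure_indexes` and `INDEXED_FIELDS` are hypothetical names.

```python
from typing import Any, List

# Fields named in the "Load Data" step of this issue.
INDEXED_FIELDS = ("gene_id", "chromosome", "position", "variant")


def ensure_indexes(collection: Any, prefix: str = "samples.eqtls.") -> List[str]:
    """Create one ascending single-field index per field; return what create_index gives back."""
    names: List[str] = []
    for field in INDEXED_FIELDS:
        # 1 means ascending order (pymongo.ASCENDING has the same value).
        names.append(collection.create_index([(prefix + field, 1)]))
    return names
```

If the final schema stores one eQTL per document instead of nesting, passing `prefix=""` would index the top-level fields.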

karatugo commented 1 week ago

@karatugo Focus on Mongo indexing, deployment and API development