Open karatugo opened 1 month ago
- `QTD0000*.all.tsv.gz`: Contains comprehensive eQTL data. This should be the primary source for indexing.
- `QTD0000*.cc.tsv.gz`: Contains specific eQTL data (likely condition-specific or a subset). Also useful for indexing.
- `QTD0000*.permuted.tsv.gz`: Contains permuted eQTL data for significance testing. Useful for specific analyses but not primary indexing.

Here's a refined schema to capture the necessary details from these files:
Study Information:
- `study_id`: QTD000021
- `study_name`: "Sample eQTL Study"

Sample Information:
- `sample_id`: Auto-generated or derived from context if available?

eQTL Information:
- `molecular_trait_id`: Corresponding trait ID.
- `molecular_trait_object_id`: Object ID for the molecular trait.
- `chromosome`: Chromosome number.
- `position`: Position on the chromosome.
- `ref`: Reference allele.
- `alt`: Alternative allele.
- `variant`: Variant identifier.
- `ma_samples`: Minor allele sample count.
- `maf`: Minor allele frequency.
- `pvalue`: P-value of the association.
- `beta`: Effect size.
- `se`: Standard error.
- `type`: Variant type (e.g., SNP).
- `aan`: Additional annotation number.
- `r2`: R-squared value.
- `gene_id`: Gene identifier.
- `median_tpm`: Median TPM (Transcripts Per Million).
- `rsid`: Reference SNP ID.

Permuted eQTL Information:
- `p_perm`: Permutation p-value.
- `p_beta`: Beta-approximated permutation p-value.

{
  "study_id": "QTD000021",
  "study_name": "Sample eQTL Study",
  "samples": [
    {
      "sample_id": "sample001",
      "eqtls": [
        {
          "molecular_trait_id": "ENSG00000187583",
          "molecular_trait_object_id": "ENSG00000187583",
          "chromosome": "1",
          "position": 14464,
          "ref": "A",
          "alt": "T",
          "variant": "chr1_14464_A_T",
          "ma_samples": 41,
          "maf": 0.109948,
          "pvalue": 0.15144,
          "beta": 0.25567,
          "se": 0.17746,
          "type": "SNP",
          "aan": 42,
          "r2": 382,
          "gene_id": "ENSG00000187583",
          "median_tpm": 0.985,
          "rsid": "rs546169444",
          "permuted": {
            "p_perm": 0.000999001,
            "p_beta": 3.3243e-12
          }
        }
      ]
    }
  ]
}
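As a rough sketch of how one entry of `eqtls` in the example document could be assembled, the snippet below converts a raw TSV row (all values as strings) into a typed dictionary and attaches the permutation fields when a matching row from the permuted file is available. The names `STRING_FIELDS`, `parse_value` and `row_to_eqtl` are hypothetical, and the choice of which columns stay strings is an assumption taken from the example values, not from the actual files.

```python
from typing import Any, Dict, Optional

# Assumption: these columns are kept as strings, as in the example document.
STRING_FIELDS = {
    "molecular_trait_id", "molecular_trait_object_id", "chromosome",
    "ref", "alt", "variant", "type", "gene_id", "rsid",
}


def parse_value(field: str, raw: str) -> Any:
    """Cast a raw TSV value: keep strings, otherwise try int, then float."""
    if field in STRING_FIELDS:
        return raw
    try:
        return int(raw)
    except ValueError:
        pass
    try:
        return float(raw)
    except ValueError:
        return raw  # leave unparseable values (e.g. "NA") untouched


def row_to_eqtl(row: Dict[str, str],
                permuted: Optional[Dict[str, str]] = None) -> Dict[str, Any]:
    """Build one eQTL sub-document; nest permutation results if provided."""
    doc = {field: parse_value(field, raw) for field, raw in row.items()}
    if permuted is not None:
        doc["permuted"] = {
            "p_perm": float(permuted["p_perm"]),
            "p_beta": float(permuted["p_beta"]),
        }
    return doc
```

How rows from the permuted file are matched to rows from the main file (for example by `molecular_trait_id`) is left open here, since the issue does not specify the join key.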
Extract Data:
- Use `QTD0000*.all.tsv.gz` and `QTD0000*.cc.tsv.gz` to extract eQTL data.
- Use `QTD0000*.permuted.tsv.gz` to extract permuted data and merge with the main eQTL data.

Transform Data:

Load Data:
- Index `gene_id`, `chromosome`, `position`, and `variant` for efficient querying.

API Development:
- `gene_id`
- `chromosome`
- `position`
- `variant`
- `rsid`
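A minimal sketch of the indexing and query side, under the assumption that the listed fields are top-level keys of each stored document (with the nested layout from the example, dotted paths such as `samples.eqtls.gene_id` would be needed instead). `ensure_indexes` expects a pymongo `Collection` and calls its `create_index` method; `build_filter` turns API query parameters into a MongoDB filter. Both function names and the two field lists are hypothetical.

```python
from typing import Any, Dict, List

# Fields named in this issue for indexing and for API queries.
INDEXED_FIELDS = ["gene_id", "chromosome", "position", "variant"]
QUERYABLE_FIELDS = INDEXED_FIELDS + ["rsid"]


def ensure_indexes(collection: Any) -> List[str]:
    """Create one ascending single-field index per field.

    `collection` is assumed to be a pymongo Collection; the direction 1
    has the same value as pymongo.ASCENDING.
    """
    return [collection.create_index([(field, 1)]) for field in INDEXED_FIELDS]


def build_filter(params: Dict[str, str]) -> Dict[str, Any]:
    """Translate request parameters into a MongoDB filter document.

    Unknown parameters are ignored; `position` is cast to int so it
    matches the numeric value stored in the documents.
    """
    query: Dict[str, Any] = {}
    for field in QUERYABLE_FIELDS:
        if field in params:
            value = params[field]
            query[field] = int(value) if field == "position" else value
    return query
```

The resulting filter could then be passed to `collection.find(...)` by whichever web framework ends up serving the API.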
@karatugo Focus on Mongo indexing, deployment and API development
We need to develop a robust and scalable data ingest/ETL (Extract, Transform, Load) pipeline to facilitate the reading of eQTL (expression Quantitative Trait Loci) data from FTP sources, indexing it into a MongoDB database, and serving it via an API. This pipeline will ensure efficient data extraction, transformation, and retrieval to support downstream analysis and querying through a web service.
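To make the extract step of such a pipeline more concrete, here is a small sketch that streams a gzipped TSV file row by row and groups the rows into fixed-size batches, so that large files never have to be held in memory at once. It assumes the file has already been downloaded locally and has a header line; `read_tsv_gz` and `batched` are hypothetical helper names.

```python
import csv
import gzip
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def read_tsv_gz(path: str) -> Iterator[Dict[str, str]]:
    """Yield each row of a gzipped, tab-separated file as a dict keyed by header."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def batched(rows: Iterable[Dict[str, str]], size: int) -> Iterator[List[Dict[str, str]]]:
    """Group an iterable of rows into lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each batch could then be transformed and written with a bulk call such as pymongo's `insert_many`; the batch size is a tuning choice the issue does not fix.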