Major refactoring of the eQTL Catalogue API

kauralasoo commented 2 years ago

Problem statement

Problem: The current API is designed around cross-dataset queries. As a result, all of the data has to be re-indexed every time a new dataset is added. This is not going to scale as we keep adding new datasets to the catalogue.

Proposed solution: Refactor the API to only support two types of queries:

Metadata about datasets
Summary statistics from a single dataset

Queries

1. Metadata about datasets

Proposed endpoint name: /eqtl/api/datasets or /eqtl/api/metadata

Fields

(Based on this existing file: https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/master/tabix/tabix_ftp_paths.tsv)

dataset_id - unique id assigned to each combination of study_id, sample_group (previously qtl_group) and quant_method. Each unique id will correspond to a single HDF5 file. Proposed format: QTD000001.
study_id - unique id assigned to each study. Proposed format QTS000001.
study_name - human-readable name of the study, e.g. GTEx or Lepik_2017 (previously study).
sample_group - name for the subgroup of samples used for QTL mapping (previously qtl_group).
sample_size - number of samples in the sample_group (integer).
tissue_id - mapped tissue ontology id (previously tissue_ontology_id).
tissue_label - short, human-readable tissue ontology label
condition_label - short, human readable condition label
quant_method - quantification method
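A minimal sketch of how the proposed id formats could be generated and checked. It assumes the numeric part is always zero-padded to six digits, which is inferred only from the QTD000001 and QTS000001 examples; the function names are hypothetical.

```python
import re

# Assumption: ids are a three-letter prefix followed by exactly six digits,
# as suggested by the examples QTD000001 (dataset) and QTS000001 (study).
DATASET_ID = re.compile(r"QTD\d{6}")
STUDY_ID = re.compile(r"QTS\d{6}")


def make_dataset_id(number: int) -> str:
    """Format a running number as a dataset id, e.g. 1 -> 'QTD000001'."""
    return f"QTD{number:06d}"


def make_study_id(number: int) -> str:
    """Format a running number as a study id, e.g. 1 -> 'QTS000001'."""
    return f"QTS{number:06d}"


def is_dataset_id(text: str) -> bool:
    """True if the whole string looks like a dataset id."""
    return DATASET_ID.fullmatch(text) is not None


def is_study_id(text: str) -> bool:
    """True if the whole string looks like a study id."""
    return STUDY_ID.fullmatch(text) is not None
```

A check like is_dataset_id could be used to reject malformed ids before any file lookup happens.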

Should allow filtering based on dataset_id, study_id, study_name, tissue_id, tissue_label, condition_label and quant_method. This table is not likely to exceed a few thousand rows.

Example queries:

Return all datasets where tissue_label == blood and quant_method == ge.
Return all datasets from the BLUEPRINT study (study_name == BLUEPRINT).
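Since the table is small, exact-match filtering could be done in memory. A rough sketch of the two example queries, assuming the metadata has been loaded as a list of dicts; the rows below are made up for illustration and are not real catalogue entries, and filter_datasets is a hypothetical helper.

```python
from typing import Dict, List

# Illustrative rows only: the ids, tissues and quant methods are invented.
DATASETS: List[Dict[str, str]] = [
    {"dataset_id": "QTD000001", "study_id": "QTS000001",
     "study_name": "BLUEPRINT", "tissue_label": "blood", "quant_method": "ge"},
    {"dataset_id": "QTD000002", "study_id": "QTS000001",
     "study_name": "BLUEPRINT", "tissue_label": "blood", "quant_method": "exon"},
    {"dataset_id": "QTD000003", "study_id": "QTS000002",
     "study_name": "GTEx", "tissue_label": "blood", "quant_method": "ge"},
    {"dataset_id": "QTD000004", "study_id": "QTS000002",
     "study_name": "GTEx", "tissue_label": "liver", "quant_method": "ge"},
]


def filter_datasets(rows: List[Dict[str, str]], **filters: str) -> List[Dict[str, str]]:
    """Return the rows whose fields equal every supplied filter value."""
    return [
        row for row in rows
        if all(row.get(field) == value for field, value in filters.items())
    ]


# tissue_label == blood and quant_method == ge
blood_ge = filter_datasets(DATASETS, tissue_label="blood", quant_method="ge")

# study_name == BLUEPRINT
blueprint = filter_datasets(DATASETS, study_name="BLUEPRINT")
```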

Todo:

Kaur to generate a new metadata file and assign unique ids to each dataset and study.
....

2. Summary statistics from a single dataset

Proposed endpoint name: /eqtl/api/associations

Queries can only be made by specifying a dataset_id, e.g. /eqtl/api/associations/QTD000001, which maps to a unique HDF5 file.

Returned fields: All fields available from the HDF5 file (https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/master/tabix/Columns.md) + neg_log10_pvalue (calculated on the fly).
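Computing neg_log10_pvalue on the fly is a one-line transformation of the stored p-value. A small sketch, assuming the stored p-value lies in [0, 1]; returning infinity for a p-value of exactly zero is a choice made here for illustration, not something decided in this thread.

```python
import math


def neg_log10_pvalue(pvalue: float) -> float:
    """Return -log10(p) for a p-value in [0, 1]."""
    if not 0.0 <= pvalue <= 1.0:
        raise ValueError(f"p-value out of range: {pvalue}")
    if pvalue == 0.0:
        # log10(0) is undefined; report the association as infinitely significant.
        return math.inf
    return -math.log10(pvalue)
```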

Probably no need to add additional dataset metadata fields (dataset_id, study_id, quant_method, etc.)?

Filtering similar to what the current API allows.

Exact match: variant, rsid, molecular_trait_id, gene_id, chromosome

Range queries: p_lower and p_upper, bp_lower and bp_upper - currently this is done on HDF5 files split by chromosome. Can we do it on a single dataset-specific HDF5 file where all chromosomes are present? This only makes sense if chromosome is already supplied. Do we need to build a joint index in HDF5 across chromosomes and positions (as in tabix)?
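To illustrate what a joint index across chromosomes and positions buys, here is a toy sketch in plain Python. It does not use HDF5 at all; it only shows that once rows are sorted by the composite key (chromosome, position), a bp range on one chromosome becomes a single contiguous slice found by binary search. The records and the range_query helper are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

Record = Tuple[str, int, str]  # (chromosome, position, payload)

# Made-up records from several chromosomes stored together in one "file".
records: List[Record] = sorted([
    ("2", 300, "c"),
    ("1", 100, "a"),
    ("2", 150, "b"),
    ("1", 500, "d"),
    ("2", 900, "e"),
])

# The joint index: one sorted list of (chromosome, position) keys.
keys: List[Tuple[str, int]] = [(chrom, pos) for chrom, pos, _ in records]


def range_query(chrom: str, bp_lower: int, bp_upper: int) -> List[Record]:
    """Return records on `chrom` with bp_lower <= position <= bp_upper."""
    lo = bisect_left(keys, (chrom, bp_lower))
    hi = bisect_right(keys, (chrom, bp_upper))
    return records[lo:hi]
```

Without the chromosome in the key, a bp range alone would match scattered rows from every chromosome, which is why the range filter only makes sense once a chromosome is supplied.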

jdhayhurst commented 2 years ago

I think this is a good approach.

Just to expand on the metadata endpoint (1) I think /eqtl/api/v2/datasets should return a collection of (filterable) datasets.

Example response object:

{
'datasets': [
    {
        'dataset_id': 'QTD000001',
        'study_id': 'QTS000001',
        ...
    },
    {
        'dataset_id': 'QTD000002',
        'study_id': 'QTS000002',
        ...
    }
]
}

Then there should be specific resources represented by something like this /eqtl/api/v2/datasets/QTD000001 which give more detailed responses with links:

{
    'dataset_id': 'QTD000001',
    'study_id': 'QTS000001',
    ...
    '_links': {
       'associations': {
            'href': '/eqtl/api/v2/datasets/QTD000001/associations'
            }
        }
}

If we are limiting the associations access to single datasets and never across datasets, I think the associations endpoint (2) is conceptually a sub-resource of the dataset (the dataset is the parent of its metadata and association data): /eqtl/api/v2/datasets/QTD000001/associations. Happy to hear arguments against that.

Response from /eqtl/api/v2/datasets/QTD000001/associations would be a paginated response of all associations for this dataset, as before but without the metadata - metadata could be given in a link (/eqtl/api/v2/datasets/QTD000001).
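A rough sketch of what one page of that response could look like, with the metadata replaced by a link back to the parent dataset. The start and size parameter names, the self/next link names and the row contents are assumptions made for illustration; only the URL layout comes from the proposal above.

```python
from typing import Any, Dict, List


def associations_page(dataset_id: str, rows: List[Dict[str, Any]],
                      start: int = 0, size: int = 2) -> Dict[str, Any]:
    """Build one page of associations plus links to the dataset and next page."""
    base = f"/eqtl/api/v2/datasets/{dataset_id}"
    links: Dict[str, Dict[str, str]] = {
        # Metadata is not repeated per row; it is reachable via this link.
        "dataset": {"href": base},
        "self": {"href": f"{base}/associations?start={start}&size={size}"},
    }
    next_start = start + size
    if next_start < len(rows):
        links["next"] = {
            "href": f"{base}/associations?start={next_start}&size={size}"
        }
    return {"associations": rows[start:next_start], "_links": links}


# Made-up rows standing in for association records.
rows = [{"variant": "v1"}, {"variant": "v2"}, {"variant": "v3"}]
first = associations_page("QTD000001", rows)
last = associations_page("QTD000001", rows, start=2)
```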

Would anyone want to filter on chromosome without a bp limit? It should be feasible to combine all chromosomes and make a joint index on chromosome and position and perhaps do the positional filtering in a chr:pos-pos style?

kauralasoo commented 2 years ago

I really like the idea of having associations as a sub-resource of datasets (/eqtl/api/v2/datasets/QTD000001/associations). Supporting only dataset-specific queries also makes it straightforward to add in other types of QTLs (e.g. protein QTLs) without changing the API. We could just represent them as datasets with different quantification methods (i.e. SomaLogic or Olink).

I think most users would want to filter on chr:pos-pos. Filtering just on the chromosome would probably return too many results to be useful. That being said, one option would be to mirror the behaviour of bcftools, which has a single -r parameter taking either chr, chr:pos, chr:beg- or chr:beg-end: https://samtools.github.io/bcftools/bcftools.html#common_options

I would then name this as a region filter.
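A minimal sketch of parsing such a region filter, accepting chr, chr:pos, chr:beg- (open-ended) and chr:beg-end. This is a hand-written parser for illustration, not bcftools code; Region and parse_region are hypothetical names, and thousands separators in positions are not handled.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    chrom: str
    start: Optional[int]  # None means "from the beginning of the chromosome"
    end: Optional[int]    # None means "to the end of the chromosome"


def parse_region(text: str) -> Region:
    """Parse 'chr', 'chr:pos', 'chr:beg-' or 'chr:beg-end' into a Region."""
    chrom, colon, span = text.partition(":")
    if not chrom:
        raise ValueError(f"missing chromosome in region: {text!r}")
    if not colon:
        return Region(chrom, None, None)  # whole chromosome
    beg, dash, end = span.partition("-")
    start = int(beg)  # raises ValueError on an empty or non-numeric start
    if not dash:
        return Region(chrom, start, start)  # single position
    stop = int(end) if end else None
    if stop is not None and stop < start:
        raise ValueError(f"region end before start: {text!r}")
    return Region(chrom, start, stop)
```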

What are the next steps? Me and Nurlan are just finishing computing summary statistics for release 6 but we have not generated the HDF5 files yet. I think it would make sense to add in the chromosome and pos joint index before we start running the conversions.

kauralasoo commented 1 year ago

I have now completed dataset and study id assignment for the existing 127 datasets: https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/master/data_tables/dataset_id_map.tsv

UPDATE 14/11/22: The latest metadata file is located here: https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/master/data_tables/dataset_metadata.tsv The previous file now points to a simple id mapping file.

jdhayhurst commented 1 year ago

design

/eqtl/api/v2/datasets endpoint returns datasets (ids) with metadata
/eqtl/api/v2/datasets/<qtd> returns metadata for a dataset id
/eqtl/api/v2/datasets/<qtd>/associations returns associations for dataset id
- params:
- pos=<chr>:<start>-<end>
- nlog10p=<lower_threshold>
- rsid=<rsID>
- variant_id=<chr_pos_ref_alt>
- molecular_trait_id=<molecular_trait_id>
- gene_id=<gene_id>
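For illustration, a client could assemble a request against this design roughly as follows. The path and parameter names are taken from the list above; the parameter values are made up, and associations_url is a hypothetical helper, not part of any existing client.

```python
from typing import Optional
from urllib.parse import urlencode


def associations_url(dataset_id: str,
                     pos: Optional[str] = None,
                     nlog10p: Optional[float] = None,
                     rsid: Optional[str] = None,
                     variant_id: Optional[str] = None,
                     molecular_trait_id: Optional[str] = None,
                     gene_id: Optional[str] = None) -> str:
    """Build the associations URL for one dataset, skipping unset filters."""
    params = {
        "pos": pos,
        "nlog10p": nlog10p,
        "rsid": rsid,
        "variant_id": variant_id,
        "molecular_trait_id": molecular_trait_id,
        "gene_id": gene_id,
    }
    given = {name: value for name, value in params.items() if value is not None}
    path = f"/eqtl/api/v2/datasets/{dataset_id}/associations"
    # Keep ':' unescaped so the region stays readable in the query string.
    return f"{path}?{urlencode(given, safe=':')}" if given else path


url = associations_url("QTD000001", pos="1:100-200", nlog10p=5)
```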

jdhayhurst commented 1 year ago

HDF5 gen code changes:

all chroms in single dataset file - need to index on chrom
no need to replicate data in another way (outside of the dataset files)

kauralasoo commented 1 year ago

I have added sumstats files from a couple of datasets to this Google Drive folder: https://drive.google.com/drive/folders/1nv9ccJZe8rNDk3GeOGly-bw6wZ6u7del?usp=sharing

The dataset ids match the metadata here: https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/master/data_tables/dataset_metadata.tsv

These are relatively small files. I will add some large examples by next week.

The ones that need to be converted to HDF5 have either the *.all.tsv.gz (ge quant method) or cc.tsv.gz (other quant methods) suffix. cc here stands for "connected components" and refers to the technique that we use internally to filter large exon and other summary statistics files.
