eQTL-Catalogue / eQTL-SumStats

eQTL Catalogue Summary Statistics

Major refactoring of the eQTL Catalogue API #54

Closed kauralasoo closed 1 year ago

kauralasoo commented 2 years ago

Problem statement

Problem: The current API is designed around cross-dataset queries. As a result, all of the data has to be re-indexed every time a new dataset is added. This is not going to scale as we keep adding new datasets to the catalogue.

Proposed solution: Refactor the API to support only two types of queries:

  1. Metadata about datasets
  2. Summary statistics from a single dataset

Queries

1. Metadata about datasets

Proposed endpoint name: /eqtl/api/datasets or /eqtl/api/metadata

Fields

(Based on this existing file: https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/master/tabix/tabix_ftp_paths.tsv)

Should allow filtering based on dataset_id, study_id, study_name, tissue_ontology_id, tissue_label, condition_label, quant_method. This table is not likely to exceed a few thousand rows.

Example queries:

  1. Return all datasets where tissue_label == blood and quant_method == ge.
  2. Return all datasets from the BLUEPRINT study (study_name == BLUEPRINT).
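A minimal sketch of how these example queries could behave, assuming the metadata table is loaded as a list of dicts keyed by the field names above. The function name `filter_datasets` and the sample rows are hypothetical and only illustrate exact-match filtering; they are not taken from the real metadata file.

```python
from typing import Dict, List

# Hypothetical sample rows for illustration only.
DATASETS: List[Dict[str, str]] = [
    {"dataset_id": "QTD000001", "study_name": "BLUEPRINT", "tissue_label": "blood", "quant_method": "ge"},
    {"dataset_id": "QTD000002", "study_name": "BLUEPRINT", "tissue_label": "blood", "quant_method": "exon"},
    {"dataset_id": "QTD000003", "study_name": "ExampleStudy", "tissue_label": "liver", "quant_method": "ge"},
]


def filter_datasets(rows: List[Dict[str, str]], **filters: str) -> List[Dict[str, str]]:
    """Return the rows whose fields equal every supplied filter value."""
    return [
        row for row in rows
        if all(row.get(field) == value for field, value in filters.items())
    ]


# Example query 1: tissue_label == blood and quant_method == ge
blood_ge = filter_datasets(DATASETS, tissue_label="blood", quant_method="ge")

# Example query 2: all datasets from the BLUEPRINT study
blueprint = filter_datasets(DATASETS, study_name="BLUEPRINT")
```

Since the table should stay within a few thousand rows, a linear scan like this is probably sufficient and no index is needed for the metadata endpoint.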

Todo:

  1. Kaur to generate a new metadata file and assign unique ids to each dataset and study.
  2. ....

2. Summary statistics from a single dataset

Proposed endpoint name: /eqtl/api/associations

Queries can only be made by specifying a dataset_id, e.g. /eqtl/api/associations/QTD000001, which maps to a unique HDF5 file.

Returned fields: All fields available from the HDF5 file (https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/master/tabix/Columns.md) + neg_log10_pvalue (calculated on the fly).
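Computing neg_log10_pvalue on the fly could look roughly like the sketch below. The function name is hypothetical, and returning infinity for a p-value that has underflowed to 0 is an assumption on my part, not a decision made in this thread.

```python
import math


def neg_log10_pvalue(pvalue: float) -> float:
    """Return -log10(pvalue) for a p-value in the closed interval [0, 1]."""
    if not 0.0 <= pvalue <= 1.0:
        raise ValueError(f"p-value out of range: {pvalue}")
    if pvalue == 0.0:
        # Assumption: a stored p-value of exactly 0 means it underflowed,
        # so report it as infinitely significant instead of raising.
        return math.inf
    # abs() avoids returning -0.0 when pvalue is exactly 1.0.
    return abs(math.log10(pvalue))
```

If the API serialises to JSON, the infinity case would need a convention of its own (for example null), because standard JSON has no infinity literal.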

Probably no need to add additional dataset metadata fields (dataset_id, study_id, quant_method, etc.)?

Filtering similar to what the current API allows.

Exact match: variant, rsid, molecular_trait_id, gene_id, chromosome

Range queries: p_lower and p_upper; bp_lower and bp_upper. Currently this is done on HDF5 files split by chromosome. Can we do it on a single dataset-specific HDF5 file where all chromosomes are present? A position range only makes sense if the chromosome is also supplied. Do we need to build a joint index in HDF5 across chromosomes and positions (as in tabix)?
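To make the joint-index idea concrete, here is a sketch in plain Python, independent of HDF5. It sorts rows by a composite (chromosome, position) key and uses binary search to answer a range query. The helper names are hypothetical, and the `chromosome` and `position` field names are assumed to match the columns file. This illustrates the concept only and says nothing about how the HDF5 library would implement such an index.

```python
from bisect import bisect_left, bisect_right
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]
Key = Tuple[str, int]


def build_joint_index(rows: List[Row]) -> Tuple[List[Key], List[Row]]:
    """Sort rows by (chromosome, position) and return parallel key and row lists."""
    ordered = sorted(rows, key=lambda r: (r["chromosome"], r["position"]))
    keys = [(r["chromosome"], r["position"]) for r in ordered]
    return keys, ordered


def query_range(keys: List[Key], ordered: List[Row],
                chromosome: str, bp_lower: int, bp_upper: int) -> List[Row]:
    """Return rows on `chromosome` with bp_lower <= position <= bp_upper."""
    lo = bisect_left(keys, (chromosome, bp_lower))
    hi = bisect_right(keys, (chromosome, bp_upper))
    return ordered[lo:hi]
```

Because the chromosome is the leading part of the key, a position range without a chromosome cannot use this index, which matches the point that bp_lower and bp_upper only make sense together with a chromosome.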

jdhayhurst commented 2 years ago

I think this is a good approach.

Just to expand on the metadata endpoint (1) I think /eqtl/api/v2/datasets should return a collection of (filterable) datasets.

Example response object:

{
'datasets': [
    {
        'dataset_id': 'QTD000001',
        'study_id': 'QTS000001',
        ...
    },
    {
        'dataset_id': 'QTD000002',
        'study_id': 'QTS000002',
        ...
    }
]
}

Then there should be specific resources represented by something like this /eqtl/api/v2/datasets/QTD000001 which give more detailed responses with links:

{
    'dataset_id': 'QTD000001',
    'study_id': 'QTS000001',
    ...
    '_links': {
       'associations': {
            'href': '/eqtl/api/v2/datasets/QTD000001/associations'
            }
        }
}

If we are limiting the associations access to single datasets and never across datasets, I think the associations endpoint (2) is conceptually a sub-resource of the dataset (the dataset is the parent of its metadata and association data): /eqtl/api/v2/datasets/QTD000001/associations. Happy to hear arguments against that.

Response from /eqtl/api/v2/datasets/QTD000001/associations would be a paginated response of all associations for this dataset, as before but without the metadata - metadata could be given in a link (/eqtl/api/v2/datasets/QTD000001).
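A rough sketch of what such a paginated response could look like, with the metadata reachable through a link instead of being repeated on every row. The function name, the `start` and `size` query parameters, and the `self`, `next` and `dataset` link names are assumptions for illustration, not an agreed design.

```python
from typing import Any, Dict, List


def paginate_associations(dataset_id: str, associations: List[Dict[str, Any]],
                          start: int = 0, size: int = 20) -> Dict[str, Any]:
    """Build one page of associations plus links to the next page and the dataset."""
    if start < 0 or size < 1:
        raise ValueError("start must be >= 0 and size must be >= 1")
    dataset_url = f"/eqtl/api/v2/datasets/{dataset_id}"
    base_url = f"{dataset_url}/associations"
    links = {
        "self": {"href": f"{base_url}?start={start}&size={size}"},
        # Dataset metadata lives on the parent resource, not in each row.
        "dataset": {"href": dataset_url},
    }
    if start + size < len(associations):
        links["next"] = {"href": f"{base_url}?start={start + size}&size={size}"}
    return {"associations": associations[start:start + size], "_links": links}
```

In a real service the slice would be pushed down to the HDF5 query instead of being taken from an in-memory list; the list here only keeps the example self-contained.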

Would anyone want to filter on chromosome without a bp limit? It should be feasible to combine all chromosomes and make a joint index on chromosome and position and perhaps do the positional filtering in a chr:pos-pos style?

kauralasoo commented 2 years ago

I really like the idea of having associations as a sub-resource of datasets (/eqtl/api/v2/datasets/QTD000001/associations). Supporting only dataset-specific queries also makes it straightforward to add in other types of QTLs (e.g. protein QTLs) without changing the API. We could just represent them as datasets with different quantification methods (e.g. SomaLogic or Olink).

I think most users would want to filter on chr:pos-pos. Filtering just on the chromosome would probably return too many results to be useful. That being said, one option would be to mirror the behaviour of bcftools, which has a single -r parameter taking either chr, chr:pos, chr:beg-end or chr:beg-: https://samtools.github.io/bcftools/bcftools.html#common_options

I would then name this as a region filter.
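A sketch of how a single region filter could be parsed, accepting the four bcftools-style forms chr, chr:pos, chr:beg-end and the open-ended chr:beg-. The `Region` type and `parse_region` function are hypothetical names, and the validation rules are my assumptions. This does not reproduce the bcftools parser itself.

```python
import re
from typing import NamedTuple, Optional


class Region(NamedTuple):
    chromosome: str
    start: Optional[int]  # None means no lower bound
    end: Optional[int]    # None means no upper bound


# Groups: 1 = chromosome, 2 = first position, 3 = "-" plus optional end, 4 = end.
_REGION_RE = re.compile(r"([^:\s]+)(?::(\d+)(-(\d+)?)?)?")


def parse_region(region: str) -> Region:
    """Parse 'chr', 'chr:pos', 'chr:beg-end' or 'chr:beg-' into a Region."""
    match = _REGION_RE.fullmatch(region)
    if match is None:
        raise ValueError(f"invalid region: {region!r}")
    chromosome, first, dash_part, last = match.groups()
    if first is None:                      # 'chr': whole chromosome
        return Region(chromosome, None, None)
    start = int(first)
    if dash_part is None:                  # 'chr:pos': a single position
        return Region(chromosome, start, start)
    if last is None:                       # 'chr:beg-': open-ended
        return Region(chromosome, start, None)
    end = int(last)
    if end < start:
        raise ValueError(f"region end is before start: {region!r}")
    return Region(chromosome, start, end)  # 'chr:beg-end'
```

For example, `parse_region("1:1000-2000")` yields a region on chromosome 1 with both bounds set, while `parse_region("1")` leaves both bounds as `None`, so the API could reject or cap whole-chromosome queries if they return too many rows.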

What are the next steps? Nurlan and I are just finishing computing summary statistics for release 6 but we have not generated the HDF5 files yet. I think it would make sense to add in the chromosome and pos joint index before we start running the conversions.

kauralasoo commented 1 year ago

I have now completed dataset and study id assignment for the existing 127 datasets: https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/master/data_tables/dataset_id_map.tsv

UPDATE 14/11/22: The latest metadata file is located here: https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/master/data_tables/dataset_metadata.tsv The previous file now points to a simple id mapping file.

jdhayhurst commented 1 year ago

design

jdhayhurst commented 1 year ago

HDF5 gen code changes:

kauralasoo commented 1 year ago

I have added sumstats files from a couple of datasets to this Google Drive folder: https://drive.google.com/drive/folders/1nv9ccJZe8rNDk3GeOGly-bw6wZ6u7del?usp=sharing

The dataset ids match the metadata here: https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/master/data_tables/dataset_metadata.tsv

These are relatively small files. I will add some large examples by next week.

The ones that need to be converted to HDF5 have either the *.all.tsv.gz (ge quant method) or cc.tsv.gz (other quant methods) suffix. cc here stands for "connected components" and refers to the technique that we use internally to filter large exon and other summary statistics files.
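Selecting the files to convert by suffix could be sketched as below. The function name and the example file names are hypothetical; only the two suffixes come from the description above.

```python
from typing import List

# Suffixes of the files that should be converted to HDF5.
CONVERT_SUFFIXES = (".all.tsv.gz", "cc.tsv.gz")


def files_to_convert(filenames: List[str]) -> List[str]:
    """Keep only the file names that end with one of the convertible suffixes."""
    return [name for name in filenames if name.endswith(CONVERT_SUFFIXES)]


# Hypothetical file names, for illustration only.
example = ["QTD000001.all.tsv.gz", "QTD000002.cc.tsv.gz", "QTD000001.permuted.tsv.gz"]
selected = files_to_convert(example)
```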