GWASCatalogSearchDB

This repository provides a SQLite database designed to facilitate search for GWAS records in the GWAS Catalog database—the NHGRI-EBI Catalog of human genome-wide association studies. This is achieved by combining the EFO ontology mappings specified in the GWAS Catalog metadata with tabular representations of ontology relationships—extracted from a SemanticSQL database representation of EFO—such that users can search for GWAS Catalog records by leveraging the EFO class hierarchy.

Latest database build

Resource	Version
SearchDB	0.10.0
EFO	3.62.0
UBERON	2024-01-18
Studies	2024-02-06T15:02:09
Associations	2024-02-06T15:06:22

Using the database

The database is contained in the compressed file gwascatalog_search.db.tar.xz. After unpacking it, the database can be queried directly using any SQL database client.

git clone https://github.com/ccb-hms/GWASCatalogSearchDB.git  # clone the repository
tar xf gwascatalog_search.db.tar.xz  # unpack compressed database file 
cd GWASCatalogSearchDB/src  # change to code folder
python3 query_database.py  # run example queries

The database contains the tables depicted and described below.

version_info contains ontology and DB versions, and download timestamps of GWAS Catalog tables.
gwascatalog_metadata contains the GWAS Catalog table All studies v1.0.2.
gwascatalog_associations contains some columns from the table All associations v1.0.2.
gwascatalog_references contains details obtained from PubMed about the articles in the PUBMEDID column of the metadata table.
gwascatalog_mappings contains ontology mappings extracted from gwascatalog_metadata with each mapping in a separate row. In the original metadata there are often multiple ontology mappings provided as comma-separated strings.
efo_labels contains the following details:
- all terms in EFO, represented by their CURIEs (Subject column).
- term labels (Object column).
- term IRIs (IRI column).
- disease locations associated with each term, if available (DiseaseLocation column).
- count of how many metadata points are directly mapped to those ontology terms (Direct column).
- count of how many metadata points are indirectly mapped to those terms via a more specific term in the hierarchy (Inherited column).
efo_synonyms contains the potentially multiple synonyms (in the Object column) of each EFO term (given in the Subject column).

efo_edges and efo_entailed_edges contain, respectively, the asserted and entailed IS-A/SubClassOf relationships in EFO of the form Subject IS-A Object, where Subject—the child/subclass term—is represented in the 'Subject' column. And Object—the parent/superclass term—is represented in the 'Object' column.

efo_edges allows querying for direct parents of a term, e.g., SELECT Object FROM efo_edges WHERE Subject='EFO:1000652' ('acute pancreatitis') returns:

Object
EFO:0000278 ('pancreatitis')
MONDO:0020683 ('acute disease')

efo_entailed_edges allows querying for entailed ancestors of a term, e.g., SELECT Object FROM efo_entailed_edges WHERE Subject='EFO:1000652' ('acute pancreatitis') returns:

Object
EFO:0000278 ('pancreatitis')
MONDO:0020683 ('acute disease')
EFO:0009605 ('pancreas disease')
EFO:0001379 ('endocrine system disease')
EFO:0009903 ('inflammatory disease')
EFO:0000405 ('digestive system disease')
EFO:0000408 ('disease')
BFO:0000016 ('disposition')
BFO:0000020 ('material property')
EFO:0000001 ('experimental factor')

So efo_entailed_edges contains, for a given Subject, all parent terms up to the root of the ontology. The table is called “entailed” because it potentially includes parents derived after reasoning over the ontology, which would not be surfaced by recursively searching over efo_edges.

ETL Process Overview

We programmatically obtain two tables from GWAS Catalog's Downloads webpage: 'All studies v1.0.2' and 'All associations v1.0.2', and then we store them in our database without modification as gwascatalog_metadata and gwascatalog_associations, respectively. The first table contains details about GWAS studies registered in the GWAS Catalog, while the second contains details about the SNP-trait associations extracted from those studies.

In the gwascatalog_metadata table, there are sometimes multiple ontology mappings for a single study, which are represented as a comma-separated list of CURIEs in each study's row. This representation makes search over such values challenging. So, from the gwascatalog_metadata table, we extract all ontology mappings and with them we create another table called gwascatalog_mappings, where each ontology mapping is represented in its own row. The rationale behind this new table is that it can be extended to include additional mappings from different sources (as we have done in the latest version of our database), and then users can select their preferred mapping source(s) to use for search.

From the gwascatalog_metadata we extract the PubMed ID associated with each study, and we use the metapub Python package to extract publication details such as titles, abstracts, and journal names, which we then store in the table gwascatalog_references.

For the main ontology used to annotate traits in the GWAS Catalog, EFO, our database contains a table called efo_labels with details about all terms in EFO, such as their labels, identifiers, and disease locations associated with them (if available). In this table we also include the count of how many GWAS studies are directly mapped to each ontology term, and of how many studies are indirectly mapped to each term, i.e., they are mapped to a more specific term in the hierarchy.

Our database includes a tabular representation of an ontology's class hierarchy before and after reasoning, in the tables efo_edges and efo_entailed_edges, respectively. So for a particular term, we can obtain its direct, asserted parents using efo_edges, and we can obtain its indirect, entailed parents (i.e., ancestors) using efo_entailed_edges. The table efo_synonyms contains the potentially multiple synonyms of each EFO term. Implementation-wise, we build our tables by extracting the minimum required details from a SemanticSQL representation of EFO.

Our pipeline allows the inclusion of any additional user-specified ontologies. In our case, we include the Uberon ontology so that we can obtain more details about the anatomical structures associated with (EFO) diseases. The tables included in the database for such extra ontologies are the same as for the primary annotation ontology, except that the labels table does not include disease location details or mapping counts.

Finally we include a table in our database specifying the versions of the ontologies used in the pipeline, and the dates and times when the metadata tables were downloaded from the GWAS Catalog.

Extracting disease locations related to EFO terms

The disease locations contained in the DB are extracted directly from EFO statements of the form: X has_disease_location Y, where Y is (typically) an UBERON term representing an anatomical location. If a term does not have an explicit has_disease_location relationship, we determine if it has an inferred one. This is done by recursively checking if a parent in the ontology hierarchy has such a location, until one (or none) location is found. For example, 'bronchitis' (EFO:0009661) does not have an explicitly stated location in EFO, but it inherits one from its immediate parent 'bronchial disease', which has a _'has_diseaselocation' relationship to 'bronchus'.

Querying the database

src/query_database.py contains a search function (described below) to query the gwascatalog_search.db database for records annotated/mapped to a user-specified set of EFO traits.

# search for GWAS Catalog records annotated with pancreatic or infectious disease
resources_annotated_with_terms(search_terms=['EFO:0009605', 'EFO:0005741'],
                               include_subclasses=True, 
                               direct_subclasses_only=False)

The function parameters are:

db_cursor— cursor for database connection
search_terms— a collection of ontology terms to search on
include_subclasses— include resources annotated with subclasses of the given search terms, otherwise only resources explicitly annotated with those terms are returned
direct_subclasses_only— include only the direct subclasses of the given search terms, otherwise all the resources annotated with entailed subclasses of the given terms are returned

Each search term must be an EFO term specified by its compact uniform resource identifier (CURIE). For example EFO:0005741 is the short form of http://www.ebi.ac.uk/efo/EFO_0005741.

Examples

Here we exemplify the different possible search options, using 'EFO:0009605' (pancreas disease) as an example search term.

Obtaining resources mapped directly to a search term

To obtain GWAS Catalog records that are explicitly annotated with a search term, we call the function as such:

resources_annotated_with_terms(search_terms=['EFO:0009605'],
                               include_subclasses=False, 
                               direct_subclasses_only=False)

Result = 11 records. The underlying query performs a lookup over the gwascatalog_mappings table to find records where MAPPED_TRAIT_CURIE = 'EFO:0009605'.

Obtaining resources mapped to direct subclasses of a search term

To include resources annotated with EFO terms that are direct subclasses of the search term, we set include_subclasses=True to include subclasses and then direct_subclasses_only=True.

resources_annotated_with_terms(search_terms=['EFO:0009605'],
                               include_subclasses=True, 
                               direct_subclasses_only=True)

Result = 21 records. The underlying query performs a lookup over the efo_edges table to find terms in the Subject column (call these Sub) that have in the Object column the search term EFO:0009605. Then the function returns the resources mapped to any Sub term.

Obtaining resources mapped to any subclass of a search term

To include resources annotated with the search term or any of its inherited/entailed subclasses, we set direct_subclasses_only=False, as such:

resources_annotated_with_terms(search_terms=['EFO:0009605'],
                               include_subclasses=True, 
                               direct_subclasses_only=True)

Result = 408 records. The underlying query performs a lookup over the efo_entailed_edges table to find Subject terms Sub where Object='EFO:0009605', and then returns resources mapped to any Sub term. Because the entailed_edges table contains all entailed parents (via reasoning) for all ontology terms, this query returns resources annotated with any term that is entailed to be a subclass of the search term.

Building the database

The database can be built from scratch by running the Python module below.

cd src
python3 build_gwascatalog_db.py

This generates gwascatalog_search.db.tar.xz containing the SQLite3 database gwascatalog_search.db.

ccb-hms / GWASCatalogSearchDB

readme