This repository provides a SQLite database designed to facilitate search for GWAS records in the GWAS Catalog database—the NHGRI-EBI Catalog of human genome-wide association studies. This is achieved by combining the EFO ontology mappings specified in the GWAS Catalog metadata with tabular representations of ontology relationships—extracted from a SemanticSQL database representation of EFO—such that users can search for GWAS Catalog records by leveraging the EFO class hierarchy.
Resource | Version |
---|---|
SearchDB | 0.10.0 |
EFO | 3.62.0 |
UBERON | 2024-01-18 |
Studies | 2024-02-06T15:02:09 |
Associations | 2024-02-06T15:06:22 |
The database is contained in the compressed file gwascatalog_search.db.tar.xz
. After unpacking it, the database can be queried directly using any SQL database client.
git clone https://github.com/ccb-hms/GWASCatalogSearchDB.git # clone the repository
tar xf gwascatalog_search.db.tar.xz # unpack compressed database file
cd GWASCatalogSearchDB/src # change to code folder
python3 query_database.py # run example queries
The database contains the tables depicted and described below.
version_info
contains ontology and DB versions, and download timestamps of GWAS Catalog tables.gwascatalog_metadata
contains the GWAS Catalog table All studies v1.0.2.gwascatalog_associations
contains some columns from the table All associations v1.0.2. gwascatalog_references
contains details obtained from PubMed about the articles in the PUBMEDID
column of the metadata table. gwascatalog_mappings
contains ontology mappings extracted from gwascatalog_metadata
with each mapping in a separate row. In the original metadata there are often multiple ontology mappings provided as comma-separated strings.efo_labels
contains the following details:
Subject
column). Object
column). IRI
column).DiseaseLocation
column). Direct
column). Inherited
column).efo_synonyms
contains the potentially multiple synonyms (in the Object
column) of each EFO term (given in the Subject
column).efo_edges
and efo_entailed_edges
contain, respectively, the asserted and entailed IS-A/SubClassOf relationships in EFO of the form Subject IS-A Object
, where Subject
—the child/subclass term—is represented in the 'Subject' column. And Object
—the parent/superclass term—is represented in the 'Object' column.
efo_edges
allows querying for direct parents of a term, e.g., SELECT Object FROM efo_edges WHERE Subject='EFO:1000652' ('acute pancreatitis')
returns:Object |
---|
EFO:0000278 ('pancreatitis') |
MONDO:0020683 ('acute disease') |
efo_entailed_edges
allows querying for entailed ancestors of a term, e.g., SELECT Object FROM efo_entailed_edges WHERE Subject='EFO:1000652' ('acute pancreatitis')
returns:Object |
---|
EFO:0000278 ('pancreatitis') |
MONDO:0020683 ('acute disease') |
EFO:0009605 ('pancreas disease') |
EFO:0001379 ('endocrine system disease') |
EFO:0009903 ('inflammatory disease') |
EFO:0000405 ('digestive system disease') |
EFO:0000408 ('disease') |
BFO:0000016 ('disposition') |
BFO:0000020 ('material property') |
EFO:0000001 ('experimental factor') |
So efo_entailed_edges
contains, for a given Subject, all parent terms up to the root of the ontology. The table is called “entailed” because it potentially includes parents derived after reasoning over the ontology, which would not be surfaced by recursively searching over efo_edges
.
We programmatically obtain two tables from GWAS Catalog's Downloads webpage: 'All studies v1.0.2' and 'All associations v1.0.2', and then we store them in our database without modification as gwascatalog_metadata
and gwascatalog_associations
, respectively. The first table contains details about GWAS studies registered in the GWAS Catalog, while the second contains details about the SNP-trait associations extracted from those studies.
In the gwascatalog_metadata
table, there are sometimes multiple ontology mappings for a single study, which are represented as a comma-separated list of CURIEs in each study's row. This representation makes search over such values challenging. So, from the gwascatalog_metadata
table, we extract all ontology mappings and with them we create another table called gwascatalog_mappings
, where each ontology mapping is represented in its own row. The rationale behind this new table is that it can be extended to include additional mappings from different sources (as we have done in the latest version of our database), and then users can select their preferred mapping source(s) to use for search.
From the gwascatalog_metadata
we extract the PubMed ID associated with each study, and we use the metapub Python package to extract publication details such as titles, abstracts, and journal names, which we then store in the table gwascatalog_references
.
For the main ontology used to annotate traits in the GWAS Catalog, EFO, our database contains a table called efo_labels
with details about all terms in EFO, such as their labels, identifiers, and disease locations associated with them (if available). In this table we also include the count of how many GWAS studies are directly mapped to each ontology term, and of how many studies are indirectly mapped to each term, i.e., they are mapped to a more specific term in the hierarchy.
Our database includes a tabular representation of an ontology's class hierarchy before and after reasoning, in the tables efo_edges
and efo_entailed_edges
, respectively. So for a particular term, we can obtain its direct, asserted parents using efo_edges
, and we can obtain its indirect, entailed parents (i.e., ancestors) using efo_entailed_edges
. The table efo_synonyms
contains the potentially multiple synonyms of each EFO term. Implementation-wise, we build our tables by extracting the minimum required details from a SemanticSQL representation of EFO.
Our pipeline allows the inclusion of any additional user-specified ontologies. In our case, we include the Uberon ontology so that we can obtain more details about the anatomical structures associated with (EFO) diseases. The tables included in the database for such extra ontologies are the same as for the primary annotation ontology, except that the labels table does not include disease location details or mapping counts.
Finally we include a table in our database specifying the versions of the ontologies used in the pipeline, and the dates and times when the metadata tables were downloaded from the GWAS Catalog.
The disease locations contained in the DB are extracted directly from EFO statements of the form: X has_disease_location Y
, where Y is (typically) an UBERON term representing an anatomical location. If a term does not have an explicit has_disease_location
relationship, we determine if it has an inferred one. This is done by recursively checking if a parent in the ontology hierarchy has such a location, until one (or none) location is found. For example, 'bronchitis' (EFO:0009661)
does not have an explicitly stated location in EFO, but it inherits one from its immediate parent 'bronchial disease', which has a _'has_diseaselocation' relationship to 'bronchus'.
src/query_database.py
contains a search function (described below) to query the gwascatalog_search.db
database for records annotated/mapped to a user-specified set of EFO traits.
# search for GWAS Catalog records annotated with pancreatic or infectious disease
resources_annotated_with_terms(search_terms=['EFO:0009605', 'EFO:0005741'],
include_subclasses=True,
direct_subclasses_only=False)
The function parameters are:
db_cursor
— cursor for database connectionsearch_terms
— a collection of ontology terms to search oninclude_subclasses
— include resources annotated with subclasses of the given search terms,
otherwise only resources explicitly annotated with those terms are returneddirect_subclasses_only
— include only the direct subclasses of the given search terms,
otherwise all the resources annotated with entailed subclasses of the given terms are returnedEach search term must be an EFO term specified by its compact uniform resource identifier (CURIE). For example EFO:0005741
is the short form of http://www.ebi.ac.uk/efo/EFO_0005741.
Here we exemplify the different possible search options, using 'EFO:0009605' (pancreas disease)
as an example search term.
To obtain GWAS Catalog records that are explicitly annotated with a search term, we call the function as such:
resources_annotated_with_terms(search_terms=['EFO:0009605'],
include_subclasses=False,
direct_subclasses_only=False)
Result = 11 records. The underlying query performs a lookup over the gwascatalog_mappings
table to find records where MAPPED_TRAIT_CURIE = 'EFO:0009605'
.
To include resources annotated with EFO terms that are direct subclasses of the search term, we set include_subclasses=True
to include subclasses and then direct_subclasses_only=True
.
resources_annotated_with_terms(search_terms=['EFO:0009605'],
include_subclasses=True,
direct_subclasses_only=True)
Result = 21 records. The underlying query performs a lookup over the efo_edges
table to find terms in the Subject
column (call these Sub
) that have in the Object
column the search term EFO:0009605
. Then the function returns the resources mapped to any Sub
term.
To include resources annotated with the search term or any of its inherited/entailed subclasses, we set direct_subclasses_only=False
, as such:
resources_annotated_with_terms(search_terms=['EFO:0009605'],
include_subclasses=True,
direct_subclasses_only=True)
Result = 408 records. The underlying query performs a lookup over the efo_entailed_edges
table to find Subject
terms Sub
where Object='EFO:0009605'
, and then returns resources mapped to any Sub
term. Because the entailed_edges table contains all entailed parents (via reasoning) for all ontology terms, this query returns resources annotated with any term that is entailed to be a subclass of the search term.
The database can be built from scratch by running the Python module below.
cd src
python3 build_gwascatalog_db.py
This generates gwascatalog_search.db.tar.xz
containing the SQLite3 database gwascatalog_search.db
.