ccb-hms / GWASCatalogSearchDB

0 stars 0 forks source link

can't replicate the data in efo_labels #6

Closed rgentlem closed 1 year ago

rgentlem commented 1 year ago

Hi Rafael and Jason,

I see this in the efo_labels table: (renamed in r to efo_df), but there are 5 direct and 362 indirect. efo_df["EFO:0005140",] Subject Object IRI DiseaseLocation Direct Inherited EFO:0005140 EFO:0005140 autoimmune disease http://www.ebi.ac.uk/efo/EFO_0005140 5 362 Synonyms EFO:0005140 autoimmunity Matches EFO:0005140 Pediatric autoimmune diseases : Autoimmune traits : Autoimmune traits (pleiotropy) : Autoimmune traits : Autoimmune disease and mental disorder

When I run the db query (in R), using what I think is essentially the code in your file: query_database.py

ax = resources_annotated_with_term("EFO:0005140", include_subclasses = FALSE)

dim(ax) [1] 3 5 efo_df[1,] Subject Object ORPHANET.ORDO:100006 ORPHANET.ORDO:100006 Hereditary cerebral hemorrhage with amyloidosis, Dutch type IRI DiseaseLocation Direct Inherited ORPHANET.ORDO:100006 http://www.orpha.net/ORDO/Orphanet_100006 0 0 Synonyms Matches ORPHANET.ORDO:100006 HCHWA-D : HCHWA, Dutch type

Do you happen to have a test suite that compares the values in efo_labels file to some specific set of queries from the gwascatalog_metadata table? I would find it helpful if we had a couple of tests that we could just run every time we update files so I can be sure I have copied the metadata and other files into the same place. And to make sure that derived results, such as Direct and Inherited counts are accurate.

rsgoncalves commented 1 year ago

A couple of issues are occurring here: 1) Re: count of direct mappings—The original gwascatalog_metadata table contains multiple ontology mappings (in MAPPED_TRAIT column) for the same row. So the direct mappings obtained from searching the gwascatalog_metadata table will be less than expected. By searching gwascatalog_mappings instead, we get the correct (5) resources directly mapped to EFO:0005140.

2) Re: count of inherited mappings—This issue is also related to the occurrence of multiple mappings per study. For example, GCST003097 has 6 ontology mappings to different kinds of autoimmune disease. Querying the gwascatalog_metadata will get us less studies than expected because we miss out the rows that have multiple mappings in them. If we modify the query to select from the gwascatalog_mappings table we get the correct studies, but the result set includes multiple rows (all 6, in our example) for multiple mappings of the same study. So the problem is in the SQL query... which we'll fix. Another contributor to the discrepancy in counts is that the counting code was given a more recent version of EFO than the one used everywhere else in the pipeline. Meanwhile I've also checked that the count of Inherited mappings given in efo_labels is correct— initially I thought the problem might be in the counting code, but it checks out, and give us the number of unique studies by their STUDY.ACCESSION ID.

We will put some effort in creating a test suite for the resource to flag issues such as these.

rsgoncalves commented 1 year ago

The following seems to be giving the expected results for all (inherited) mappings to EFO:0005140.

SELECT DISTINCT
    study.`STUDY.ACCESSION`,
    study.`DISEASE.TRAIT`,
    study.MAPPED_TRAIT,
    study.MAPPED_TRAIT_URI
FROM
    `gwascatalog_metadata` study
WHERE
    study.`STUDY.ACCESSION` IN (
        SELECT DISTINCT mapping.`STUDY.ACCESSION`
        FROM `gwascatalog_mappings` mapping
                 LEFT JOIN efo_entailed_edges ee ON (mapping.MAPPED_TRAIT_CURIE = ee.Subject)
        WHERE (mapping.MAPPED_TRAIT_CURIE = 'EFO:0005140' OR ee.Object = 'EFO:0005140')
    );
rgentlem commented 1 year ago

thanks - I will look for an update in the src/query_database.py function so that I can emulate it. Are you planning to produce a new gwascatalog_metadata table to incorportate the new EFO (more recent?) EFO into all parts of the pipeline? also please let me know where the tests will be detailed, so I can try to replicate appropriate ones to R

rsgoncalves commented 1 year ago

For now the next release will still be based on the EFO version currently provided in the SemanticSQL build of EFO we are using. There was just one module in our pipeline that was mistakenly using the latest EFO. I've posted an issue to see if these DB builds are getting updated regularly, but perhaps we will implement an alternative mechanism such that we can keep up to date with ontology versions.

rgentlem commented 1 year ago

thanks - I think that should be listed somewhere as a liability. We likely need to build our own versions of these DBs if/when our tools start to see much use. For now, I think we can live with what we have.

I think there is still a need to update the code in query_database.py https://github.com/ccb-hms/GWASCatalogSearchDB/blob/main/src/query_database.py to reflect the code snippet above and hence get the right mapping counts, or am I missing something?

On Tue, Aug 15, 2023 at 8:11 PM Rafael Gonçalves @.***> wrote:

For now the next release will still be based on the EFO version currently provided in the SemanticSQL build of EFO we are using. There was just one module in our pipeline that was mistakenly using the latest EFO. I've posted an issue to see if these DB builds are getting updated regularly, but perhaps we will implement an alternative mechanism such that we can keep up to date with ontology versions.

— Reply to this email directly, view it on GitHub https://github.com/ccb-hms/GWASCatalogSearchDB/issues/6#issuecomment-1679785403, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC7TWA6CSDPBD5I3FCJQZBTXVQF2FANCNFSM6AAAAAA3PWODYA . You are receiving this because you authored the thread.Message ID: @.***>

-- Robert Gentleman @.***

rsgoncalves commented 1 year ago

I agree and I think we can replace that external dependency without a ton of effort — we have a module in text2term that can be extended to get the details we are currently fetching from these external DBs (that text2term doesn't already collect).

The SQL in query_database is now updated, and a new version of the whole database is out:

https://github.com/ccb-hms/GWASCatalogSearchDB/releases/tag/v0.5.0 which should resolve this issue.

Please have a look and let me know if you still encounter discrepancies in the numbers.