NCATS-Tangerine / translator-api-registry

This repo hosts the API metadata for the Translator project

AGR API Yaml #127

Closed mnarayan1 closed 11 months ago

mnarayan1 commented 1 year ago

AGR API yaml file, for gene-disease relationships. Addresses this issue.

Notes:

Problems: Using this API record, I'm assuming that querying the gene FB:FBgn0038376 should return the disease DOID:9970 (dyschromatosis universalis hereditaria). This is the query I ran:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:Gene"],
                    "ids": ["FB:FBgn0038376"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

However, BTE is retrieving 0 successful results. My local installation of BTE is working fine, so I'm assuming that something is wrong with the annotations themselves. How can I fix this?

andrewsu commented 1 year ago

@mnarayan1 on quick glance, your TRAPI query and your smartAPI annotation look good to me. When you say "My local installation of BTE is working fine" I assume you've gotten local overrides working on your local instance? And do you see zero results for other gene identifiers (e.g., HGNC, wormbase, xenbase, etc.)?
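The cross-namespace check suggested here can be scripted. The following is a minimal sketch, not part of the PR: it rebuilds the one-hop Gene -> Disease query from above for any gene CURIE and posts it to a local BTE. The URL `http://localhost:3000/v1/query` and the helper names `build_query` and `run_query` are assumptions for illustration; adjust them to the local setup.

```python
import json
import urllib.request

# Assumption: a local BTE instance is listening here; change as needed.
BTE_URL = "http://localhost:3000/v1/query"


def build_query(gene_curie: str) -> dict:
    """Build the one-hop Gene -> Disease TRAPI query for a single gene CURIE."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"categories": ["biolink:Gene"], "ids": [gene_curie]},
                    "n1": {"categories": ["biolink:Disease"]},
                },
                "edges": {"e01": {"subject": "n0", "object": "n1"}},
            }
        }
    }


def run_query(gene_curie: str, url: str = BTE_URL) -> dict:
    """POST the query and return the parsed response (needs a running BTE)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_query(gene_curie)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `run_query("FB:FBgn0038376")` and `run_query("MGI:1096330")` in turn and comparing `len(response["message"]["results"])` would show whether the zero-result behavior is specific to one namespace.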

mnarayan1 commented 1 year ago

@andrewsu The other gene identifiers are not working either. I have local overrides working on my local instance, and BTE was able to successfully load AGR into smartapi_specs.

Here is the message I get when I try to run the above query.

```
{
  "description": "Query processed successfully, retrieved 0 results.",
  "schema_version": "1.4.0",
  "biolink_version": "3.5.0",
  "workflow": [ { "id": "lookup" } ],
  "message": {
    "query_graph": {
      "nodes": {
        "n0": { "ids": [ "MGI:1096330" ] },
        "n1": { "categories": [ "biolink:Disease" ] }
      },
      "edges": {
        "e01": { "subject": "n0", "object": "n1" }
      }
    },
    "knowledge_graph": { "nodes": {}, "edges": {} },
    "results": []
  },
  "logs": [
    { "timestamp": "2023-07-31T17:29:29.435Z", "level": "INFO", "message": "Expanded ids for node n0: (1 ids -> 1 ids)", "code": null },
    { "timestamp": "2023-07-31T17:29:32.416Z", "level": "INFO", "message": "Node n0 with id [MGI:1096330] assigned category [biolink:Gene] inferred from id.", "code": null },
    { "timestamp": "2023-07-31T17:29:32.417Z", "level": "DEBUG", "message": "BTE identified 2 qNodes from your query graph", "code": null },
    { "timestamp": "2023-07-31T17:29:32.417Z", "level": "DEBUG", "message": "BTE identified 1 qEdges from your query graph", "code": null },
    { "timestamp": "2023-07-31T17:29:32.426Z", "level": "DEBUG", "message": "Edge manager is managing 1 qEdges.", "code": null },
    { "timestamp": "2023-07-31T17:29:32.426Z", "level": "DEBUG", "message": "Edge manager is sending next qEdge 'e01' for execution.", "code": null },
    { "timestamp": "2023-07-31T17:29:32.426Z", "level": "INFO", "message": "Executing e01: n0 --> n1", "code": null },
    { "timestamp": "2023-07-31T17:29:32.801Z", "level": "DEBUG", "message": "REDIS cache is not enabled.", "code": null },
    { "timestamp": "2023-07-31T17:29:32.802Z", "level": "DEBUG", "message": "BTE is trying to find metaKG edges (smartAPI registry, x-bte annotation) connecting from Gene to Disease with predicate undefined", "code": null },
    { "timestamp": "2023-07-31T17:29:32.817Z", "level": "DEBUG", "message": "BTE found 9 metaKG edges corresponding to e01. These metaKG edges comes from 1 unique APIs. They are BioThings AGR API", "code": null },
    { "timestamp": "2023-07-31T17:29:32.820Z", "level": "DEBUG", "message": "BTE found 1 metaKG for this batch.", "code": null },
    { "timestamp": "2023-07-31T17:29:32.820Z", "level": "DEBUG", "message": "Resolving ID feature is turned on", "code": null },
    { "timestamp": "2023-07-31T17:29:32.820Z", "level": "DEBUG", "message": "call-apis: 1 planned queries for edge e01", "code": null },
    { "timestamp": "2023-07-31T17:29:33.827Z", "level": "DEBUG", "message": "Successful POST https://biothings.ncats.io/agr (1 ID): Gene > gene_associated_with_condition > Disease (obtained 0 records, took 967ms)", "code": null },
    { "timestamp": "2023-07-31T17:29:33.827Z", "level": "DEBUG", "message": "call-apis: Total number of records returned for this query is 0", "code": null },
    { "timestamp": "2023-07-31T17:29:33.827Z", "level": "DEBUG", "message": "call-apis: qEdge queries complete in 1s", "code": null },
    { "timestamp": "2023-07-31T17:29:33.828Z", "level": "INFO", "message": "e01 execution: 1 queries (1 success/0 fail) and (0) cached qEdges return (0) records", "code": null },
    { "timestamp": "2023-07-31T17:29:33.829Z", "level": "WARNING", "message": "qEdge (e01) got 0 records. Your query terminates.", "code": null }
  ]
}
```

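The key line in this log is the `Successful POST ... (obtained 0 records ...)` entry: BTE found the AGR operations and the API call succeeded, but the API returned nothing. A small sketch for pulling those entries out of a response is shown below. The regular expression is written against the single log line visible above, so it is an assumption that other BTE versions use the same wording; `summarize_subqueries` is a hypothetical helper name.

```python
import re

# Written against the one "Successful POST ..." log line shown in this thread;
# other BTE versions may phrase this line differently.
RECORDS_PATTERN = re.compile(r"Successful (\w+) (\S+) .*\(obtained (\d+) records")


def summarize_subqueries(trapi_response: dict) -> list:
    """Return one entry per sub-query log line: HTTP method, URL, record count."""
    summary = []
    for entry in trapi_response.get("logs", []):
        match = RECORDS_PATTERN.search(entry.get("message") or "")
        if match:
            summary.append(
                {
                    "method": match.group(1),
                    "url": match.group(2),
                    "records": int(match.group(3)),
                }
            )
    return summary
```

A sub-query that reports success with 0 records points at the request that BTE generated from the annotation (IDs, scopes, fields) rather than at BTE failing to load the annotation.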
colleenXu commented 12 months ago

@mnarayan1

Sorry for such a belated response. Are you still available to work on this issue? If not, it's not a problem - I'll merge the PR which will preserve the record of work you've done, then add commits...

I've found the reasons why the x-bte annotation wasn't working, and I have a list of proposed fixes (the minimum needed to get the annotation working):

(1) writing separate sets of operations for each data subset/gene-ID-namespace combo

This is necessary for the current x-bte annotation because the different data subsets represent different relationships that we can assign different biolink predicates to. Also, the different ID-namespaces need to be handled differently (see next points)...

Notes:

* that could mean a combinatorial explosion of operations >.<. We can cut down by only writing operations if they cover > 5 records/documents.
* there are 4 data subsets that we could annotate (not negation: `agr.biomarker_via_orthology`, `agr.implicated_via_orthology`, `agr.is_implicated_in`, `agr.is_marker_for`)
* multiple gene ID-namespaces are involved (MGI, RGD, SGD, etc). Madhumita has already listed them in yaml comments
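To make the size of that combination space concrete, here is a minimal sketch that enumerates subset/namespace pairs and applies the "> 5 records" cutoff. The function name `operations_to_write` and the `record_counts` mapping are hypothetical; real counts would have to come from querying the AGR API, and none are assumed here.

```python
from itertools import product

# The four non-negated data subsets named in the notes above.
SUBSETS = [
    "agr.biomarker_via_orthology",
    "agr.implicated_via_orthology",
    "agr.is_implicated_in",
    "agr.is_marker_for",
]

MIN_RECORDS = 5  # cutoff from the notes: keep combos covering > 5 records


def operations_to_write(namespaces, record_counts):
    """List (subset, namespace) combos that cover more than MIN_RECORDS records.

    record_counts maps (subset, namespace) -> number of documents; combos
    missing from the mapping are treated as having 0 documents.
    """
    return [
        combo
        for combo in product(SUBSETS, namespaces)
        if record_counts.get(combo, 0) > MIN_RECORDS
    ]
```

With only the three namespaces named above (MGI, RGD, SGD) there are already 4 × 3 = 12 candidate operations before filtering, which is why the cutoff matters.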

(2) for requestBody.body.q: use replPrefix() so BTE adds the prefixes needed for the querying

BTE doesn't always automatically add prefixes to IDs when generating the queries. It looks like for this API, all the IDs (field `_id`) have prefixes that need adding (gene namespaces and DOID).

Example:

```
requestBody:
  body:
    ## API data has prefix
    ## joinSafe is only needed if the delimiter isn't a comma
    q: "{{ queryInputs | replPrefix('MGI') }}"
    scopes: _id
```
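The intended effect of the template can be illustrated outside BTE. The sketch below is a rough Python emulation, not BTE's implementation: `repl_prefix` is a hypothetical helper that makes sure each ID carries the given prefix, and `build_query_body` form-encodes a BioThings-style batch query on `_id` using the `q` and `scopes` parameters from the example above.

```python
from urllib.parse import urlencode


def repl_prefix(curie_or_id: str, prefix: str, delimiter: str = ":") -> str:
    """Return the ID with exactly one `prefix`, replacing any existing prefix.

    Hypothetical helper: it mimics the effect described above, not BTE's code.
    """
    local_id = curie_or_id.split(delimiter, 1)[-1]
    return f"{prefix}{delimiter}{local_id}"


def build_query_body(ids, prefix: str) -> str:
    """Form-encode a batch query that matches prefixed IDs against `_id`."""
    prefixed = [repl_prefix(i, prefix) for i in ids]
    return urlencode({"q": ",".join(prefixed), "scopes": "_id"})
```

If the API stores `_id` values such as `MGI:1096330`, a request that sends only `1096330` matches nothing, which is consistent with the "obtained 0 records" log entry earlier in the thread.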

(3) parameters.fields adjustments

* fields (besides `_id`) are missing the root field: they should start with `agr.`
* We can add `agr.symbol` for each operation. This may be useful since Translator's [NodeNorm](https://smart-api.info/ui/400f7c11028ff36f460af4ea85dc72f5) may not support every namespace (could check [here](https://github.com/biothings/biothings_explorer/issues/735#issuecomment-1751919208) or put IDs into the endpoint)

(4) response-mapping adjustments

Right now, it doesn't work because: (1) many references are to `x-bte-response-mapping/gene` but that doesn't exist (the two objects in response-mapping are `drug` and `disease`), and (2) the `drug` object includes multiple output fields, which currently isn't supported in x-bte annotation/BTE...

To fix:

* 1 response-mapping object per output field (so `agr.biomarker_via_orthology.doid` and `agr.implicated_via_orthology.doid` would be in separate objects)
* and 1 response-mapping object per ID-namespace (so `RGD: _id` and `MGI: _id` would be in separate objects)
* make sure the response_mapping ref for each operation points to an existing object in the `x-bte-response-mapping` section
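Both problems can be caught mechanically before loading the yaml into BTE. The following is a minimal sketch under stated assumptions: the spec has already been parsed into a Python dict, the mappings live under `components` in `x-bte-response-mapping`, and refs take the form `#/components/x-bte-response-mapping/<name>`. The function names are hypothetical, and treating every key of a mapping object as an output field is a simplification.

```python
# Assumed ref layout; adjust if the spec nests the mappings elsewhere.
MAPPING_PREFIX = "#/components/x-bte-response-mapping/"


def find_refs(node):
    """Yield every `$ref` string found anywhere in a parsed YAML/JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                yield value
            else:
                yield from find_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_refs(item)


def check_response_mappings(spec: dict) -> dict:
    """Report dangling response-mapping refs and mapping objects with >1 key."""
    mappings = spec.get("components", {}).get("x-bte-response-mapping", {})
    dangling = sorted(
        {
            ref
            for ref in find_refs(spec)
            if ref.startswith(MAPPING_PREFIX)
            and ref[len(MAPPING_PREFIX):] not in mappings
        }
    )
    multi_field = sorted(name for name, body in mappings.items() if len(body) > 1)
    return {"dangling_refs": dangling, "multi_field_mappings": multi_field}
```

On a spec shaped like the one described here, the report would be expected to list the ref to `gene` as dangling and flag `drug` as a multi-field mapping.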

colleenXu commented 12 months ago

And a note (mostly to my future self), here's the other stuff I noticed. It's not essential now, but will be for getting the AGR SmartAPI yaml fully ready


- `version`: I'm not sure if this is valid. The metadata endpoint seems to show that the data download is from 2021?
- `info.x-translator.infores`: this needs to be a separate new one for this api, and registered in the infores registry
- `info.x-translator.biolink-version`: this can be updated to 3.5.3
- `servers.url`: Production server url should (?) be changed to http (right now it's https, which makes it the same as the encrypted one)
- For operations, we could likely add a qualifier for species o_0 since each namespace is species-specific! That's cool!
- For the operation's `source`: does `infores:agrkb` exist in the registry? Or is it AGR?

colleenXu commented 11 months ago

After discussion with Andrew, we've decided to merge this PR and I'll proceed with updating the yaml to complete https://github.com/biothings/biothings_explorer/issues/260