Closed colleenXu closed 8 months ago
This should be solvable with a custom pairCurieWithAPIResponse function. I can work on this in the JQ and/or JavaScript transformer for CTD.
Here is the pairCurieWithAPIResponse JQ that solves this problem:
reduce (.response | .[]) as $item ({}; .[generateCurie($edge.association.input_id; $item.Input | ascii_upcase)] = [] + .[generateCurie($edge.association.input_id; $item.Input | ascii_upcase)] + [$item]) | map_values([.])
I'll push this to the JQ branch shortly, but I need to double-check that the "Input" field is present in the responses to all CTD queries.
(this pair function could also be set in the yaml for an operation via transformer.pair_jq)
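For readers less familiar with JQ, here is a rough Python sketch of what the `reduce` above does: group each association under the CURIE built from its upper-cased "Input" value. `generate_curie` is a hypothetical stand-in for BTE's `generateCurie` JQ helper (I'm assuming it prepends the prefix only when it's missing), and the two sample records are abbreviated from the chemical2gene example in this issue.

```python
from collections import defaultdict


def generate_curie(prefix: str, raw_id: str) -> str:
    """Hypothetical stand-in for BTE's generateCurie JQ helper.

    Assumption: the prefix is prepended only when the ID doesn't already carry it.
    """
    if raw_id.upper().startswith(prefix.upper() + ":"):
        return raw_id
    return f"{prefix}:{raw_id}"


def pair_curie_with_response(response: list, input_prefix: str) -> dict:
    """Group associations by the input ID they matched (the "Input" field)."""
    grouped = defaultdict(list)
    for item in response:
        # CTD returns "Input" in lowercase, so upper-case it before building the CURIE
        curie = generate_curie(input_prefix, item["Input"].upper())
        grouped[curie].append(item)
    # Mirrors the trailing map_values([.]) in the JQ: wrap each group in an outer list
    return {curie: [items] for curie, items in grouped.items()}


# Abbreviated records; field names follow CTD's raw response shown in this issue
sample_response = [
    {"ChemicalId": "C006303", "GeneId": "836", "Input": "c006303"},
    {"ChemicalId": "D015250", "GeneId": "836", "Input": "d015250"},
    {"ChemicalId": "D015250", "GeneId": "142", "Input": "d015250"},
]

paired = pair_curie_with_response(sample_response, "MESH")
for curie, groups in paired.items():
    print(curie, [item["GeneId"] for item in groups[0]])
```

With this grouping, PARP1 (GeneId 142) is paired only with `MESH:D015250` rather than with the first input ID.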
@tokebe
It's not clear to me how BTE will construct large batch-queries to CTD, and whether we'll need to make adjustments to BTE. I'm specifically thinking about:
- `inputTerms` as a parameter. Will BTE construct these properly (aka not exceed the character limit)?
- `useTemplating: true`, but all x-bte operations do that now - including the CTD ones).

Notes:
- `inputTerms` in the requestBody rather than the parameters of the request. It seems that POST queries only allow the `inputTerms` to be in the parameters OR an uploaded file (tsv-only? `queryFile` and `queryFileColumn` parameter described here and here) `multipart/form-data` encoded file, but how exactly `queryFile` is meant to work with that is beyond me at the moment.

Replying to @tokebe (thanks for the quick reply!) with my thoughts:
I think a safe batch-size is 80 IDs, assuming a 2048 character-max for the GET url.
`2048 = a*x + (x-1) + b = (a+1)*x + (b-1)`
Where:
- `x` is the max number of IDs (round down to nearest integer)
- `a` is the number of characters in each ID (in API's required format)
- `b` is the number of characters in the rest of the url, which depends on the dataset/relationship and input ID namespaces
- `a*x` is for all the ID characters, `(x-1)` is for all the pipe-delimiters
The most crucial number is `a`. **The max number of characters for 1 input ID is 21 for REACT (Pathway) IDs.**
- 10
  - **MESH IDs without prefix**: 1 (C or D) plus 9 characters max according to [bioregistry](https://bioregistry.io/registry/mesh)
  - **NCBIGene IDs without prefix, estimated**: the longest ID I found in my browser history is 9 characters ([106099062](https://www.ncbi.nlm.nih.gov/gene/106099062)). I'm estimating because [bioregistry](https://bioregistry.io/registry/ncbigene) doesn't give a character limit
- 11
  - **OMIM IDs with prefix, estimated**: 5 (`OMIM:`) + 6 characters, based on looking at the [new entries like 620637](https://omim.org/statistics/updates/2023/11). I'm estimating because [bioregistry](https://bioregistry.io/registry/omim) doesn't give a character limit
- 14
  - **KEGG.PATHWAY IDs with custom prefix**: 5 (`KEGG:`) + 9 characters max, based on [bioregistry](https://bioregistry.io/registry/kegg.pathway)
- 15
  - **MESH IDs with prefix**: 5 (`MESH:`) + 10 (explained above)
- 21
  - **REACT IDs with prefix, estimated**: 6 (`REACT:`) + 15 characters, based on looking at the v86 (latest) new/updated topics and pathways like [REACT:R-HSA-9836573.1](https://reactome.org/content/detail/R-HSA-9836573) (Mitochondrial RNA degradation)
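The arithmetic above can be checked with a small Python sketch. It solves `(a+1)*x + (b-1) <= 2048` for `x` and confirms that 80 IDs of 21 characters fit. The "rest of the url" string is my assumption, pieced together from the pathway-to-gene query parameters; parameter order and exact length may differ in practice. I'm also assuming characters are counted unencoded: if the HTTP client percent-encodes `|` or `:`, each one costs 3 characters and the bound shrinks.

```python
def url_length(num_ids: int, id_len: int, rest_len: int) -> int:
    """a*x for the ID characters, (x-1) for the pipe delimiters, b for the rest."""
    return id_len * num_ids + (num_ids - 1) + rest_len


def max_batch_size(url_limit: int, id_len: int, rest_len: int) -> int:
    """Largest integer x satisfying (a+1)*x + (b-1) <= url_limit."""
    return (url_limit - rest_len + 1) // (id_len + 1)


# Assumed "rest of the url" for a pathway -> gene query (inputTerms left empty)
REST_OF_URL = (
    "http://ctdbase.org/tools/batchQuery.go?inputType=pathway&inputTerms="
    "&inputTermSearchType=directAssociations&report=genes_curated&format=json"
)

URL_LIMIT = 2048
LONGEST_ID = 21  # REACT IDs with prefix, the worst case listed above

bound = max_batch_size(URL_LIMIT, LONGEST_ID, len(REST_OF_URL))
print("upper bound on IDs per query:", bound)
print("length with 80 IDs:", url_length(80, LONGEST_ID, len(REST_OF_URL)))
```

Under these assumptions a batch size of 80 leaves some headroom for longer base URLs or slightly longer IDs.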
@tokebe
I'm getting JQ-related errors when I try to test the batch-size limit, using the process in the next section.
Noticed on ci/dev instances, but not test/prod. No overrides, no batch-size-limit-testing adjustments done.
TRAPI query:
```
{
"message": {
"query_graph": {
"edges": {
"e01": {
"subject": "n0",
"object": "n1",
"predicates": ["biolink:related_to"]
}
},
"nodes": {
"n0": {
"ids": ["MESH:D020138"],
"categories": ["biolink:Disease"]
},
"n1": {
"categories": ["biolink:Gene"]
}
}
}
}
}
```
2/3 subqueries fail with `Error: jq: error (at`
Follow the steps in the next section, but don't set the batch-size-limit (step 2 in the next section)
Then do the simple query that works in dev without the override:
```
{
"message": {
"query_graph": {
"edges": {
"e01": {
"subject": "n0",
"object": "n1",
"predicates": ["biolink:related_to"]
}
},
"nodes": {
"n0": {
"ids": ["REACT:R-HSA-5669034", "REACT:R-HSA-5668541"],
"categories": ["biolink:Pathway"]
},
"n1": {
"categories": ["biolink:Gene"]
}
}
}
}
}
```
I'd normally get 134 results, but instead I get 0 results. In the console logs, the sub-query fails with `Error: jq: error (at`
1. Setup: Check out the right branches (either main or dev), `pnpm i`.

2. Adding the batch-size limit to the query-handler's config

To [API_BATCH_SIZE](https://github.com/biothings/bte_trapi_query_graph_handler/blob/c4eb2bb1e2bcc54f60858584dc0dcf71692b78f0/src/config.ts#L1), add:
```
{
  id: '0212611d1c670f9107baf00b77f0889a',
  name: 'CTD API',
  max: 80,
},
```
I actually override to my local file with the branch checked out, but this should do the same thing.

3. Setting an override to use CTD x-bte annotation for batch-querying

Paste into [BTE's smartapi_overrides file](https://github.com/biothings/bte-server/blob/main/src/config/smartapi_overrides.json), so [it'll use this x-bte annotation](https://github.com/NCATS-Tangerine/translator-api-registry/blob/ctd-batch-query/CTD/smartapi.yaml):
```
{
  "conf": {
    "only_overrides": true
  },
  "apis": {
    "0212611d1c670f9107baf00b77f0889a": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/ctd-batch-query/CTD/smartapi.yaml"
  }
}
```

Console log of a sub-query

```
bte:call-apis:query using template builder +0ms
bte:call-apis:query query success, transforming hits->records... +0ms
bte:api-response-transform:index api name CTD API +0ms
bte:api-response-transform:index api tags: translator,ctd +0ms
bte:call-apis:query Failed to make to following query: {"url":"http://ctdbase.org/tools/batchQuery.go","params":{"inputType":"pathway","inputTerms":"REACT:R-HSA-446193|REACT:R-HSA-196780|REACT:R-HSA-9636467|REACT:R-HSA-9033658|REACT:R-HSA-70895|REACT:R-HSA-352238|REACT:R-HSA-168302|REACT:R-HSA-162588|REACT:R-HSA-450385|REACT:R-HSA-8851680|REACT:R-HSA-5621481|REACT:R-HSA-75102|REACT:R-HSA-5218900|REACT:R-HSA-9662834|REACT:R-HSA-5621575|REACT:R-HSA-5690714|REACT:R-HSA-389356|REACT:R-HSA-389357|REACT:R-HSA-389359|REACT:R-HSA-9013148|REACT:R-HSA-68689|REACT:R-HSA-9833576|REACT:R-HSA-69017|REACT:R-HSA-447041|REACT:R-HSA-5607763|REACT:R-HSA-5607764|REACT:R-HSA-5660668|REACT:R-HSA-6811434|REACT:R-HSA-6811436|REACT:R-HSA-6807878|REACT:R-HSA-204005|REACT:R-HSA-140180|REACT:R-HSA-199920|REACT:R-HSA-442742|REACT:R-HSA-442720|REACT:R-HSA-442729|REACT:R-HSA-8874211|REACT:R-HSA-399956|REACT:R-HSA-2024101|REACT:R-HSA-389513|REACT:R-HSA-5358747|REACT:R-HSA-5358749|REACT:R-HSA-5358751|REACT:R-HSA-5358752|REACT:R-HSA-211999|REACT:R-HSA-111996|REACT:R-HSA-1296052|REACT:R-HSA-4086398|REACT:R-HSA-111997|REACT:R-HSA-111932|REACT:R-HSA-2025928|REACT:R-HSA-419812|REACT:R-HSA-111933|REACT:R-HSA-901042|REACT:R-HSA-111957|REACT:R-HSA-72737|REACT:R-HSA-8955332|REACT:R-HSA-5576891|REACT:R-HSA-9733709|REACT:R-HSA-5694530","inputTermSearchType":"directAssociations","report":"genes_curated","format":"json"},"method":"get","timeout":50000,"headers":{"User-Agent":"BTE/dev Node/v18.16.1 darwin"}}. The error is Error: jq: error (at
```
Looks like this is a problem in the JQ string, largely due to CTD's inconsistent response structure depending on if anything was found or not. Working on a fix...
Ok, turns out this was less CTD's inconsistencies and more JQ's inconsistencies (and my lack of familiarity...). I've pushed a fix to dev which should address this.
The fix worked!
I tested all 3 example queries in my previous post in both dev and main (CI) branches. Everything worked as-intended without any errors.
The PRs to deploy are:
Update!
I've included the CTD x-bte changes in the overrides https://github.com/biothings/bte-server/pull/4 - so it'll deploy alongside the orphanet changes. I think the override will end up deploying with or after the code changes (JQ / batch-size-limit), so I don't anticipate any issues. (aka I think NodeNorm will deploy the orphanet changes at the same pace or slower than our deployments to instances).
I think we can close this issue once:
We'll then have a separate process to remove the overrides (not needed once the yaml PRs are all merged / registrations refreshed).
@tokebe
I double-checked and it's not working on CI, probably because of the larger cache-update issues (recent lab Slack convo)
POST to CTD through BTE CI `https://bte.ci.transltr.io/v1/smartapi/0212611d1c670f9107baf00b77f0889a/query`

```
{
  "message": {
    "query_graph": {
      "edges": {
        "e01": {
          "subject": "n0",
          "object": "n1",
          "predicates": ["biolink:related_to"]
        }
      },
      "nodes": {
        "n0": {
          "ids": ["KEGG.PATHWAY:hsa05323", "KEGG.PATHWAY:hsa04917"],
          "categories": ["biolink:Pathway"]
        },
        "n1": {
          "categories": ["biolink:Gene"]
        }
      }
    }
  }
}
```

Based on the logs in the TRAPI response, I can tell that 2 sub-queries were sent (1 ID each). But if batch-querying was working, only 1 sub-query should have been sent. This may mean BTE CI didn't successfully use the override.

```
{
  "timestamp": "2023-12-16T06:08:27.395Z",
  "level": "DEBUG",
  "message": "call-apis: 2 planned queries for edge e01",
  "code": null
},
{
  "timestamp": "2023-12-16T06:08:27.792Z",
  "level": "DEBUG",
  "message": "Successful GET http://ctdbase.org (1 ID): Pathway > has_participant > Gene (obtained 70 records, took 121ms)",
  "code": null
},
{
  "timestamp": "2023-12-16T06:08:27.808Z",
  "level": "DEBUG",
  "message": "Successful GET http://ctdbase.org (1 ID): Pathway > has_participant > Gene (obtained 89 records, took 178ms)",
  "code": null
},
```
Issue should now be addressed by https://github.com/biothings/biothings_explorer/commit/3019cecf670e5b0fc04877c31956b2bbbc3d7e4e, please test again
Now it's working on BTE CI! Yay!
The previous test now works as-intended - with 1 planned batch-query. Logs:
{
"timestamp": "2023-12-18T21:40:08.965Z",
"level": "DEBUG",
"message": "call-apis: 1 planned queries for edge e01",
"code": null
},
{
"timestamp": "2023-12-18T21:40:09.492Z",
"level": "DEBUG",
"message": "Successful GET http://ctdbase.org (2 IDs): Pathway > has_participant > Gene (obtained 159 records, took 181ms)",
"code": null
},
I also tested the batch-size-limit=80 with a 150-QNode-IDs query (current max, see #762), and it worked too. Two sub-queries were sent (80 + 70)
POST to CTD through BTE CI `https://bte.ci.transltr.io/v1/smartapi/0212611d1c670f9107baf00b77f0889a/query` using the attached JSON as the request body: [CTD-150ReactIDs.txt](https://github.com/biothings/biothings_explorer/files/13708873/CTD-150ReactIDs.txt)

Logs show that two sub-queries were sent (80 + 70), so the batch-size-limit of 80 was respected.

```
{
  "timestamp": "2023-12-18T21:41:52.878Z",
  "level": "DEBUG",
  "message": "call-apis: 2 planned queries for edge e01",
  "code": null
},
{
  "timestamp": "2023-12-18T21:42:02.309Z",
  "level": "DEBUG",
  "message": "Successful GET http://ctdbase.org (80 IDs): Pathway > has_participant > Gene (obtained 1703 records, took 195ms)",
  "code": null
},
{
  "timestamp": "2023-12-18T21:42:02.344Z",
  "level": "DEBUG",
  "message": "Successful GET http://ctdbase.org (70 IDs): Pathway > has_participant > Gene (obtained 2603 records, took 290ms)",
  "code": null
},
```
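The 80 + 70 split is what you'd expect from simple fixed-size chunking. Here is a minimal Python sketch of that behavior; it is not BTE's actual batching code, and `chunk_ids` plus the placeholder IDs are made up for illustration.

```python
def chunk_ids(ids: list, batch_size: int) -> list:
    """Split a list of input IDs into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]


# 150 placeholder IDs standing in for the REACT IDs in the attached file
placeholder_ids = [f"REACT:R-HSA-{n}" for n in range(150)]
batches = chunk_ids(placeholder_ids, 80)
print([len(batch) for batch in batches])
```

Each batch then becomes one sub-query, which matches the "2 planned queries" line in the logs.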
I've confirmed that things work as-expected after the Prod deployment. Closing issue, updating the registered yamls and registrations, and opening another issue for removing the overrides.
Example: POST the request body below to `https://bte.transltr.io/v1/smartapi/0212611d1c670f9107baf00b77f0889a/query`. The response will include results and a log saying `Successful GET http://ctdbase.org (2 IDs): Pathway > has_participant > Gene (obtained 159 records, took 215ms)`. This shows that the batch-query occurred.
{
"message": {
"query_graph": {
"edges": {
"e01": {
"subject": "n0",
"object": "n1",
"predicates": ["biolink:related_to"]
}
},
"nodes": {
"n0": {
"ids": ["KEGG.PATHWAY:hsa05323", "KEGG.PATHWAY:hsa04917"],
"categories": ["biolink:Pathway"]
},
"n1": {
"categories": ["biolink:Gene"]
}
}
}
}
}
Intro: see intro section of https://github.com/biothings/BioThings_Explorer_TRAPI/issues/583#issue-1622873383. Originally noted in https://github.com/biothings/BioThings_Explorer_TRAPI/issues/558#issuecomment-1459097534
2. processing batch-queries correctly
The current x-bte-kgs-operations aren't written as batch-queries, even though the CTD API does allow batch-querying.
The problem is how BTE handles the batch-query responses. The API response is an array of associations (objects), and each association is matched to one of the input IDs. Each association has an "Input" field whose value is the matched input ID (all lowercase, with an ID-prefix for diseases (MESH or OMIM) and pathways (REACT or KEGG)).
However, BTE's default api-response-transform doesn't handle this correctly. Instead, it links the first input ID to every possible output ID.
Example:
Edit SmartAPI and run BTE locally

In a local copy of the [SmartAPI yaml](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/CTD/smartapi.yaml), copy-paste the following into the `chemical2gene` operation. It's changing the `supportBatch` and `queryInputs` info.

```
- supportBatch: true
  useTemplating: true
  inputs:
    - id: MESH
      semantic: SmallMolecule
  outputs:
    - id: NCBIGene
      semantic: Gene
  parameters:
    inputType: chem
    inputTerms: "{{ queryInputs | joinSafe('|') }}"
    inputTermSearchType: directAssociations
    report: genes_curated
    format: json
  predicate: related_to
  response_mapping:
    "$ref": "#/components/x-bte-response-mapping/chemical2gene"
```

Set up a local instance of BTE to override and use your local copy of the CTD yaml. Then POST to that specific api (v1/smartapi/{id}/query endpoint):

```
{
  "message": {
    "query_graph": {
      "edges": {
        "e01": {
          "subject": "n0",
          "object": "n1",
          "predicates": ["biolink:related_to"]
        }
      },
      "nodes": {
        "n0": {
          "ids": ["MESH:C006303", "MESH:D015250"],
          "categories": ["biolink:SmallMolecule"]
        },
        "n1": {
          "categories": ["biolink:Gene"]
        }
      }
    }
  }
}
```

CTD's raw response

During execution, BTE should generate [this query with two input IDs](http://ctdbase.org/tools/batchQuery.go?inputType=chem&inputTerms=C006303|D015250&inputTermSearchType=directAssociations&report=genes_curated&format=json) to CTD. In CTD's raw response, some genes are only linked to the second ID D015250 / Aclarubicin, like PARP1.

```
{
  "CasRN": "57576-44-0",
  "ChemicalId": "D015250",
  "ChemicalName": "Aclarubicin",
  "GeneId": "142",
  "GeneSymbol": "PARP1",
  "Input": "d015250",
  "Organism": "Homo sapiens",
  "OrganismId": "9606",
  "PubMedIds": "20399885"
},
```

BTE's current flawed response

BTE links every output gene with only the first ID C006303 / acivicin / `PUBCHEM.COMPOUND:294641`. It's easier to see through the console log:

```
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:836 has 4 +1ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:1080 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:10800 has 1 +1ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:2678 has 3 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:834 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:841 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:1676 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:2623 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:2950 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:3145 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:4778 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:2908 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:142 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:6582 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:6607 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:6647 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:331 has 1 +0ms
```

Desired format for BTE's response

Instead, BTE should correctly link each input ID / entity with its associations. The console log should look like this:

* some results have the first input ID C006303 / acivicin / `PUBCHEM.COMPOUND:294641`
* other results have the second input ID D015250 / Aclarubicin / `PUBCHEM.COMPOUND:451415`
* PARP1 (NCBIGene:142) is only linked to the second ID: `PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:142`. Most genes are linked to only one of the input IDs.

```
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:836 has 1 +1ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:1080 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:10800 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:2678 has 3 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:834 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:836 has 3 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:841 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:1676 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:2623 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:2950 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:3145 has 1 +1ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:4778 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:2908 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:142 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:6582 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:6607 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:6647 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:331 has 1 +0ms
```
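For reference, the two-ID CTD query in this example can be reproduced with a short Python sketch that mimics the `joinSafe('|')` templating: strip the `MESH:` prefix (the chemical query uses bare IDs) and join with pipes. `strip_prefix` and `build_ctd_batch_url` are hypothetical helper names, not BTE functions, and the URL is assembled by hand so the pipes stay unencoded as in the query above.

```python
CTD_BATCH_URL = "http://ctdbase.org/tools/batchQuery.go"


def strip_prefix(curie: str) -> str:
    """Drop the CURIE prefix, e.g. MESH:C006303 -> C006303."""
    return curie.split(":", 1)[1] if ":" in curie else curie


def build_ctd_batch_url(input_terms: list, input_type: str, report: str) -> str:
    """Assemble a CTD batch-query GET URL with pipe-delimited input terms."""
    params = [
        ("inputType", input_type),
        ("inputTerms", "|".join(input_terms)),
        ("inputTermSearchType", "directAssociations"),
        ("report", report),
        ("format", "json"),
    ]
    # Joined manually (no percent-encoding) to match the URL shown in this issue
    return CTD_BATCH_URL + "?" + "&".join(f"{key}={value}" for key, value in params)


query_ids = ["MESH:C006303", "MESH:D015250"]
url = build_ctd_batch_url([strip_prefix(c) for c in query_ids], "chem", "genes_curated")
print(url)
```

Pathway queries would skip `strip_prefix`, since the CTD sub-queries logged elsewhere in this thread keep the `REACT:` prefix in `inputTerms`.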