CTD processing 2: batch-queries

colleenXu commented 1 year ago

Intro: see intro section of https://github.com/biothings/BioThings_Explorer_TRAPI/issues/583#issue-1622873383. Originally noted in https://github.com/biothings/BioThings_Explorer_TRAPI/issues/558#issuecomment-1459097534

2. processing batch-queries correctly

The current x-bte-kgs-operations aren't written as batch-queries, even though the CTD API does allow batch-querying.

The problem is how BTE handles the batch-query responses. The API response is an array of associations (objects), and each association is matched to one of the input IDs. Each association has an "Input" field whose value is the matched input ID: all lowercase, with an ID-prefix for diseases (MESH or OMIM) and pathways (REACT or KEGG).

However, BTE's default api-response-transform isn't correctly handling this - instead, it's linking the first input ID to every possible output ID.

Example:

Edit SmartAPI and run BTE locally

In a local copy of the [SmartAPI yaml](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/CTD/smartapi.yaml), copy-paste the following into the `chemical2gene` operation. It changes the `supportBatch` and `queryInputs` info.

```
- supportBatch: true
  useTemplating: true
  inputs:
    - id: MESH
      semantic: SmallMolecule
  outputs:
    - id: NCBIGene
      semantic: Gene
  parameters:
    inputType: chem
    inputTerms: "{{ queryInputs | joinSafe('|') }}"
    inputTermSearchType: directAssociations
    report: genes_curated
    format: json
  predicate: related_to
  response_mapping:
    "$ref": "#/components/x-bte-response-mapping/chemical2gene"
```

Set up a local instance of BTE to override and use your local copy of the CTD yaml. Then POST to that specific API (the `v1/smartapi/{id}/query` endpoint):

```
{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"]
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["MESH:C006303", "MESH:D015250"],
                    "categories": ["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                }
            }
        }
    }
}
```

CTD's raw response

During execution, BTE should generate [this query with two input IDs](http://ctdbase.org/tools/batchQuery.go?inputType=chem&inputTerms=C006303|D015250&inputTermSearchType=directAssociations&report=genes_curated&format=json) to CTD. In CTD's raw response, some genes are only linked to the second ID D015250 / Aclarubicin, like PARP1.

```
  {
    "CasRN": "57576-44-0",
    "ChemicalId": "D015250",
    "ChemicalName": "Aclarubicin",
    "GeneId": "142",
    "GeneSymbol": "PARP1",
    "Input": "d015250",
    "Organism": "Homo sapiens",
    "OrganismId": "9606",
    "PubMedIds": "20399885"
  },
```
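As a rough illustration of how that batch URL is assembled, here is a minimal Python sketch. It is not BTE's code: the function name `build_ctd_batch_url` is hypothetical, and it assumes the pipe and colon characters are left unescaped, as they are in the URLs quoted in this issue.

```python
from urllib.parse import urlencode

CTD_BATCH_URL = "http://ctdbase.org/tools/batchQuery.go"


def build_ctd_batch_url(input_type, terms, report="genes_curated"):
    """Hypothetical helper: join the input IDs with pipes into one GET URL."""
    params = {
        "inputType": input_type,
        "inputTerms": "|".join(terms),
        "inputTermSearchType": "directAssociations",
        "report": report,
        "format": "json",
    }
    # Assumption: keep '|' and ':' literal so the URL matches the ones in this issue.
    return f"{CTD_BATCH_URL}?{urlencode(params, safe='|:')}"


url = build_ctd_batch_url("chem", ["C006303", "D015250"])
print(url)
```

With the two chemical IDs from the example, this should reproduce the same query string as the linked URL.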

BTE's current flawed response

BTE links every output gene with only the first ID C006303 / acivicin / `PUBCHEM.COMPOUND:294641`. It's easier to see through the console log:

```
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:836 has 4 +1ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:1080 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:10800 has 1 +1ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:2678 has 3 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:834 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:841 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:1676 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:2623 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:2950 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:3145 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:4778 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:2908 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:142 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:6582 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:6607 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:6647 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:331 has 1 +0ms
```

desired format for BTE's response

Instead, BTE should correctly link each input ID / entity with its associations. The console log should look like this:

* some results have the first input ID C006303 / acivicin / `PUBCHEM.COMPOUND:294641`
* other results have the second input ID D015250 / Aclarubicin / `PUBCHEM.COMPOUND:451415`
* PARP1 (NCBIGene:142) is only linked to the second ID: `PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:142`. Most genes are linked to only one of the input IDs.

```
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:836 has 1 +1ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:1080 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:10800 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:294641_&_n1-NCBIGene:2678 has 3 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:834 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:836 has 3 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:841 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:1676 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:2623 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:2950 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:3145 has 1 +1ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:4778 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:2908 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:142 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:6582 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:6607 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:6647 has 1 +0ms
bte:biothings-explorer-trapi:QueryResult result ID: n0-PUBCHEM.COMPOUND:451415_&_n1-NCBIGene:331 has 1 +0ms
```

rjawesome commented 1 year ago

This should be solvable with a custom pairCurieWithAPIResponse function. I can work on this in the JQ and/or JavaScript transformer for CTD.

rjawesome commented 1 year ago

Here is the pairCurieWithAPIResponse JQ that solves this problem:

    reduce (.response | .[]) as $item (
      {};
      .[generateCurie($edge.association.input_id; $item.Input | ascii_upcase)] =
        [] + .[generateCurie($edge.association.input_id; $item.Input | ascii_upcase)] + [$item]
    ) | map_values([.])

I will push it to the JQ branch shortly, but I need to double-check that the "Input" field is present in all queries to CTD. (This pair function could also be set in the yaml for an operation via transformer.pair_jq.)
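For readers less familiar with jq, the pairing logic in that expression can be sketched in Python roughly as follows. This is an illustration only: `generate_curie` is a hypothetical stand-in for BTE's jq helper `generateCurie` (its real behaviour may differ), the sample records are abbreviated from the example above, and the check that skips records without a string "Input" is an extra guard that the jq expression does not have.

```python
from collections import defaultdict


def generate_curie(prefix, value):
    """Hypothetical stand-in for generateCurie: add the prefix unless present."""
    if value.upper().startswith(prefix.upper() + ":"):
        return value
    return f"{prefix}:{value}"


def pair_by_input(response, input_prefix):
    """Group associations under the CURIE of the input ID they matched."""
    grouped = defaultdict(list)
    for item in response or []:
        raw = item.get("Input")
        if not isinstance(raw, str):
            continue  # added guard (assumption): skip records with no usable Input
        grouped[generate_curie(input_prefix, raw.upper())].append(item)
    # Mirrors the trailing map_values([.]): wrap each group in one more list.
    return {curie: [items] for curie, items in grouped.items()}


# Abbreviated, illustrative records for the two chemicals in the example.
sample = [
    {"ChemicalId": "C006303", "GeneId": "836", "Input": "c006303"},
    {"ChemicalId": "D015250", "GeneId": "836", "Input": "d015250"},
    {"ChemicalId": "D015250", "GeneId": "142", "Input": "d015250"},
]
paired = pair_by_input(sample, "MESH")
print(sorted(paired))
```

Each input CURIE then carries only its own associations, so PARP1 (GeneId 142) ends up under `MESH:D015250` and not under the first input.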

colleenXu commented 1 year ago

@tokebe

It's not clear to me how BTE will construct large batch-queries to CTD, and whether we'll need to make adjustments to BTE. I'm specifically thinking about:

- url character-limits: the batch-queries are GET requests w/ inputTerms as a parameter. Will BTE construct these properly (aka not exceed the character limit)?
- max batch size: I think CTD has a batch-size-limit of 4000 IDs. I'm not sure if putting this limit in BTE's query-handler here will ensure that BTE doesn't exceed this batch-size limit, since the comment on that line implies that it's just for pending BioThings APIs that do POST queries (also with useTemplating: true, but all x-bte operations do that now - including the CTD ones).

Notes:

- we don't do batch-querying for any of the other external APIs
- this endpoint also accepts POST queries, but I haven't figured out a way to do a POST query AND put the inputTerms in the requestBody rather than the parameters of the request. It seems that POST queries only allow the inputTerms to be in the parameters OR an uploaded file (tsv-only? queryFile and queryFileColumn parameter described here and here)

tokebe commented 1 year ago

- I don't believe BTE presently controls batch size with respect to URL character limit. This is an enhancement we should probably add. For now, it should be possible to reason about the maximum we could fit in a URL and set a conservative batch size by that.
- That batch size limit works for everything; you can set it for any SmartAPI ID. The comment (AFAIK) was a comment you added to explain the purpose of the current entries.
- I did some playing around trying to figure out the intended method for POST batch queries, but the documentation is rather unclear. I think it's meant to be a multipart/form-data encoded file, but how exactly queryFile is meant to work with that is beyond me at the moment.
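To make the possible enhancement concrete, here is a minimal Python sketch of batching by URL length instead of by a fixed ID count. It is not how BTE works today: the function name, the `base_len` argument (length of the URL without the ID list), and the 2048-character default are all assumptions for illustration.

```python
def chunk_ids_by_url_length(ids, base_len, max_url_len=2048):
    """Greedily pack pipe-delimited IDs so each URL stays within max_url_len."""
    batches, current, current_len = [], [], base_len
    for curie in ids:
        # Every ID after the first in a batch also costs one '|' delimiter.
        added = len(curie) + (1 if current else 0)
        if current and current_len + added > max_url_len:
            batches.append(current)
            current, current_len = [], base_len
            added = len(curie)
        # Note: a single ID that is too long on its own still gets its own batch.
        current.append(curie)
        current_len += added
    if current:
        batches.append(current)
    return batches


ids = ["REACT:R-HSA-5669034"] * 200  # 19 characters each, repeated for illustration
print([len(batch) for batch in chunk_ids_by_url_length(ids, base_len=140)])
```

A fixed, conservative batch size is simpler to configure, while this approach would adapt to the actual ID lengths in each query.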

colleenXu commented 1 year ago

Replying to @tokebe (thanks for the quick reply!) with my thoughts:

- I think setting the batch-size by the minimum we can fit in a URL makes sense? Aka take the operations with the longest IDs (ones that keep the OMIM or MESH prefix probably) and do some rough calculations on how many of those can fit in the limit...
- I'm still not clear on what would be BTE or CTD's URL character limit....do we know?
- I'm not sure how to easily test the batch-size-limit after we set it up...
- on the POST method...I don't think BTE is set up to generate those kinds of requests (aka send a file), right?

colleenXu commented 11 months ago

I think a safe batch-size is 80 IDs, assuming a 2048 character-max for the GET url.

Rough calculations

`2048 = a*x + (x-1) + b = (a+1)*x + (b-1)`

Where:

- `x` is the max number of IDs (round down to nearest integer)
- `a` is the number of characters in each ID (in API's required format)
- `b` is the number of characters in the rest of the url, which depends on the dataset/relationship and input ID namespaces
- `a*x` is for all the ID characters, `(x-1)` is for all the pipe-delimiters

The most crucial number is `a`. **The max number of characters for 1 input ID is 21 for REACT (Pathway) IDs.**

Character counts for all input IDs:

- 10
  - **MESH IDs without prefix**: 1 (C or D) plus 9 characters max according to [bioregistry](https://bioregistry.io/registry/mesh)
  - **NCBIGene IDs without prefix, estimated**: the longest ID I found in my browser history is 9 characters, [106099062](https://www.ncbi.nlm.nih.gov/gene/106099062). I'm estimating because [bioregistry](https://bioregistry.io/registry/ncbigene) doesn't give a character limit
- 11
  - **OMIM IDs with prefix, estimated**: 5 (`OMIM:`) + 6 characters, based on looking at the [new entries like 620637](https://omim.org/statistics/updates/2023/11). I'm estimating because [bioregistry](https://bioregistry.io/registry/omim) doesn't give a character limit
- 14
  - **KEGG.PATHWAY IDs with custom prefix**: 5 (`KEGG:`) + 9 characters max, based on [bioregistry](https://bioregistry.io/registry/kegg.pathway)
- 15
  - **MESH IDs with prefix**: 5 (`MESH:`) + 10 (explained above)
- 21
  - **REACT IDs with prefix, estimated**: 6 (`REACT:`) + 15 characters, based on looking at the v86 (latest) new/updated topics and pathways like [REACT:R-HSA-9836573.1](https://reactome.org/content/detail/R-HSA-9836573) (Mitochondrial RNA degradation)

`b = 140` for the 1 x-bte operation that uses REACT IDs as input. (An example GET url with 2 input IDs is: `http://ctdbase.org/tools/batchQuery.go?inputType=pathway&inputTermSearchType=directAssociations&report=genes_curated&format=json&inputTerms=REACT:R-HSA-5669034|REACT:R-HSA-5668541`)

So the equation for this situation is: `2048 = (a+1)*x + (b-1) = (21+1)*x + (140-1) = 22*x + 139`, so `x ~ 86`. Rounding down to the nearest ten gets 80.
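The same arithmetic as a small Python sketch, for anyone who wants to redo it with a different limit or ID length. The function name `max_batch_size` is made up for this example, and the 2048-character limit is the assumption stated above, not a confirmed CTD or BTE limit.

```python
# Everything in the example URL except the ID list itself.
URL_WITHOUT_IDS = (
    "http://ctdbase.org/tools/batchQuery.go?inputType=pathway"
    "&inputTermSearchType=directAssociations&report=genes_curated"
    "&format=json&inputTerms="
)


def max_batch_size(url_limit, id_len, base_len):
    """Largest x satisfying id_len*x + (x-1) + base_len <= url_limit."""
    return (url_limit - base_len + 1) // (id_len + 1)


b = len(URL_WITHOUT_IDS)        # characters in the rest of the url
x = max_batch_size(2048, 21, b)  # a = 21 for REACT IDs with prefix
safe_batch_size = (x // 10) * 10  # round down to the nearest ten
print(b, x, safe_batch_size)
```

Rounding down leaves some headroom in case an ID turns out to be longer than the 21-character estimate.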

colleenXu commented 11 months ago

@tokebe

I'm getting JQ-related errors when I try to test the batch-size limit, using the process in the next section.

If I start with the main branches, things seem to work okay. 1 of the 4 sub-queries fails, but that kind of error seems to be happening on dev/ci when I'm not testing the batch-size limit too.

Recreating the error with a simpler example, not testing the batch-size-limit

Noticed on ci/dev instances, but not test/prod. No overrides, no batch-size-limit-testing adjustments done. TRAPI query:

```
{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"]
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["MESH:D020138"],
                    "categories": ["biolink:Disease"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                }
            }
        }
    }
}
```

2/3 subqueries fail with `Error: jq: error (at :0): Cannot iterate over null (null)`: see full console logs [ctd-error-1.txt](https://github.com/biothings/biothings_explorer/files/13510911/ctd-error-1.txt)

Interestingly, I think those two sub-queries are returning 0 hits: [this](http://ctdbase.org/tools/batchQuery.go?inputType=disease&inputTermSearchType=directAssociations&report=genes_curated&format=json&inputTerms=MESH:C566403) and [this](http://ctdbase.org/tools/batchQuery.go?inputType=disease&inputTermSearchType=directAssociations&report=genes_curated&format=json&inputTerms=OMIM:603174), vs [the 3rd sub-query that has hits](http://ctdbase.org/tools/batchQuery.go?inputType=disease&inputTermSearchType=directAssociations&report=genes_curated&format=json&inputTerms=MESH:D020138)

If I start with the dev branches, I encounter errors after doing the SmartAPI override (see step 6 in the next section). However, I also encounter this kind of error when I don't set the batch-size-limit (step 2) and when I use a simpler 2-ID query that normally works in dev (w/o the override).

recreating the problem with a simple query

Follow the steps in the next section, but don't set the batch-size-limit (step 2 in the next section). Then do the simple query that works in dev without the override:

```
{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"]
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["REACT:R-HSA-5669034", "REACT:R-HSA-5668541"],
                    "categories": ["biolink:Pathway"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                }
            }
        }
    }
}
```

I'd normally get 134 results, but instead I get 0 results. In the console logs, the sub-query fails with `Error: jq: error (at :0): explode input must be a string`. The full console logs are: [simple-ctd-error-dev.txt](https://github.com/biothings/biothings_explorer/files/13511334/simple-ctd-error-dev.txt)

My full process to test the batch-size-limit

1. Setup: Check out the right branches (either main or dev), `pnpm i`.

2. Adding the batch-size limit to the query-handler's config

To [API_BATCH_SIZE](https://github.com/biothings/bte_trapi_query_graph_handler/blob/c4eb2bb1e2bcc54f60858584dc0dcf71692b78f0/src/config.ts#L1), add:

```
  {
    id: '0212611d1c670f9107baf00b77f0889a',
    name: 'CTD API',
    max: 80,
  },
```

3. Setting an override to use CTD x-bte annotation for batch-querying

I actually override to my local file with the branch checked out, but this should do the same thing. Paste into [BTE's smartapi_overrides file](https://github.com/biothings/bte-server/blob/main/src/config/smartapi_overrides.json), so [it'll use this x-bte annotation](https://github.com/NCATS-Tangerine/translator-api-registry/blob/ctd-batch-query/CTD/smartapi.yaml):

```
{
  "conf": {
    "only_overrides": true
  },
  "apis": {
    "0212611d1c670f9107baf00b77f0889a": "https://raw.githubusercontent.com/NCATS-Tangerine/translator-api-registry/ctd-batch-query/CTD/smartapi.yaml"
  }
}
```

4. `pnpm build`, then `API_OVERRIDE=true pnpm run smartapi_sync` to set up BTE with the changes and get the x-bte info

5. Run BTE, then query CTD thru BTE (`http://localhost:3000/v1/smartapi/0212611d1c670f9107baf00b77f0889a/query`) with this request body [trapi_300react.txt](https://github.com/biothings/biothings_explorer/files/13510446/trapi_300react.txt). It's a TRAPI query for 300 REACT IDs (Pathway) -> Gene. BTE then runs 4 sub-queries, which is correct (3*80 + 60).
   1. Note: All the IDs are real IDs for human pathways (from [Reactome](https://reactome.org/download-data)'s Complete List of Pathways), but CTD may not have data for them.

6. If I start with the dev instances and run that query, all the sub-queries fail with the message `The error is Error: jq: error (at :0): explode input must be a string`. Full console logs: [console-300react.txt](https://github.com/biothings/biothings_explorer/files/13510534/console-300react.txt)
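The expected sub-query count in step 5 follows from splitting the ID list into fixed-size batches. A minimal Python sketch of that split is below; it only illustrates the arithmetic, is not BTE's actual batching code, and uses placeholder IDs.

```python
def split_into_batches(ids, max_size):
    """Split ids into consecutive batches of at most max_size items."""
    return [ids[i:i + max_size] for i in range(0, len(ids), max_size)]


placeholder_ids = [f"ID{i}" for i in range(300)]  # stand-ins for the 300 REACT IDs
batch_sizes = [len(batch) for batch in split_into_batches(placeholder_ids, 80)]
print(batch_sizes)  # [80, 80, 80, 60]
```

So a 300-ID query with a limit of 80 should produce 4 sub-queries, matching what BTE does here.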

Console log of a sub-query

```
bte:call-apis:query using template builder +0ms
bte:call-apis:query query success, transforming hits->records... +0ms
bte:api-response-transform:index api name CTD API +0ms
bte:api-response-transform:index api tags: translator,ctd +0ms
bte:call-apis:query Failed to make to following query: {"url":"http://ctdbase.org/tools/batchQuery.go","params":{"inputType":"pathway","inputTerms":"REACT:R-HSA-446193|REACT:R-HSA-196780|REACT:R-HSA-9636467|REACT:R-HSA-9033658|REACT:R-HSA-70895|REACT:R-HSA-352238|REACT:R-HSA-168302|REACT:R-HSA-162588|REACT:R-HSA-450385|REACT:R-HSA-8851680|REACT:R-HSA-5621481|REACT:R-HSA-75102|REACT:R-HSA-5218900|REACT:R-HSA-9662834|REACT:R-HSA-5621575|REACT:R-HSA-5690714|REACT:R-HSA-389356|REACT:R-HSA-389357|REACT:R-HSA-389359|REACT:R-HSA-9013148|REACT:R-HSA-68689|REACT:R-HSA-9833576|REACT:R-HSA-69017|REACT:R-HSA-447041|REACT:R-HSA-5607763|REACT:R-HSA-5607764|REACT:R-HSA-5660668|REACT:R-HSA-6811434|REACT:R-HSA-6811436|REACT:R-HSA-6807878|REACT:R-HSA-204005|REACT:R-HSA-140180|REACT:R-HSA-199920|REACT:R-HSA-442742|REACT:R-HSA-442720|REACT:R-HSA-442729|REACT:R-HSA-8874211|REACT:R-HSA-399956|REACT:R-HSA-2024101|REACT:R-HSA-389513|REACT:R-HSA-5358747|REACT:R-HSA-5358749|REACT:R-HSA-5358751|REACT:R-HSA-5358752|REACT:R-HSA-211999|REACT:R-HSA-111996|REACT:R-HSA-1296052|REACT:R-HSA-4086398|REACT:R-HSA-111997|REACT:R-HSA-111932|REACT:R-HSA-2025928|REACT:R-HSA-419812|REACT:R-HSA-111933|REACT:R-HSA-901042|REACT:R-HSA-111957|REACT:R-HSA-72737|REACT:R-HSA-8955332|REACT:R-HSA-5576891|REACT:R-HSA-9733709|REACT:R-HSA-5694530","inputTermSearchType":"directAssociations","report":"genes_curated","format":"json"},"method":"get","timeout":50000,"headers":{"User-Agent":"BTE/dev Node/v18.16.1 darwin"}}. The error is Error: jq: error (at :0): explode input must be a string
bte:call-apis:query  with Error: jq: error (at :0): explode input must be a string
bte:call-apis:query
bte:call-apis:query     at ChildProcess. (/Users/colleenxu/Desktop/BTE_typescript_pnpm/biothings_explorer/node_modules/.pnpm/node-jq@4.2.2/node_modules/node-jq/lib/exec.js:31:35)
bte:call-apis:query     at ChildProcess.emit (node:events:513:28)
bte:call-apis:query     at ChildProcess.emit (node:domain:489:12)
bte:call-apis:query     at maybeClose (node:internal/child_process:1091:16)
bte:call-apis:query     at ChildProcess._handle.onexit (node:internal/child_process:302:5)
bte:call-apis:query     at Process.callbackTrampoline (node:internal/async_hooks:130:17) +24ms
```

tokebe commented 11 months ago

Looks like this is a problem in the JQ string, largely due to CTD's inconsistent response structure depending on if anything was found or not. Working on a fix...

tokebe commented 11 months ago

Ok, turns out this was less CTD's inconsistencies and more JQ's inconsistencies (and my lack of familiarity...). I've pushed a fix to dev which should address this.

colleenXu commented 11 months ago

The fix worked!

I tested all 3 example queries in my previous post in both dev and main (CI) branches. Everything worked as-intended without any errors.

The PRs to deploy are:

- add the batch-size limit for CTD https://github.com/biothings/bte_trapi_query_graph_handler/pull/173
  - should work with current x-bte annotation (supportBatch: false): I tested and it seemed to work as-intended, ignoring this config entry
  - So we should be able to merge this PR when we want to
- adjust CTD x-bte annotation for batch-querying https://github.com/NCATS-Tangerine/translator-api-registry/pull/134
  - WAIT until all code changes are in Prod before merging this PR / refreshing the registration. JQ and the batch-size-limit PRs are required for BTE to execute the updated x-bte annotation...

colleenXu commented 11 months ago

Update!

I've included the CTD x-bte changes in the overrides https://github.com/biothings/bte-server/pull/4 - so it'll deploy alongside the orphanet changes. I think the override will end up deploying with or after the code changes (JQ / batch-size-limit), so I don't anticipate any issues. (aka I think NodeNorm will deploy the orphanet changes at the same pace or slower than our deployments to instances).

colleenXu commented 11 months ago

I think we can close this issue once:

- the code changes (JQ/batch-size-limit) + overrides are deployed to Prod
- I merge the yaml PR

We'll then have a separate process to remove the overrides (not needed once the yaml PRs are all merged / registrations refreshed).

colleenXu commented 10 months ago

@tokebe

I double-checked and it's not working on CI, probably because of the larger cache-update issues (recent lab Slack convo)

My test

POST to CTD through BTE CI `https://bte.ci.transltr.io/v1/smartapi/0212611d1c670f9107baf00b77f0889a/query`

```
{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"]
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["KEGG.PATHWAY:hsa05323", "KEGG.PATHWAY:hsa04917"],
                    "categories": ["biolink:Pathway"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                }
            }
        }
    }
}
```

Based on the logs in the TRAPI response, I can tell that 2 sub-queries were sent (1 ID each). But if batch-querying was working, only 1 sub-query should have been sent. This may mean BTE CI didn't successfully use the override.

```
        {
            "timestamp": "2023-12-16T06:08:27.395Z",
            "level": "DEBUG",
            "message": "call-apis: 2 planned queries for edge e01",
            "code": null
        },
        {
            "timestamp": "2023-12-16T06:08:27.792Z",
            "level": "DEBUG",
            "message": "Successful GET http://ctdbase.org (1 ID): Pathway > has_participant > Gene (obtained 70 records, took 121ms)",
            "code": null
        },
        {
            "timestamp": "2023-12-16T06:08:27.808Z",
            "level": "DEBUG",
            "message": "Successful GET http://ctdbase.org (1 ID): Pathway > has_participant > Gene (obtained 89 records, took 178ms)",
            "code": null
        },
```

tokebe commented 10 months ago

Issue should now be addressed by https://github.com/biothings/biothings_explorer/commit/3019cecf670e5b0fc04877c31956b2bbbc3d7e4e, please test again

colleenXu commented 10 months ago

Now it's working on BTE CI! Yay!

The previous test now works as-intended - with 1 planned batch-query. Logs:

        {
            "timestamp": "2023-12-18T21:40:08.965Z",
            "level": "DEBUG",
            "message": "call-apis: 1 planned queries for edge e01",
            "code": null
        },
        {
            "timestamp": "2023-12-18T21:40:09.492Z",
            "level": "DEBUG",
            "message": "Successful GET http://ctdbase.org (2 IDs): Pathway > has_participant > Gene (obtained 159 records, took 181ms)",
            "code": null
        },

I also tested the batch-size-limit=80 with a 150-QNode-IDs query (current max, see #762), and it worked too. Two sub-queries were sent (80 + 70)

Batch-size-limit test

POST to CTD through BTE CI `https://bte.ci.transltr.io/v1/smartapi/0212611d1c670f9107baf00b77f0889a/query` using the attached JSON as the request body: [CTD-150ReactIDs.txt](https://github.com/biothings/biothings_explorer/files/13708873/CTD-150ReactIDs.txt)

Logs show that two sub-queries were sent (80 + 70), so the batch-size-limit of 80 was respected.

```
        {
            "timestamp": "2023-12-18T21:41:52.878Z",
            "level": "DEBUG",
            "message": "call-apis: 2 planned queries for edge e01",
            "code": null
        },
        {
            "timestamp": "2023-12-18T21:42:02.309Z",
            "level": "DEBUG",
            "message": "Successful GET http://ctdbase.org (80 IDs): Pathway > has_participant > Gene (obtained 1703 records, took 195ms)",
            "code": null
        },
        {
            "timestamp": "2023-12-18T21:42:02.344Z",
            "level": "DEBUG",
            "message": "Successful GET http://ctdbase.org (70 IDs): Pathway > has_participant > Gene (obtained 2603 records, took 290ms)",
            "code": null
        },
```
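If you'd rather not read the logs by eye, a check like this can be scripted. The Python sketch below pulls the per-sub-query ID counts out of TRAPI log entries; the regular expression is only inferred from the log messages quoted in this thread, so treat the message format as an assumption and not a stable interface.

```python
import re

# Assumed message shape, e.g. "Successful GET http://ctdbase.org (80 IDs): ..."
SUCCESS_PATTERN = re.compile(r"Successful GET \S+ \((\d+) IDs?\)")


def batch_sizes_from_logs(logs):
    """Return the number of IDs sent in each successful sub-query."""
    sizes = []
    for entry in logs:
        match = SUCCESS_PATTERN.search(entry.get("message") or "")
        if match:
            sizes.append(int(match.group(1)))
    return sizes


logs = [
    {"level": "DEBUG", "message": "call-apis: 2 planned queries for edge e01"},
    {"level": "DEBUG", "message": "Successful GET http://ctdbase.org (80 IDs): Pathway > has_participant > Gene (obtained 1703 records, took 195ms)"},
    {"level": "DEBUG", "message": "Successful GET http://ctdbase.org (70 IDs): Pathway > has_participant > Gene (obtained 2603 records, took 290ms)"},
]
sizes = batch_sizes_from_logs(logs)
print(sizes, max(sizes) <= 80)
```

Here `logs` would come from the `logs` array of the TRAPI response; the entries shown are copied from the test above.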

colleenXu commented 8 months ago

I've confirmed that things work as-expected after the Prod deployment. Closing issue, updating the registered yamls and registrations, and opening another issue for removing the overrides.

Example: a POST to `https://bte.transltr.io/v1/smartapi/0212611d1c670f9107baf00b77f0889a/query` with the request body below returns a response with results and a log saying `Successful GET http://ctdbase.org (2 IDs): Pathway > has_participant > Gene (obtained 159 records, took 215ms)`. This shows that the batch-query occurred.

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"]
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["KEGG.PATHWAY:hsa05323", "KEGG.PATHWAY:hsa04917"],
                    "categories": ["biolink:Pathway"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                }
            }
        }
    }
}

biothings / biothings_explorer

CTD processing 2: batch-queries #584