Investigating Creative Mode queries which nearly always go over time

tokebe commented 5 months ago

As it stands, BTE is able to handle many simultaneous queries without any failures or going over time. However, some queries are prone to failure by timeout and may affect assessments of BTE's concurrency capability. These failures seem to be due to subquery time, as a profile of some suspect queries shows a majority of system time spent waiting, with hard computation contributing little to overall execution time. These should be investigated further so we can identify good ways to avoid timeout scenarios.

I have 3 example queries which have a tendency to time out. I've captured them by running with no time limit on my local, mirroring CI's environment for smartapi specs/etc. All example queries so far are from Drug -> Treats -> Disease.

MONDO:0019609

This query occasionally does complete within 5 minutes, but appears to fail frequently, especially if there's any other pressure on the server.

Template 1: 59 Queries, 8 results Template 2: 347 Queries, 4137 results

Overall:

API	Times Called	Time Spent Waiting (in seconds)
https://mychem.info/v1	10	11.735
https://biothings.ncats.io/idisk	2	1.390
https://biothings.ncats.io/gtrx	2	0.841
https://biothings.ncats.io/semmeddb	336	246.721
https://biothings.ncats.io/ttd	4	2.291
https://biothings.ncats.io/suppkg	2	2.504
https://biothings.ncats.io/text_mining_targeted_association	11	35.023
https://api.monarchinitiative.org/api	38	246.361
https://mydisease.info/v1	1	0.529

MONDO:0005377

Template 1: 60 queries, 420 results Template 2: 211 queries, 4829 results

Overall:

API	Times Called	Time Spent Waiting (in seconds)
https://biothings.ncats.io/text_mining_targeted_association	10	67.979
https://mychem.info/v1	10	13.664
https://biothings.ncats.io/idisk	2	1.577
https://biothings.ncats.io/ttd	4	2.925
https://biothings.ncats.io/gtrx	2	1.513
https://biothings.ncats.io/semmeddb	192	199.482
https://biothings.ncats.io/suppkg	2	1.823
https://api.monarchinitiative.org/api	48	1336
https://mydisease.info/v1	1	0.403

MONDO:0016575

Template 1: 61 queries, 192 results Template 2: 240 queries, 4460

Overall:

API	Times Called	Time Spent Waiting (in seconds)
https://mychem.info/v1	10	14.768
https://biothings.ncats.io/text_mining_targeted_association	12	34.629
https://biothings.ncats.io/idisk	2	1.376
https://biothings.ncats.io/semmeddb	216	193.422
https://biothings.ncats.io/gtrx	2	1.037
https://biothings.ncats.io/ttd	4	2.895
https://biothings.ncats.io/suppkg	2	2.49
https://api.monarchinitiative.org/api	52	1126
https://mydisease.info/v1	1	0.503

Notes

For each table, I rounded seconds to the nearest millisecond. It's also worth noting that time spent waiting is not representative of overall time contribution to execution, as sub-queries can happen simultaneously. However, a high ratio of time spent waiting to times called heavily implies significant contribution to overall query time.

More analysis remains to be done, however we can see that Monarch/Biolink is the top contributor to overall time. A single query execution in BTE can make 32 simultaneous subqueries, meaning an average minimum time contribution of ~23.7 seconds by Monarch/Biolink, with the other major-looking players being much smaller by comparison (~6.65 from SemmedDB, and ~4.24 from Text Mining Targeted).

It's also worth noting that time taken includes both query time and response transformation time, so it's possible that specific queries take a long time to transform, rather than the API taking a long time to respond, however previous infromation from Sentry makes this appear unlikely.

Next up I'll see if I can identify any promiscuous nodes causing these large numbers of queries.

tokebe commented 5 months ago

It's also worth noting that theoretical maximum speed (total time for all queries / 32 subqueries at a time) doesn't give an accurate picture of actual total subquery time. I'll look more into this as well.

tokebe commented 5 months ago

Just posting some rankings, the count is the number of times the node was referenced as an edge's subject or object. I've truncated it to just 100+ references:

MONDO:0019609	MONDO:0005377	MONDO:0016575

I'll create a more useful tabulation later -- the first of each is obviously just the query ID, but it appears that we have some fairly highly-used nodes, which mostly appear to come from phenotypes in Template-2, which generally makes sense. I'll probably need to compare the info so far to a query which doesn't usually time out, though I suspect the answer will generally come down to "It doesn't hit Template 2".

biothings / biothings_explorer