frostyfan109 / tranql

A Translator Query Language
https://researchsoftwareinstitute.github.io/data-translator/apps/tranql
MIT License
0 stars 1 forks source link

Figure out why the ICEES schema is always being requested #112

Closed frostyfan109 closed 5 years ago

frostyfan109 commented 5 years ago

It gets requested an excessive amount of times

frostyfan109 commented 5 years ago

For example, when the Workflow 5 query is run, observe the logs made in the backplane, I counted a total of around 7 requests to /clinical/icees/schema.

frostyfan109 commented 5 years ago

In reality, all the schemas are likely being fetched this many times by the API, it's just that because the ICEES schema goes through the backplane, and therefore gets logged, we only notice the ICEES schema being requested lots.

frostyfan109 commented 5 years ago

The above comment gets it right but doesn't explain the full story. Every time that /tranql/query is invoked, the interpreter has to construct a new Schema object from tranql_schema.py. The Schema object's constructor in turn has to fetch every reasoner's schema before the query can proceed. As the above comment states, the reason ICEES is singled out is because it is the only reasoner whose schema has to be retrieved by the backplane, and therefore it is the only reasoner whose schema request gets logged in the backplane.

This is also why the /tranql/reasonerURLs endpoint takes so long to load. It really just wants to get the valid values in the FROM clause, but in order to do that it has to construct a Schema object as well, which then has to fetch every reasoner's schema, even though it doesn't care about each reasoner's schema.

The fetching of the schema in /tranql/query is unavoidable, but we could probably expedite the process by enabling asynchronous requests for schema fetching. Since currently it takes about 10 seconds to construct the schema synchronously, my guess is that it would take around 3-4 seconds at most which asynchronous requests.

NOTE: It also seems like the schema is being fetched numerous times in a single query. See these logs:

127.0.0.1 - - [15/Aug/2019 10:26:15] "GET /clincial/icees/schema HTTP/1.1" 200 -
127.0.0.1 - - [15/Aug/2019 10:27:19] "GET /clincial/icees/schema HTTP/1.1" 200 -
127.0.0.1 - - [15/Aug/2019 10:33:40] "GET /clincial/icees/schema HTTP/1.1" 200 -
127.0.0.1 - - [15/Aug/2019 10:36:03] "GET /clincial/icees/schema HTTP/1.1" 200 -
127.0.0.1 - - [15/Aug/2019 10:36:30] "GET /clincial/icees/schema HTTP/1.1" 200 -
127.0.0.1 - - [15/Aug/2019 10:38:46] "GET /clincial/icees/schema HTTP/1.1" 200 -
127.0.0.1 - - [15/Aug/2019 10:38:54] "GET /clincial/icees/schema HTTP/1.1" 200 -
127.0.0.1 - - [15/Aug/2019 10:39:02] "POST /graph/rtx HTTP/1.1" 200 -

If the ICEES schema is being fetched multiple times per query, then the schema must be getting constructed in different places, which would certainly slow the query down. I don't think it's something like the schema getting constructed every time a generated select statement is executed (from /schema) because as seen in the logs above, it happens with single-reasoner queries.

Update: asynchronous schema fetching doesn't really do speed it up at all and complicates the code a lot more than it already is.

frostyfan109 commented 5 years ago

It looks like the schema is being constructed in two places when a query is executed:

However, the schema ends up getting constructed in every single synthetic select statement as well, even though they don't need it.

frostyfan109 commented 5 years ago

Fixed.