Open dandanlen opened 4 years ago
The token validation logic has side effects. It can get out of sync during high load. I'll try to replicate.
No reproduction as of yet. If someone manages to trigger this error please provide details here.
@sebastian has reported that there are random failures from a user running queries under high load through the postgres interface. Could they possibly be related?
Yes, that could very well be! I am wondering, and kind of hate the idea, whether we should cache the user credential validation step in a GenServer or something similar (with a short expiry, say 1 second). That could dramatically reduce the database connection pool pressure under high load.
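To make the caching idea concrete, here is a minimal sketch in Python rather than Elixir, purely for illustration. `TTLValidationCache` and the `validate` callback are hypothetical names, not anything that exists in air; the callback stands in for the DB-backed credential check. The lock plays the role a GenServer would play by serializing access.

```python
import threading
import time
from typing import Callable, Dict, Tuple


class TTLValidationCache:
    """Remembers token validation results for a short time (hypothetical helper)."""

    def __init__(
        self,
        validate: Callable[[str], bool],
        ttl_seconds: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._validate = validate  # assumed to be the expensive, DB-backed check
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        # token -> (expiry timestamp, cached validation result)
        self._entries: Dict[str, Tuple[float, bool]] = {}

    def is_valid(self, token: str) -> bool:
        # Holding the lock for the whole call serializes validations,
        # much like funnelling every request through a single process.
        with self._lock:
            now = self._clock()
            entry = self._entries.get(token)
            if entry is not None and now < entry[0]:
                return entry[1]  # fresh cache hit, no DB round trip
            result = self._validate(token)
            self._entries[token] = (now + self._ttl, result)
            return result
```

The trade-off is that a revoked token could still be accepted for up to `ttl_seconds`, and expired entries are never evicted in this sketch, so a real version would need some cleanup.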
My suspicion is that the large load of query updates pouring back from cloak exhausts the DB pool in air. Add to this the constant authentication requests. This exhaustion causes other updates to fail randomly... I don't have a reliable reproduction for any of these issues. In fact, I don't have any kind of reproduction at all!
The jobs queue was well thought out in the state updater. I may have found why it was freezing, because I ran into something similar when limiting analysis queries in cloak. I think :jobs freezes if you try to re-initialize the queue (for example, when a supervisor restarts it). With proper configuration we can have the queue as a singleton and never touch it afterwards.
Related info from Durak from Bosch (the person @edongashi referred to above). She writes:
To update you, I had the similar problem on the remote server. It seems like postgresql server complains and stops. I reduced the number of queries per connection to below 20K (no parallel queries), and it seems working.
By "remote server" she means running the script from a server as opposed to from her own machine where her local network could have been the cause of the connections breaking/timing out.
So this could be related to the problem you refer to above, or it could be a matter of our Postgres connection state machine getting out of a valid state when you use it for too long (some edge case triggered after it having run for a while, or some leak not detected when running a moderate number of queries).
I'll test this locally to see if I can break it.
I have been running 7 parallel clients each queuing 500 simultaneous queries. The system handles it extremely well and did not drop any request at all. The query is very simple, but with such a high load it receives more queries than it can chew. As such I don't think a high parallel load is causing the issues.
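For anyone who wants to try a similar load test, a rough sketch of such a harness is below. It is not the script used above: `run_load` and the `run_query` callback are hypothetical, and `run_query` is assumed to send one query through whatever client you use and return True on success.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

Failure = Tuple[int, int, str]


def run_load(
    run_query: Callable[[int, int], bool],
    clients: int = 7,
    queries_per_client: int = 500,
    concurrency: int = 64,
) -> Tuple[int, List[Failure]]:
    """Fire clients * queries_per_client calls concurrently.

    Returns (number of successful queries, list of failures), where each
    failure is (client_id, query_id, reason).
    """

    def one(job: Tuple[int, int]) -> Tuple[int, int, Optional[str]]:
        client_id, query_id = job
        try:
            ok = run_query(client_id, query_id)
            return (client_id, query_id, None if ok else "rejected")
        except Exception as exc:  # record the error instead of aborting the run
            return (client_id, query_id, repr(exc))

    jobs = [(c, q) for c in range(clients) for q in range(queries_per_client)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one, jobs))  # map preserves job order

    failures = [(c, q, err) for c, q, err in results if err is not None]
    return len(results) - len(failures), failures
```

Recording which client and query failed, rather than just a count, should make it easier to see whether failures cluster around long-running clients or large results.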
Maybe if the result of a query is too big it breaks the channel - I'll have to test that. Also need to determine if a long running client eventually malfunctions...
Air seems to randomly reject valid api tokens.
The API token is only passed to the Explorer once, upon the initial request. The request parameters are validated, which implicitly validates the API token (params are validated via the /datasources API endpoint, which requires authentication). So the token should be valid. Here is an example response. The query is not very important; this happens for all kinds of queries. It appears to happen mainly when the system is under high load, i.e. when Explorer sends lots of concurrent queries.
If needed I can provide other examples like the one below.