Valid api token rejected

dandanlen commented 4 years ago

Air seems to randomly reject valid api tokens.

The api token is only passed to the Explorer once, with the initial request. The request parameters are validated, which implicitly validates the api token (params are validated via the /datasources api endpoint, which requires authentication). So the token should be valid.

Here is an example response. The query itself is not very important; this happens for all kinds of queries. It appears to happen mainly when the system is under high load, i.e. when Explorer sends lots of concurrent queries.

If needed I can provide other examples like the one below.

Request Error: Unauthorized -- Your API token is wrong.
Method: POST, RequestUri: 'https://demo.aircloak.com/api/queries', Version: 1.1, Content: System.Net.Http.StringContent, Headers:
{
  auth-token: SFMyNTY.g2gDbQAAACQ3OGNhNGYxNy1hNTU1LTQwOGItOTVhYS1jY2M3MjhiNjUzMzVuBgCXdAD8cwFiAAFRgA.c9SyzLidhpvUdS_bzCLWnNB1GGns7zq-kKRrHSqu2zI
  Request-Id: |f636e7da-4ea1f8b3ec7f9577.913.
  Content-Type: application/json
  Content-Length: 1265
}
{"query":{"statement":"\n                select\n                    concat(s0, s1, s2, s3, s4) as sstr,\n                    sum(count),\n                    sum(count_noise),\n                    case\n                        when s0 is not null then 0\n                        when s1 is not null then 1\n                        when s2 is not null then 2\n                        when s3 is not null then 3\n                        when s4 is not null then 4\n                    end as i\n                from (\n                    select\n                        substring(\u0022User ID\u0022, 21, 3) as s0,\n                        substring(\u0022User ID\u0022, 22, 3) as s1,\n                        substring(\u0022User ID\u0022, 23, 3) as s2,\n                        substring(\u0022User ID\u0022, 24, 3) as s3,\n                        substring(\u0022User ID\u0022, 25, 3) as s4,\n                        count(*),\n                        count_noise(*)\n                    from \u0022CustomerLedgerEntry\u0022\n                    group by grouping sets (s0, s1, s2, s3, s4)\n                    ) as substring_counts\n                group by s0, s1, s2, s3, s4\n                having length(sstr) = 3","data_source_name":"NAV_W1_TENANT_WS001"}}
{"description":"Invalid auth-token. This could be a result of the auth-token being incorrectly sent to the API backend, or the auth-token having been revoked. You can validate that your auth-token is still valid by visiting http://demo.aircloak.com:80/api_tokens.","success":false}

edongashi commented 4 years ago

The token validation logic has side effects. It can get out of sync during high load. I'll try to replicate.

edongashi commented 4 years ago

No reproduction as of yet. If someone manages to trigger this error please provide details here.

edongashi commented 4 years ago

@sebastian has reported that there are random failures from a user running queries under high load through the postgres interface. Could they possibly be related?

sebastian commented 4 years ago

Yes, that could very well be! I am wondering, and kind of hate the idea, whether we should cache the user credential validation step in a GenServer or something similar (short expiry – 1 second or something?). That could dramatically reduce the database connection pool pressure under high load.
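To make the caching idea concrete, here is a minimal sketch in Python rather than as a GenServer. Everything in it is an assumption for illustration: the `TokenValidationCache` class, the `validate` callback (standing in for whatever database-backed check Air performs), and the choice to cache only successful validations so that a spurious rejection is never remembered.

```python
import threading
import time
from typing import Callable, Dict


class TokenValidationCache:
    """Remembers successful token validations for a short time.

    `validate` is a hypothetical callback standing in for the real,
    database-backed credential check.
    """

    def __init__(
        self,
        validate: Callable[[str], bool],
        ttl_seconds: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._validate = validate
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}
        self._lock = threading.Lock()

    def is_valid(self, token: str) -> bool:
        now = self._clock()
        with self._lock:
            expiry = self._expires_at.get(token)
            if expiry is not None and now < expiry:
                return True  # fresh cached success, no database round trip

        # Cache miss or expired entry: run the real check outside the lock.
        # Concurrent misses for the same token may each run the check.
        ok = self._validate(token)
        with self._lock:
            if ok:
                self._expires_at[token] = now + self._ttl
            else:
                # Failures are not cached, so a transient rejection is retried.
                self._expires_at.pop(token, None)
        return ok
```

The trade-off of this design is that a revoked token keeps being accepted until its cached entry expires, which is why the expiry would have to stay short.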

edongashi commented 4 years ago

My suspicion is that the large load of query updates pouring back from cloak exhausts the DB pool in air. Add to this the constant authentication requests. This exhaustion causes other updates to fail randomly... I don't have a reliable reproduction for any of these issues. In fact I don't have any kind of reproduction at all!
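The suspected failure mode can be illustrated with a toy model. This is not Air's actual pool; `BoundedPool`, `PoolTimeout`, and the timeout value are made up for the sketch. It only shows how a fixed-size pool with a checkout timeout makes unrelated callers fail once every connection is held.

```python
import threading
from contextlib import contextmanager


class PoolTimeout(Exception):
    """Raised when no connection becomes free within the checkout timeout."""


class BoundedPool:
    """Toy connection pool: `size` slots, callers wait at most `checkout_timeout`."""

    def __init__(self, size: int, checkout_timeout: float) -> None:
        self._slots = threading.BoundedSemaphore(size)
        self._timeout = checkout_timeout

    @contextmanager
    def connection(self):
        # Wait for a free slot; give up after the timeout instead of blocking forever.
        if not self._slots.acquire(timeout=self._timeout):
            raise PoolTimeout("no free connection within timeout")
        try:
            yield
        finally:
            self._slots.release()
```

With a pool of size 2 and two connections already checked out, a third `pool.connection()` raises `PoolTimeout` no matter what the caller was trying to do. Under this model, an authentication check and a query state update would fail the same way.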

The jobs queue in the state updater was well thought out. I may have found why it was freezing, because I ran into something similar when limiting analysis queries in cloak. I think :jobs freezes if you try to re-initialize the queue (if a supervisor restarts it). With proper configuration we can have the queue as a singleton and never touch it again.

sebastian commented 4 years ago

Related info from Durak from Bosch (the person @edongashi referred to above). She writes:

To update you, I had the similar problem on the remote server. It seems like postgresql server complains and stops. I reduced the number of queries per connection to below 20K (no parallel queries), and it seems working.

By "remote server" she means running the script from a server as opposed to from her own machine where her local network could have been the cause of the connections breaking/timing out.

So this could be related to the problem you refer to above, or it could be a matter of our Postgres connection state machine getting out of a valid state when you use it for too long (some edge case triggered after it having run for a while, or some leak not detected when running a moderate number of queries).

edongashi commented 4 years ago

"it could be a matter of our Postgres connection state machine getting out of a valid state when you use it for too long (some edge case triggered after it having run for a while, or some leak not detected when running a moderate number of queries)."

I'll test this locally to see if I can break it.

edongashi commented 4 years ago

I have been running 7 parallel clients, each queuing 500 simultaneous queries. The system handles it extremely well and has not dropped a single request. The query is very simple, but under such a high load the system receives more queries than it can chew. As such, I don't think a high parallel load is causing the issues.
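For anyone who wants to repeat a test of this shape, the following is a rough sketch of a load generator, not the script actually used here. `send_query` is a hypothetical callback that submits one query and returns `True` on success; how it talks to Air (HTTP API or the postgres interface) is left open.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_client(
    send_query: Callable[[int, int], bool], client_id: int, n_queries: int
) -> int:
    """Fire n_queries concurrently from one client; return how many failed."""

    def attempt(i: int) -> bool:
        try:
            return send_query(client_id, i)
        except Exception:
            return False  # count raised errors as failed requests

    # One worker thread per query, so all queries are in flight together.
    with ThreadPoolExecutor(max_workers=n_queries) as pool:
        results = list(pool.map(attempt, range(n_queries)))
    return sum(1 for ok in results if not ok)


def run_load_test(
    send_query: Callable[[int, int], bool],
    n_clients: int = 7,
    queries_per_client: int = 500,
) -> List[int]:
    """Run all clients in parallel; return the failure count per client."""
    with ThreadPoolExecutor(max_workers=n_clients) as clients:
        futures = [
            clients.submit(run_client, send_query, c, queries_per_client)
            for c in range(n_clients)
        ]
        return [f.result() for f in futures]
```

A result of all zeros corresponds to "did not drop any request". Recording the failed request bodies alongside the counts would help if the rejection ever does show up.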

Maybe the channel breaks if the result of a query is too big. I'll have to test that. I also need to determine whether a long-running client eventually malfunctions...

Aircloak / aircloak

Valid api token rejected #4655