Avoid crashing - Githubissues

sebastian commented 7 years ago

The cloak crashing because Erlang doesn't like running out of memory is sad.

A rudimentary solution would be to have a cloak-process monitoring the available memory on the host on which it is deployed. When it sees the host is about to run out of memory it can:

reject new queries
kill an existing query with an "Out of memory" error to bring the memory consumption back under control

Ideally the process would have a notion of the heap size of the individual queries, and kill one of the queries where the amount of data is still growing, rather than one that has plateaued, or is even in the processing of shrinking it's footprint.

Most importantly though we should fight hard to not crash the VM.

This is increasingly important now that we support in-cloak database emulation which puts the cloak's under a lot of additional strain when it comes to memory usage.

obrok commented 7 years ago

@sasa1977 shouldn't this crash the offending process first if there is one process with huge mem usage? It seems like that would be the case in our system

sebastian commented 7 years ago

We have seen lots of cloak-VM crashes due to running out of memory (especially in the old Cloak Zero system). Rather than crashing processes selectively it crashes the whole VM, without giving the VM any chance to reduce the memory usage first.

obrok commented 7 years ago

That seems un-erlang in spirit :(

sasa1977 commented 7 years ago

As a quick fix, we can set the query runner process max heap size, thus causing the process to crash if it takes up too much memory. See here for details.

sebastian commented 7 years ago

As a quick fix, we can set the query runner process max heap size, thus causing the process to crash if it takes up too much memory. See here for details.

Hm, this is a bit tricky though.

When do we set it?
- Based on available memory when the query starts?
Do we adjust it during while the query is running depending on how many other queries are running?
Do we adjust it during while the query is running depending on how much other load there is on the host?

sasa1977 commented 7 years ago

Other than that, Erlang doesn't really deal automatically with high memory usage, so we need to do it ourselves. Currently our problem is not in the external load, but more likely in the non-optimized code, so we need to handle that.

Past that point, a typical solution is to refuse the request when some threshold is reached. IIRC, Ulf's jobs can be used for this, but I never worked with it so not 100% sure.

cristianberneanu commented 7 years ago

This is a very hard problem. Out of memory exceptions can not be caught in Erlang. Putting limit on processes will have limited effects, as we can still have memory spikes when multiple queries are run. Monitoring the host memory will also be hard when multiple cloaks are running at the same time (which cloak kills what?)

sebastian commented 7 years ago

Each cloak monitoring memory usage on the host might cause multiple co-located cloak's killing queries, ending up killing more than is needed. This is probably still favourable to the VMs crashing. We could also recommend that cloak's do not run on the same hardware which probably is recommendable anyway.

sasa1977 commented 7 years ago

This is a very hard problem. Out of memory exceptions can not be caught in Erlang. Putting limit on processes will have limited effects, as we can still have memory spikes when multiple queries are run. Monitoring the host memory will also be hard when multiple cloaks are running at the same time (which cloak kills what?)

This problem can be approached systematically and gradually if we know, or we can estimate how much memory a query execution will take. By keeping track of all currently running queries we know that we currently need N * M memory, where M is the estimated memory usage of the query execution. When the next query enters the system, we can decide upfront if we have enough of memory to accommodate it or not. If not, we can immediately reject it, or store it into a pending queue.

Estimating query's memory usage is of course hard, but we can start very dumb and simple by choosing some value. We can then improve on this gradually. For example, after compiling the query, we should know whether we can stream, or we need to expand, and that can affect the outcome of the estimate. It becomes complex, but it can be done in phases and in the straightforward manner.

I propose we remove the multiple cloaks problem off the table. Cloak has a potentially large memory requirements, so supporting running more of them on the same machine should be discouraged. The fact we do this ourselves is just our internal hack.

cristianberneanu commented 7 years ago

we can estimate how much memory a query execution will take.

This depends on the query type and data set size.

As a quick fix, we can set the query runner process max heap size, thus causing the process to crash if it takes up too much memory.

Memory usage for a query is spread over multiple processes.

sebastian commented 7 years ago

But do we even need to be clever at all? Can't we just look at the ground truth provided by the host system, and then kill queries based on that? Rather than trying to estimate future memory load?

If we want to be clever, we can still put the queries we have to kill into a queue, and retry once the memory usage drops (failing altogether if the situation doesn't improve)

sebastian commented 7 years ago

Memory usage for a query is spread over multiple processes.

Oh, I somehow thought it was mostly all running in the context of the same process (of course the DB connection is a different one, but besides that, isn't it all done pretty much serially?)

sasa1977 commented 7 years ago

Can't we just look at the ground truth provided by the host system, and then kill queries based on that?

You could do something like this by having a separate process which monitors the memory usage and decides to kill queries once some threshold is reached. However, I wouldn't advise such approach, because I find it harder to reason about.

Typical recommendations I've seen is to decide at the system boundaries whether you have the capacity to handle the request or not. To do this, we need to have some knowledge about the request's memory requirements. I'm assuming here that we don't have constant memory usage, i.e. based on the input (type of query, datasource type, number of rows), the memory usage can grow indefinitely. If that's the case, and we can't flatten our memory usage, I think we should try to estimate it, and make the decision based on that estimate.

This depends on the query type and data set size.

As I said, we can refine our estimation based on that.

Memory usage for a query is spread over multiple processes.

That's a technical detail. But I'm curious which processes do you talk about? I can only think of data source connections and query runner.

cristianberneanu commented 7 years ago

That's a technical detail. But I'm curious which processes do you talk about? I can only think of data source connections and query runner.

A normal query only runs in the context of the database connection owner process. An emulated process runs in the context of the database connection owner process and in the context of the temporary process launched by the Query Runner. If we want to execute joins in parallel, we need to launch additional processes.

sebastian commented 7 years ago

I'm assuming here that we don't have constant memory usage, i.e. based on the input (type of query, datasource type, number of rows), the memory usage can grow indefinitely. If that's the case, and we can't flatten our memory usage, I think we should try to estimate it, and make the decision based on that estimate.

How do you want to estimate the memory usage of a process where the memory requirements grows infinitely? If it potentially does that, we would have to outright reject it upon receiving it.

sasa1977 commented 7 years ago

How do you want to estimate the memory usage of a process where the memory requirements grows infinitely? If it potentially does that, we would have to outright reject it upon receiving it.

But it should grow deterministically right? Based on the input size, query type, the fact whether we're emulating or not, row expanders and such, we can make some estimate. I'm not saying it's easy, but some number can be computed.

sebastian commented 7 years ago

Based on the input size

I think the problem is the input size being unknown. We are potentially querying an evolving dataset. Hence any past knowledge might be invalid. Furthermore, it's greatly influenced by the WHERE-clause parameters in ways that are hard to predict (very specific to the underlying data).

I have the feeling that trying to estimate this is going to do more harm than good, and that we are better of with a much dumber reactive solution.

sasa1977 commented 7 years ago

The issue I have with doing it later is that the system can accept a query which will eventually fail, process it for a long time, and then fail due to OOM. Imagine you take a seat at a restaurant, make your order, and then an hour later the waiter tells you they don't have the capacity to handle your order. That has got to suck badly.

The estimate approach is IMO more fair because it rejects a request immediately, or it can hold it off until there is capacity.

But let's take a step back here. I assume the issue happens because of emulated queries, right? If that's the case, we should maybe consider how can we constrain memory usage of such queries. I don't think we should allow a single query to consume infinite amount of memory, because the system won't be able to serve all sorts of queries.

sebastian commented 7 years ago

But let's take a step back here. I assume the issue happens because of emulated queries, right?

Actually I haven't seen a crash for a long time. The last time I experienced crashes was when querying the NYC Taxi database. It is the largest dataset we have by far. It rapidly crashes if you don't heavily limit the data you want to query (which makes querying the dataset a lot less interesting).

The datasets we are going to query using the emulation engine are going to remain small for a while. That being said, emulated query support is what made me write this issue.

I don't think we should allow a single query to consume infinite amount of memory

We need to spend some time this spring to see if we can come up with fully streaming anonymization approaches for our most commonly used functions. Until we do we are not going to be able to process truly large datasets. Alternatively (additionally) we can start to selectively sample data.

The issue I have with doing it later is that the system can accept a query which will eventually fail, process it for a long time, and then fail due to OOM.

True, that is going to be annoying. I wonder though if it will actually turn out to be a problem if we kill the most recent queries first (last-in-first-kill)? That way the extra execution time needed is minimised, and we avoid prematurely rejecting queries that in practise could have executed without problem.

Furthermore we have the problem that a single query might kill the cloak all by itself. That is a) hard to anticipate, and b) still needs to be handled gracefully (again, the NYC taxi database being an example)

We are also likely to see ourselves executing in partially shared environments, where other applications might influence the memory available to us. Hence I doubt we can rely exclusively on heuristics based on our internal state.

sasa1977 commented 7 years ago

True, that is going to be annoying. I wonder though if it will actually turn out to be a problem if we kill the most recent queries first (last-in-first-kill)? That way the extra execution time needed is minimised, and we avoid prematurely rejecting queries that in practise could have executed without problem.

Any query could be the culprit. Perhaps it's the oldest one, perhaps it's the newest one, or maybe it's the one with the largest trend of heap increase. Properly choosing the victim could be tricky.

We need to spend some time this spring to see if we can come up with fully streaming anonymization approaches for our most commonly used functions. Until we do we are not going to be able to process truly large datasets.

Another option is to use disk for intermediate data.

sebastian commented 7 years ago

Any query could be the culprit. Perhaps it's the oldest one, perhaps it's the newest one, or maybe it's the one with the largest trend of heap increase. Properly choosing the victim could be tricky.

And in my head that's actually itself a reason not to try all too hard to be smart, and rather just have a simple heuristic for killing the from the last one forward.

The goal would then be to:

keep the system alive
provide a signal to the admin that the system is no longer provisioned appropriately, and needs more memory
tell the analyst that they might benefit from restricting the dataset being queried

sebastian commented 7 years ago

Ok, this issue now has higher priority!

Felix and I met with Michael from TeamBank and wrote some queries tonight. Since the performance was incredibly slow we ended up running multiple queries in parallel. As a result the cloak kept on crashing. Even a simplistic approach that kills more than is needed is better than a cloak that crashes.

For reference, I believe the VM the cloak is running on has 32GB of ram. It might very well be that the VM is shared with the MongoDB instance too, which obviously complicates things. The number of users is around 2000 and the number of accounts 8000, so the dataset as such really isn't all that large.

cristianberneanu commented 7 years ago

The number of users is around 2000 and the number of accounts 8000,

Something else seems wrong in that case. With such a small amount of data everything should fit in memory with space left to spare, even when doing JOINs. Is there some memory limit put on the cloak?

sebastian commented 7 years ago

I was running a number of parallel queries, maybe 4 or so? And they were iterating through the transactions (umsatz) table multiple times... I don't have any stats about the number of transactions per account, but if we let's for the sake of argument say it's 100. Then we are at 8 million rows being loaded many times in parallel.

It could of course still be that something else is off.

Unfortunately our main sysadmin contact is on vacation. I'll see if there is anyone else that can give us logs.

cristianberneanu commented 7 years ago

Was there a lot of data selected in each row?

sebastian commented 7 years ago

The queries were different variations of:

SELECT 
  bucket(income by 200 align middle) as income_class, 
  count(*)
FROM (
  SELECT inhaberId, median(monthly_income) as income FROM (
    SELECT 
      inhaberId,
      year(buchungsDatum) as y,
      month(buchungsDatum) as m,     
      sum(betrag) as monthly_income
    FROM umsatz
    WHERE betrag >= 0 and betrag < 1000000000
    GROUP BY inhaberId, year_month
  ) as income_by_month
  GROUP BY inhaberId, y, m
) as user_incomes
GROUP BY income_class

Sometimes additionally with JOINing another table. Sometimes with an additional WHERE-clause in the innermost subquery.

So in other words, not a lot of data selected per row. Only a couple columns.

sasa1977 commented 7 years ago

but if we let's for the sake of argument say it's 100. Then we are at 8 million rows being loaded many times in parallel.

Wouldn't with 8000 accounts this be more like 800,000 rows?

The queries were different variations

I wonder if we have mostly similar queries whether we end up fetching exactly the same data from multiple queries. If that is indeed the case, perhaps we could consider some way of tapping into an already existing retrieval/emulation, rather than starting the new one which is going to retrieve the same data.

Obviously this is a limited solution that can only work for some favorable cases, but for such cases it might help significantly.

sebastian commented 7 years ago

You are right. 800k rows, obviously... Not 8 million... 🥇

Tapping into an ongoing query emulation sounds complicated... maybe some query-level caching could help, but caching is also super easy to get wrong/stale.

cristianberneanu commented 7 years ago

800k rows should not pose a problem. I worry there is a bug somewhere that makes things very inefficient.

I don't like at all the thought of killing queries. At most, we should paused until there are enough resources.

But before trying more extreme options, I would like to first look at problematic queries and try to improve some of the memory hungry steps. In the above query, I see a few areas that might be problematic:

median is very expensive (CPU and RAM)
- JOINs are done in parallel and maybe they should be done sequential
- Elixir streams might not be as memory friendly as we thought
- grouping in memory might also be too expensive at the moment

sebastian commented 7 years ago

Irrespective of all of the above (which features are expensive and that killing queries isn't nice) we still need a way to prevent that cloak's crash! However good we make the individual components we are still likely to end up in situations where we a number of parallel queries exceed the resources the cloak is given.

Crashing the whole cloak is significantly worse than failing the query with a user-legible error!

So yes: we should look at all of the above individually, and improve their performance, but that is orthogonal to this issue.

sasa1977 commented 7 years ago

This is going to tricky. We could start by limiting the amount of parallel not-streamed queries (I think currently it's only emulated, right?), because they have larger memory requirements. We could be conservative, and start with allowing just a few (maybe even only one?) such queries simultaneously.

Another thing we could do is introduce a query progress UI, which would allow analysts to easily see their currently running queries (even after they refresh the page, or visit the site in another tab). They could then see that say one query is running, while others are queues, and manually cancel the ones which they want.

That approach could work well for the case you mentioned:

Since the performance was incredibly slow we ended up running multiple queries in parallel.

Clearly, we don't support multiple parallel queries in this case (because the cloak crashed). So the best we can do is to put new queries on hold and allow you to e.g. cancel the slow query.

sebastian commented 7 years ago

This is going to tricky. We could start by limiting the amount of parallel not-streamed queries (I think currently it's only emulated, right?), because they have larger memory requirements. We could be conservative, and start with allowing just a few (maybe even only one?) such queries simultaneously.

Maybe I am naïve, but I don't see what the trickiness here is. I am all in favour of something that is simple and stupid.

Trying to be clever here is going to cause more trouble than it's worth
Limiting the amount of parallelism is at best a bandaid. It doesn't actually solve the problem
Pausing queries (like suggested by @cristianberneanu doesn't really solve the problem either, as it doesn't reduce the memory pressure. Of course if we notice the problem early enough we could pause a query while waiting for another to finish. That would work. And then if pausing isn't enough we could kill the paused queries... but this might already be too much smarts! Ideally this should only be enough to cover the worst worst case that over time we hopefully hardly ever get into.

That approach could work well for the case you mentioned:

Since the performance was incredibly slow we ended up running multiple queries in parallel.

In fact it makes the case worse. For a while running parallel queries worked well. For example I could run other quick probing queries while waiting for a long one to finish. Not allowing parallel queries here would just make it all worse.

cristianberneanu commented 7 years ago

What is your proposed solution to this? Since a single query can crash the cloak, everything we do will be a band-aid. Removing JOIN parallelism helps reduce memory spikes and also isolates the query to a single process. Otherwise, we might not even know what to kill/pause.

sebastian commented 7 years ago

My suggestion is actually exactly the same as in the original issue:

we monitor memory at a fixed interval (say second or so?)
we keep a list of running queries
if the memory checks indicate that the host is about to run out of memory for whatever reason, we start killing queries from the most recent query forward until the memory stabilises itself
killing a query would look very much like our current implementation for cancelling a query, only that the error reported to the user would be a slightly more wordy version of "Out of memory"

This might cause a query to be killed even if it is the only query running. But this seems like the correct behaviour. If the cloak can't cope with that particular query because it has insufficient resources, then well, that query can't be run until the cloak has been given more resources... it's much better to get an "Out of error" message that can be acted upon than have the cloak crash mysteriously beneath your feet!

sebastian commented 7 years ago

Of course, the above might still fail if memory grows faster than we can react, but we are likely to catch the most notorious cases like this.

Once we get a little smarter we could also delay starting a query if a certain minimum amount of memory isn't available (be it 1GB of memory, or whatever)...

cristianberneanu commented 7 years ago

My reservations regarding the above method are:

polling for memory usage seems unreliable
it doesn't help when running a single query
keeping JOIN parallelism increases the amount of memory a single query can consume
in order to be able to handle out-of-memory errors reliably, it means we need to practically reserve a certain amount of RAM (something like 20%) as unusable
instead of cancelling some queries, why can't we restart them later, once memory usage drops?
how will this interact with other software on the system? what if the same machine also hosts the database or other cloaks? what if there is a memory limit on the container?

sasa1977 commented 7 years ago

After thinking about this a bit more I agree that it might solve some cases, and the implementation might be fairly simple. I'm fairly certain that this won't be a sufficient solution, but we can expand later as needed.

I think we should use a fairly low interval, probably only a few ms. I don't expect this to cause a significant problem in most cases, since OOM is an exceptional situation, and hence the process will mostly only poll for memory usage.

sebastian commented 7 years ago

it doesn't help when running a single query

It absolutely does! It's much better that the query fails with an "There wasn't enough memory, please adjust" than the cloak crashing and the analyst having to contact their admin to get it back up, which might take hours...

in order to be able to handle out-of-memory errors reliably, it means we need to practically reserve a certain amount of RAM (something like 20%) as unusable

Either that, or we look at memory growth and kill a query when the memory growth is such that we can expect to run out of memory within N checking intervals.

instead of cancelling some queries, why can't we restart them later, once memory usage drops?

We certainly could. The reason not to would be to keep the mechanism simple and stupid. Re-attempting adds new corner cases. That's certainly something worth looking into, but trying to prevent crashing has higher priority in my books.

how will this interact with other software on the system? what if the same machine also hosts the database or other cloaks? what if there is a memory limit on the container?

This is a valid concern. Ideally the cloak would be deployed in isolation (and we should recommend this to all customers). If it isn't then it is still interesting to detect and kill queries when other apps start consuming excessive amounts of memory. Again the alternative presumably would have been the cloak crashing, which is the worst outcome.

sebastian commented 7 years ago

Another gruelling story from the trenches: I had a one hour call scheduled with TeamBank yesterday where we wanted to go through some queries they had written. They had some questions about how to query for certain things. The first query we ran crashed the cloak, and brought air to a crawl too. This left us without a working cloak and with an air that for all intents and purposes didn't work either for the rest of the hour... this is the kind of stuff I would like to at least attempt to prevent from happening again.

cristianberneanu commented 7 years ago

Sad to hear that. What I am curious about: can't the docker containers be made to restart after a crash?

sebastian commented 7 years ago

Sometimes they seem to do that... I think a second problem here is that they are using links to connect the containers, since our user guide explained it that way. If the cloak container restarts while the air is down (assuming they restart), then it won't be able to connect since the container it tries to link to doesn't exist... I am not sure it can recover from this automatically.

sasa1977 commented 7 years ago

Container can be made to restart on crash, either by a docker option, or by using e.g. init.d. If cloak links to air, and air is up, then upon restart it should link properly. However, if cloak is restarted when air is down, then it won't work. Also, if air is restarted, cloak won't be able to reconnect.

Linking is used in guides because it's the simplest way to connect two containers. But I don't think it's really suitable for proper production. I'm assuming that air and cloak will usually run on separate machines, and standard networking will be used to access air from the cloak. In such cases, both containers can restart separately.

sebastian commented 7 years ago

I have told them not to use links, but I think that's the way they have done it. I'll reiterate for when there is a proper production deployment.

sebastian commented 7 years ago

Hm... it seems they are running MongoDB, air, air_db, cloak all on a VM with 2GB of ram. The dataset is small though. About 600mb.

I have asked them to bump the memory to 16GB. Hopefully that reduces the number of crashes.

Aircloak / aircloak

Avoid crashing #953