Caching state on isolating columns

sebastian commented 6 years ago

In order to protect against the equation attack we need to support the notion of whether a column is isolating or not. This process is described in the following issue: https://github.com/Aircloak/aircloak/issues/2485

@obrok has implemented an initial version that performs the check on a column live as part of the query execution. This is sufficient for the check, but insufficient for a production deployment.

On data source changes we should check each column provided by the data source for whether it is isolating or not. We then cache this information so that we do not need to check the columns again at query time.

Periodically we should refresh this cache, but we can then do it on a column-by-column basis. The properties of a column are unlikely to change much.

@obrok has developed an API as part of this pull-request that provides the ability to detect if a column is isolating or not. He will use it in his ongoing work on the solution for the equation attack. The caching mechanism can be developed in parallel.

Things to take into account:

When a query is run but no cache exists, we can perform the check live and cache the result.
We should update the data source state in the air with a pending state during these operations. When all caches are warm, we can make the data source fully online.
If there are multiple cloaks that serve a data source, then those that are fully online should be chosen preferentially.
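
The first two points can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the cloak's actual code; `IsolatorCache`, `check_live`, and the `(table, column)` key shape are hypothetical names. Only the "pending" and "online" states come from the description above.

```python
from typing import Callable, Dict, Iterable, Tuple

Column = Tuple[str, str]  # (table, column); hypothetical key shape


class IsolatorCache:
    """Caches the isolating flag per column, falling back to a live check."""

    def __init__(self, check_live: Callable[[Column], bool], columns: Iterable[Column]):
        self._check_live = check_live  # hypothetical stand-in for the real check
        self._known: Dict[Column, bool] = {}
        self._all = set(columns)

    def isolating(self, column: Column) -> bool:
        # Cache miss: run the check live, then remember the result.
        if column not in self._known:
            self._known[column] = self._check_live(column)
        return self._known[column]

    def status(self) -> str:
        # "pending" until every column of the data source has a cached result.
        return "online" if self._all.issubset(self._known) else "pending"
```

A query that touches an unchecked column pays for the live check once; later queries on the same column hit the cache.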

The basic implementation has been done in #2753. Here are the remaining things to do:

[x] add periodic recomputation
[x] add some tests for the cache process
[x] fix isolation crashes on other data sources
[x] fix isolation crashes for virtual tables
[x] use cache process in all tests (unit and compliance), to ensure that isolation doesn't break on other data sources - I made it only use the cache in compliance, as the data sources are static there, making it easy to manage, while in regular tests we create a bunch of tables on the fly (yapee)
[x] persist the cache to avoid repriming on restart
[x] ~~improve efficiency with multiple cloaks~~ Extracted to 18.4 as #2791
[x] update data source state in air with whether the data source is ready or not

sebastian commented 6 years ago

This work should target the anonymization branch (it's a branch off of master in order to allow us to make intermediate releases that do not have half-baked anonymization fixes in place).

sebastian commented 6 years ago

Also see: https://github.com/Aircloak/aircloak/pull/2737#discussion_r190593470

sebastian commented 6 years ago

Also a relevant comment on the multiple levels of caching needed for dates and datetimes: https://github.com/Aircloak/aircloak/issues/2485#issuecomment-392073797

I.e. a date(time) might be isolating at the level of a second, but not at the level of an hour etc.
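
A small Python sketch of why the cache needs one entry per granularity. The criterion used here (more than half of the distinct truncated values belong to a single user) is an assumption made for illustration only; the real check is the one described in issue #2485. `LEVELS`, `isolating_at`, and `threshold` are hypothetical names.

```python
from collections import defaultdict
from datetime import datetime
from typing import Iterable, Tuple

# Truncation formats per granularity; this set of levels is an assumption.
LEVELS = {"second": "%Y-%m-%d %H:%M:%S", "hour": "%Y-%m-%d %H", "day": "%Y-%m-%d"}


def isolating_at(rows: Iterable[Tuple[str, datetime]], level: str, threshold: float = 0.5) -> bool:
    """Hypothetical criterion: more than `threshold` of the distinct truncated
    values belong to exactly one user. Not the real check from #2485."""
    users_per_value = defaultdict(set)
    for user_id, ts in rows:
        users_per_value[ts.strftime(LEVELS[level])].add(user_id)
    if not users_per_value:
        return False
    single = sum(1 for users in users_per_value.values() if len(users) == 1)
    return single / len(users_per_value) > threshold


rows = [
    ("u1", datetime(2018, 5, 1, 10, 0, 1)),
    ("u2", datetime(2018, 5, 1, 10, 0, 2)),
    ("u3", datetime(2018, 5, 1, 10, 30, 3)),
]
per_level = {level: isolating_at(rows, level) for level in ("second", "hour")}
```

Under this assumed criterion the same column is isolating at `"second"` and not at `"hour"`, so a cache keyed by column alone would not be enough; the key would need to include the level.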

sasa1977 commented 6 years ago

@sebastian @obrok I have a couple of points I'd like to discuss here.

Checking of the column's isolated property is a potentially long-running operation. It involves two queries per column, which are probably going to run reasonably fast on SQL databases, but I'm not sure about MongoDB.

The first thing I wonder is how much parallelism we should use here. If we issue too many queries in parallel, we could overload the client's database and cause a denial of service. If we run just one query at a time, it could take quite long until we fetch all the stats.

This leads me to the second question: what should we do if the column's isolated property has not been computed yet? We could wait until we have that property, but that could take a while, e.g. if we're running just one query at a time, and the column is the last one in the queue.

Alternatively, we could compute the isolated property on demand. But that could increase the query running time.

Yet another option is to simply proclaim that the column is isolated, until we know better.

Finally, we could have some hybrid scheme, in which on boot we form the queue for computing isolated property, but if some query demands it, we promote the required columns to the top of the queue. This would give us the best of all worlds, but it would require some complexity.
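
To make the hybrid scheme concrete, here is a minimal Python sketch of a queue with promotion. It is illustrative only and not the eventual implementation; `ColumnQueue`, `promote`, and `pop_next` are hypothetical names.

```python
from collections import OrderedDict
from typing import Hashable, Iterable, Optional


class ColumnQueue:
    """FIFO of columns awaiting the check, with promotion to the front."""

    def __init__(self, columns: Iterable[Hashable]):
        # An OrderedDict gives O(1) membership tests and reordering.
        self._pending = OrderedDict((c, None) for c in columns)

    def promote(self, column: Hashable) -> None:
        # A running query needs this column: move it to the head of the queue.
        # Columns that were already processed are ignored.
        if column in self._pending:
            self._pending.move_to_end(column, last=False)

    def pop_next(self) -> Optional[Hashable]:
        # The background worker takes the next column, or None when done.
        if not self._pending:
            return None
        column, _ = self._pending.popitem(last=False)
        return column
```

On boot the queue is filled with all columns; a query that needs column `c` calls `promote(c)` so the worker picks it up next instead of last.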

Finally, the question is how often we should recompute the isolated property. My first thought would be not too often, e.g. once per day.

Thoughts?

obrok commented 6 years ago

Finally, we could have some hybrid scheme, in which on boot we form the queue for computing isolated property, but if some query demands it, we promote the required columns to the top of the queue. This would give us the best of all worlds, but it would require some complexity.

I was thinking about something along these lines. Blocking the query until the queue gets to the column required for it seems unacceptably long. I also don't like telling the analyst that the column is isolating when we don't know yet, because the analyst might cache that in their head and stop using the column altogether. I guess we could also tell the analyst that the status is unknown, and they have to wait, but in that case we either need a mechanism to notify them that the status has been computed or they have to wait an undefined period of time. All in all, the other solutions seem flawed enough that I think the complexity is worth it in this case.

Finally, the question is how often we should recompute the isolated property. My first thought would be not too often, e.g. once per day.

I agree, maybe even once a week.

sebastian commented 6 years ago

Finally, the question is how often we should recompute the isolated property. My first thought would be not too often, e.g. once per day.

I agree, maybe even once a week.

I think the property of a column is unlikely to change very often... i.e. running a query for a single random column per day would probably be fine too.

Although this might be a point where it's worthwhile introducing the notion of whether a data source is static or dynamically changing. If a data source is static, then rechecking makes no sense.

Another question/point: What if there are multiple cloaks? Having each cloak perform the checks independently seems rather wasteful. But sharing state between the cloaks without passing it through the air seems hard. And passing information through the air comes at the risk of it being tampered with. Unless the cloak signs the data or something? I vote for each cloak checking the columns for now, but there is room for improvement here!

sasa1977 commented 6 years ago

I was thinking about something along these lines.

Cool, I'll start working on this approach then.

I agree, maybe even once a week.

One problem with long intervals is that we need some persistence to keep track. Otherwise, we might end up skipping a beat or computing too frequently. There are a couple of fairly simple ways to do this, so it wouldn't be a big problem.

In fact, now I'm starting to think that we should persist the last computed results to improve the restart experience, and avoid recomputing from scratch on restart. A fairly naive and easy way to do it is to store to a local file using :erlang.term_to_binary and its counterpart. We already have a mounted folder (config), where we could store this data.
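
A rough sketch of that idea, in Python for illustration: `pickle` stands in for `:erlang.term_to_binary` and its counterpart, and the file name, the cache shape, and the function names `save_cache` and `load_cache` are assumptions.

```python
import os
import pickle
import tempfile
from typing import Dict, Tuple

Cache = Dict[Tuple[str, str], bool]  # (table, column) -> isolating?


def save_cache(cache: Cache, path: str) -> None:
    # Write to a temp file in the same directory, then rename atomically,
    # so a crash mid-write never leaves a truncated cache file behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(cache, f)
    os.replace(tmp, path)


def load_cache(path: str) -> Cache:
    # A missing or truncated file just means we start with a cold cache.
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (OSError, pickle.UnpicklingError, EOFError):
        return {}
```

As with any serialized-term format, the file should only ever be read from a location the cloak itself controls.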

A more important issue about interval is that there is a theoretical window of time when we don't identify an isolating column, which means a potential privacy issue. I'm guessing we're fine with this? But anyway, I think that this is one reason why we should refresh at least once per day.

What if there are multiple cloaks?

This is a good point. I agree that we should ignore it in the first pass, but I think we can make a lightweight CP, since air is the natural leader. This means that each cloak would ask the air for permission to perform the recomputation, and air makes sure we don't have two of them running. Once the cloak is done, it just broadcasts the update via air, probably signed as you mentioned.
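
The air-side bookkeeping for such a scheme could look roughly like the Python sketch below. It is a sketch under assumptions, not a design: `RecomputeCoordinator`, `request`, and `done` are hypothetical names, and signing of the broadcast results and expiry of a permission held by a crashed cloak are deliberately left out.

```python
from typing import Dict


class RecomputeCoordinator:
    """Hypothetical air-side state: one recomputing cloak per data source."""

    def __init__(self) -> None:
        self._holder: Dict[str, str] = {}  # data source -> cloak id

    def request(self, data_source: str, cloak_id: str) -> bool:
        # Grant permission if nobody holds it, or if the caller already does.
        holder = self._holder.setdefault(data_source, cloak_id)
        return holder == cloak_id

    def done(self, data_source: str, cloak_id: str) -> None:
        # Only the current holder can release the permission.
        if self._holder.get(data_source) == cloak_id:
            del self._holder[data_source]
```

A cloak whose `request` is refused would simply wait for the broadcast update instead of running its own queries.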

Although this might be a point where it's worthwhile introducing the notion of whether a data source is static or dynamically changing. If a data source is static, then rechecking makes no sense.

Also a good point. It could be included in configuration to help administrators reduce the load in some cases.

sebastian commented 6 years ago

A more important issue about interval is that there is a theoretical window of time when we don't identify an isolating column, which means a potential privacy issue. I'm guessing we're fine with this? But anyway, I think that this is one reason why we should refresh at least once per day.

No, we are not fine with potential privacy issues. The options we have (as I see it) are as described above, namely:

reject query and say we don't yet have the state, please try again later
perform the check live (+ cache the result for the future) and make the query slow

Neither solution is ideal... I am more in favor of the second option (but it would be good to make this visible through a status update to the air/query interface so the user understands what is going on)!

sebastian commented 6 years ago

Also a good point. It could be included in configuration to help administrators reduce the load in some cases.

Yes, I think that is a good solution.

Note: we are likely to want to introduce some notion of HA for the air instances in the future for our enterprise customers. Exactly how this works (master - slave or master - master or whatever) and what guarantees we are to give is still up for grabs. (This relates to the idea of air synchronizing the nodes.)

obrok commented 6 years ago

No, we are not fine with potential privacy issues.

I think @sasa1977 meant in case the dataset changes in such a way that a column becomes isolating, but our check is still some time away

sasa1977 commented 6 years ago

I think @sasa1977 meant in case the dataset changes in such a way that a column becomes isolating, but our check is still some time away

Yes, that is the case I was thinking of.

sebastian commented 6 years ago

I think @sasa1977 meant in case the dataset changes in such a way that a column becomes isolating, but our check is still some time away

Ah, I see. That I am fine with 😄

sasa1977 commented 6 years ago

Are you all fine if I start conservatively and compute one column at a time?

obrok commented 6 years ago

Are you all fine if I start conservatively and compute one column at a time?

Seems fine to me

sebastian commented 6 years ago

Are you all fine if I start conservatively and compute one column at a time?

In fact that is my preferred option!

sasa1977 commented 6 years ago

I updated the description and added the remaining things to do. @obrok if you work on any of these, send me a note, and once you're done, please update the description.

obrok commented 6 years ago

Sure thing

sebastian commented 6 years ago

Added one item to the list.

obrok commented 6 years ago

@sasa1977 I'm starting to work on "fix isolation crashes on other data sources".

sasa1977 commented 6 years ago

FYI, I'm working on periodic update.

obrok commented 6 years ago

With some help from @cristianberneanu I fixed the problems for SQLServer and MySQL. I also included the isolator cache in compliance tests. One problem remains for Mongo; @cristianberneanu promised to take a look at it tomorrow, and if either he or I find a solution I'll be able to finally send a PR with all that.

obrok commented 6 years ago

improve efficiency in virtual tables (avoid recomputing columns from different tables)

I don't think we should do that, given that virtual tables are arbitrary queries. A particular virtual table might be very different from the base table in terms of data distribution.

sasa1977 commented 6 years ago

I don't think we should do that, given that virtual tables are arbitrary queries. A particular virtual table might be very different from the base table in terms of data distribution.

Good point, I'll remove it from the list.

sasa1977 commented 6 years ago

I'll take this one:

update data source state in air with whether the data source is ready or not

sebastian commented 6 years ago

@sasa1977 please make sure this state is also available through the data source HTTP API endpoint. We want to performance test the system ahead of release, and our performance test suite will need to know that the host is ready and has cached the columns ahead of running the performance tests.

sasa1977 commented 6 years ago

@sebastian I didn't explicitly add the support for the data source state through the API, because there was no need. Instead I extended the performance test to wait until the data source is in the online status (which also means that columns have been cached).
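
The waiting step could look roughly like this Python sketch. How the status is actually fetched is not specified here, so `get_status` is a hypothetical callable supplied by the test; the timeout and polling interval are likewise assumptions.

```python
import time
from typing import Callable


def wait_until_online(get_status: Callable[[], str],
                      timeout: float = 600.0,
                      interval: float = 1.0) -> bool:
    """Poll a (hypothetical) status getter until it reports "online".

    Returns True once the data source is online, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_status() == "online":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The performance suite would call this once per data source before starting its measurements, and abort if it returns `False`.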

With that, I believe that this task is done, and I'll close it.

Caching state on isolating columns #2738