Closed sebastian closed 6 years ago
This work should target the anonymization branch (it's a branch off of master in order to allow us to make intermediate releases that do not have half-baked anonymization fixes in place).
Also relevant is this comment on the multiple levels of caching needed for dates and datetimes: https://github.com/Aircloak/aircloak/issues/2485#issuecomment-392073797
I.e. a date(time) might be isolating at the level of a second, but not at the level of an hour etc.
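To make the granularity point concrete, here is a minimal sketch in Python. It is illustrative only and not the cloak's code. The definition of "isolating" used here (more than half of the distinct values belong to exactly one user) is an assumption made for the example; the real criterion is described in the linked issue. The names `LEVELS`, `is_isolating` and `isolation_by_level` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical truncation levels; the thread itself only mentions seconds and hours.
LEVELS = {
    "second": lambda t: t.replace(microsecond=0),
    "hour": lambda t: t.replace(minute=0, second=0, microsecond=0),
    "day": lambda t: t.replace(hour=0, minute=0, second=0, microsecond=0),
}


def is_isolating(rows, truncate, threshold=0.5):
    """rows is an iterable of (user_id, datetime) pairs.

    ASSUMED definition: the column is isolating at this level when more than
    `threshold` of its distinct truncated values belong to exactly one user.
    """
    users_per_value = defaultdict(set)
    for user_id, value in rows:
        users_per_value[truncate(value)].add(user_id)
    if not users_per_value:
        return False
    single_user = sum(1 for users in users_per_value.values() if len(users) == 1)
    return single_user / len(users_per_value) > threshold


def isolation_by_level(rows):
    """Evaluate the same column once per truncation level, so each level can be cached separately."""
    rows = list(rows)
    return {name: is_isolating(rows, truncate) for name, truncate in LEVELS.items()}
```

With three users whose timestamps differ only by a second, this reports the column as isolating at the second level but not at the hour or day level, which is why one cached flag per column would not be enough.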
@sebastian @obrok I have a couple of points I'd like to discuss here.
Checking of the column's isolated property is a potentially long running operation. It involves two queries per column, which are probably going to run reasonably fast on SQL databases, but I'm not sure about MongoDB.
The first thing I wonder is how much parallelism we should use here. If we issue too many queries in parallel, we could overload the client's database and cause a denial of service. If we run just one query at a time, it could take quite long until we fetch all the stats.
This leads me to the second question: what should we do if the column's isolated property has not been computed yet? We could wait until we have that property, but that could take a while, e.g. if we're running just one query at a time, and the column is the last one in the queue.
Alternatively, we could compute the isolated property on demand. But that could increase the query running time.
Yet another option is to simply proclaim that the column is isolated, until we know better.
Finally, we could have some hybrid scheme, in which on boot we form the queue for computing isolated property, but if some query demands it, we promote the required columns to the top of the queue. This would give us the best of all worlds, but it would require some complexity.
Finally, the question is how often we should recompute the isolated property. My first thought would be not too often, e.g. once per day.
Thoughts?
Finally, we could have some hybrid scheme, in which on boot we form the queue for computing isolated property, but if some query demands it, we promote the required columns to the top of the queue. This would give us the best of all worlds, but it would require some complexity.
I was thinking about something along these lines. Blocking the query until the queue gets to the column required for it seems unacceptably long. I also don't like telling the analyst that the column is isolating when we don't know yet, because the analyst might cache that in their head and stop using the column altogether. I guess we could also tell the analyst that the status is unknown, and they have to wait, but in that case we either need a mechanism to notify them that the status has been computed or they have to wait an undefined period of time. All in all, the other solutions seem flawed enough that I think the complexity is worth it in this case.
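A rough sketch of the hybrid scheme, written in Python purely for illustration (the class and method names are hypothetical and this is not the cloak's implementation): columns are queued on boot, processed one at a time, and a query can promote the columns it needs to the front.

```python
from collections import deque


class IsolationQueue:
    """Hypothetical queue of columns whose isolated property is still to be computed."""

    def __init__(self, columns):
        self._pending = deque(columns)
        self._results = {}

    def promote(self, columns):
        """Move the columns a waiting query needs to the front, keeping their relative order."""
        wanted = [column for column in columns if column in self._pending]
        for column in wanted:
            self._pending.remove(column)
        self._pending.extendleft(reversed(wanted))

    def step(self, compute):
        """Compute the property for exactly one column; returns that column, or None when idle."""
        if not self._pending:
            return None
        column = self._pending.popleft()
        self._results[column] = compute(column)
        return column

    def lookup(self, column):
        """Return True or False once known, or None while the status is still unknown."""
        return self._results.get(column)
```

Processing a single column per `step` keeps the load on the client's database low, while `promote` bounds how long a query has to wait for the columns it actually uses.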
Finally, the question is how often we should recompute the isolated property. My first thought would be not too often, e.g. once per day.
I agree, maybe even once a week.
Finally, the question is how often we should recompute the isolated property. My first thought would be not too often, e.g. once per day.
I agree, maybe even once a week.
I think the property of a column is unlikely to change very often... i.e. running a query for a single random column per day would probably be fine too.
Although this might be a point where it's worthwhile introducing the notion of whether a data source is static or dynamically changing. If a data source is static, then rechecking makes no sense.
Another question/point:
What if there are multiple cloaks? Having each cloak perform the checks independently seems rather wasteful. But sharing state between the cloaks without passing it through the air seems hard. And passing information through the air comes at the risk of it being tampered with. Unless the cloak signs the data or something? I vote for each cloak checking the columns for now, but there is room for improvement here!
I was thinking about something along these lines.
Cool, I'll start working on this approach then.
I agree, maybe even once a week.
One problem with long intervals is that we need some persistence to keep track. Otherwise, we might end up skipping a beat or computing too frequently. There are a couple of fairly simple ways to do this so it wouldn't be a big problem.
In fact, now I'm starting to think that we should persist the last computed results to improve the restart experience, and avoid recomputing from scratch on restart. A fairly naive and easy way to do it is to store to a local file using :erlang.term_to_binary and its counterpart. We already have a mounted folder (config), where we could store this data.
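As an illustration of the "store to a local file" idea, here is a small Python sketch. It swaps in JSON for :erlang.term_to_binary, so it shows the shape of the approach rather than the proposed implementation; the function names and the cache layout are assumptions.

```python
import json
import os
import tempfile


def save_cache(path, cache):
    """Write the cache atomically so a crash mid-write cannot leave a truncated file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(cache, handle)
    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(tmp_path, path)


def load_cache(path):
    """Return the persisted cache, or an empty one if the file is missing or unreadable."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
```

Storing a timestamp next to each result (for example `{"users.name": {"isolating": true, "computed_at": 1527200000}}`) would also address the point about not skipping or repeating refreshes across restarts.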
A more important issue about the interval is that there is a theoretical window of time when we don't identify an isolating column, which means a potential privacy issue. I'm guessing we're fine with this? But anyway, I think that this is one reason why we should refresh at least once per day.
What if there are multiple cloaks?
This is a good point. I agree that we should ignore it in the first pass, but I think we can make a lightweight CP, since air is the natural leader. Which means that each cloak would ask the air for permission to perform the recomputation, and air makes sure we don't have two of them running. Once the cloak is done, it just broadcasts the update via air, probably signed as you mentioned.
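A sketch of what this could look like, in Python and under several assumptions: `Coordinator` stands in for air, the permission is a simple in-memory claim per data source, and the signature is an HMAC with a key shared among the cloaks but unknown to air. None of these choices come from the thread; it only says "ask the air for permission" and "probably signed".

```python
import hashlib
import hmac
import json


class Coordinator:
    """Stands in for air: allows at most one running recomputation per data source."""

    def __init__(self):
        self._running = {}

    def request(self, data_source, cloak):
        """Grant permission if nobody else holds it; repeated requests by the holder succeed."""
        holder = self._running.setdefault(data_source, cloak)
        return holder == cloak

    def finish(self, data_source, cloak):
        """Release the permission; calls from a cloak that does not hold it are ignored."""
        if self._running.get(data_source) == cloak:
            del self._running[data_source]


def sign(key, payload):
    """HMAC-SHA256 over a canonical JSON encoding of the result."""
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def verify(key, payload, signature):
    """Constant-time check that the payload was not altered in transit."""
    return hmac.compare_digest(sign(key, payload), signature)
```

A receiving cloak would call `verify` before accepting a broadcast result, so tampering on the way through air is detected as long as air does not know the key.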
Although this might be a point where it's worthwhile introducing the notion of whether a data source is static or dynamically changing. If a data source is static, then rechecking makes no sense.
Also a good point. It could be included in the configuration to help administrators reduce the load in some cases.
A more important issue about the interval is that there is a theoretical window of time when we don't identify an isolating column, which means a potential privacy issue. I'm guessing we're fine with this? But anyway, I think that this is one reason why we should refresh at least once per day.
No, we are not fine with potential privacy issues. The options we have (as I see it) are as described above, namely:
Neither solution is ideal... I am more in favor of the second option (but it would be good to make this visible through a status update to the air/query interface so the user understands what is going on)!
Also a good point. It could be included in the configuration to help administrators reduce the load in some cases.
Yes, I think that is a good solution.
Note: we are likely to want to introduce some notion of HA for the air instances in the future for our enterprise customers. Exactly how this works (master-slave or master-master or whatever) and what guarantees we are to give is still up for grabs. <---- this was related to the idea of air synchronizing the nodes
No we are not fine with potential privacy issues.
I think @sasa1977 meant in case the dataset changes in such a way that a column becomes isolating, but our check is still some time away
I think @sasa1977 meant in case the dataset changes in such a way that a column becomes isolating, but our check is still some time away
Yes, that is the case I was thinking of.
I think @sasa1977 meant in case the dataset changes in such a way that a column becomes isolating, but our check is still some time away
Ah, I see. That I am fine with 😄
Are you all fine if I start conservatively and compute one column at a time?
Are you all fine if I start conservatively and compute one column at a time?
Seems fine to me
Are you all fine if I start conservatively and compute one column at a time?
In fact that is my preferred option!
I updated the description and added the remaining things to do. @obrok if you start working on any of them, send me a note, and once you're done, please update the description.
Sure thing
Added one item to the list.
@sasa1977 I'm starting to work on fix isolation crashes on other data sources
FYI, I'm working on periodic update.
With some help from @cristianberneanu I fixed the problems for SQLServer and MySQL. I also included the isolator cache in compliance tests. One problem remains for Mongo, @cristianberneanu promised to take a look at it tomorrow, and if either he or I find a solution I'll be able to finally send a PR with all that.
improve efficiency in virtual tables (avoid recomputing columns from different tables)
I don't think we should do that, given that virtual tables are arbitrary queries. A particular virtual table might be very different from the base table in terms of data distribution.
I don't think we should do that, given that virtual tables are arbitrary queries. A particular virtual table might be very different from the base table in terms of data distribution.
Good point, I'll remove it from the list.
I'll take this one:
update data source state in air with whether the data source is ready or not
@sasa1977 please make sure this state is also available through the data source HTTP API endpoint. We want to performance test the system ahead of release, and our performance test suite will need to know that the host is ready and has cached the columns ahead of running the performance tests.
@sebastian I didn't explicitly add the support for the data source state through the API, because there was no need. Instead I extended the performance test to wait until the data source is in the online status (which also means that columns have been cached).
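For reference, the "wait until online" step in a test harness could be sketched like this in Python. The `fetch_status` callable is a placeholder for however the harness reads the data source status, and the timeout and interval values are arbitrary; only the "online" status name comes from this thread.

```python
import time


def wait_until_online(fetch_status, timeout=600.0, interval=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until fetch_status() returns "online"; give up once `timeout` seconds have passed.

    `sleep` and `clock` are injectable so the loop can be tested without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if fetch_status() == "online":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The performance run would only start when this returns True, which by the reasoning above also means the columns have been cached.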
With that, I believe that this task is done, and I'll close it.
In order to protect against the equation attack we need to support the notion of whether a column is isolating or not. This process is described in the following issue: https://github.com/Aircloak/aircloak/issues/2485
@obrok has implemented an initial version that performs the check on a column live as part of the query execution. This is sufficient for the check, but insufficient for a production deployment.
On data source changes we should check the columns provided by the data source individually for whether they are isolating or not. This information we subsequently cache such that we do not need to check them again at query time.
Periodically we should refresh this cache, but we can then do it on a column-by-column basis. The properties of a column are unlikely to change much.
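A column-by-column refresh could pick its next candidate roughly as follows. This is a Python sketch under assumptions: the function name, the timestamp bookkeeping, and the `static` flag (taken from the static versus dynamic data source idea raised in the discussion) are all hypothetical.

```python
def next_column_to_refresh(computed_at, now, max_age, static=False):
    """Pick the column whose last check is oldest, if it is older than `max_age` seconds.

    computed_at maps a column name to the timestamp (in seconds) of its last check.
    Returns None when nothing is due, or when the data source is marked static.
    """
    if static or not computed_at:
        return None
    column = min(computed_at, key=computed_at.get)
    if now - computed_at[column] >= max_age:
        return column
    return None
```

Calling this on a timer and rechecking only the returned column spreads the refresh load over time instead of recomputing the whole cache at once.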
@obrok has developed an API as part of this pull-request that provides the ability to detect if a column is isolating or not. He will use it in his ongoing work on the solution for the equation attack. The caching mechanism can be developed in parallel.
Things to take into account:
[...] air with a pending state during these operations. When all caches are warm then we can make the data source fully online
[...] cloaks that serve a data source, then those that are fully online should be chosen preferentially
The basic implementation has been done in #2753. Here are the remaining things to do:
improve efficiency with multiple cloaks (extracted to 18.4 as #2791)
update data source state in air with whether the data source is ready or not