Query changes - Githubissues

sebastian commented 8 years ago

Dear @Aircloak/developers

We have discussed introducing support for non-Turing Machine queries for a long time now. The benefits of non-TM queries include faster executions, ease of use (if we implement them well), and as it turns out, far better privacy properties.

It is the last of these that now forces our hand when it comes to the timing of this introduction. We now know that the privacy of our current Lua sandbox can be broken quite easily. Paul has a list of attacks that give the real results with very high confidence.

These problems arise when the user is malicious, and the following two factors are in place:

the user is allowed to freely write their own tasks (Turing complete – i.e. Lua sandbox)
the user is allowed to specify exact boundaries (for example ages > X)
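
The attacks on Paul's list are not described in this thread. As a generic illustration of why exact boundaries matter, here is a minimal sketch of a textbook differencing attack: two counts whose boundaries differ by one isolate the users sitting exactly on the boundary. The data and the function names are hypothetical, and a real system that adds noise to answers would need more queries than this to attack.

```python
# Hypothetical toy data: (user_id, age). Not taken from any real system.
USERS = [("u1", 34), ("u2", 41), ("u3", 57), ("u4", 29)]


def count_older_than(boundary: int) -> int:
    """Exact count of users whose age is strictly above the boundary."""
    return sum(1 for _, age in USERS if age > boundary)


def users_with_exact_age(age: int) -> int:
    """Differencing: subtract two counts whose boundaries differ by one."""
    return count_older_than(age - 1) - count_older_than(age)
```

Neither count singles anyone out by itself, but their difference reveals how many users have exactly the given age.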

Strong privacy protection being the core of our offering, we cannot continue selling our Lua sandbox with a straight face. Or rather, we could sell it to people who are allowed to access the data in the first place, but that defeats the purpose of our system.

Paul has been working on new anonymization for a non-TM query approach in parallel, and the feedback he has gotten from CNIL, the French government-level privacy authority, is excellent. They normally seem to be quite a hard nut to crack.

These past months have seen us undergo a lot of changes. In hindsight they have all been, in my opinion, changes for the better. I believe this change is going to be one of those too. Be that as it may, there is no hiding that this is a very disruptive change. For that I apologize. This change leaves us, again, without a working system, but it is a change we have to make.

The immediate upside is that the vast majority of people we speak to prefer a more traditional query interface (SQL) to the TM one. Now we can offer them this. By supporting a subset of SQL and controlling the server-side implementation of it, we get around the privacy problems that arise from people being able to freely write their own tasks. The flipside is that there will be queries that we cannot directly support out of the gate, and that the analysts will not be able to implement themselves until we come up with a solution for them. The second privacy problem mentioned above (that of user-determined parameter boundaries) is something Paul has addressed in his work. It requires a more complex implementation, but has good privacy properties. Thankfully this is something we can introduce over time. As the server-side implementation is under our control, we can improve the query performance and the quality of anonymization without at all affecting the way users interact with our system.

What changes

The important questions are: what remains the same, and what changes have to be made to our system.

What remains the same

Thankfully most of our current work can be immediately reused. The air to cloak communication remains the same. So do the task execution progress notifications. We also still need the ability to perform post-processing of results, save them to the database, and likewise have to load data from the database inside the cloaks for processing.

What changes

The query interface exposed to the analyst changes. Since the way one approaches querying when using SQL is different from when one writes complex tasks in Lua, we have to reassess how this is to be done best. If nothing else, we should start with something as simple as possible. The imperative is to get to a working system as soon as we can.

The query execution changes too. When we are the ones implementing the query mechanics inside the cloaks, the Lua sandbox as it stands today ends up a liability rather than an aid.

Steps going forward

We need to lock down the subset of SQL that we initially want to support, and be able to parse it (I have given this some thought that can be found in this separate document. Please read and comment)
Once we have defined the subset needed, we can create the simplest possible implementation of this in the cloak. The initial functionality we provide will very much be a copy of how we solved the issues in the Lua sandboxes today
As soon as this stands, we should aggressively push this out to users to validate that it provides value, and collect further general feedback
The beauty of having a fixed set of queries that we support is that we can improve the anonymization and performance over time, without anything changing for the customers
Once we are confident in our choice of SQL, we start implementing Paul’s more advanced anonymization, step by step
Then we get into an iterative phase of adding necessary SQL support as needed by customers, and improving performance where it is needed the most
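
As a minimal sketch of the first step, locking down a subset and rejecting everything outside it could look like the following. The subset shown here (a single aggregate over one column with an optional integer filter) is a hypothetical stand-in; the subset actually proposed lives in the separate document mentioned above. The names `QUERY_PATTERN`, `ParsedQuery` and `parse_query` are illustrative only.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical subset:
#   SELECT <COUNT|AVG|SUM>(<column>) FROM <table> [WHERE <column> <op> <integer>]
QUERY_PATTERN = re.compile(
    r"""^\s*SELECT\s+(?P<agg>COUNT|AVG|SUM)\s*\(\s*(?P<column>\w+)\s*\)
        \s+FROM\s+(?P<table>\w+)
        (?:\s+WHERE\s+(?P<filter_column>\w+)\s*(?P<op>>=|<=|=|<|>)\s*(?P<value>-?\d+))?
        \s*;?\s*$""",
    re.IGNORECASE | re.VERBOSE,
)


class ParsedQuery(NamedTuple):
    aggregate: str
    column: str
    table: str
    where: Optional[tuple]  # (column, operator, value), or None without a filter


def parse_query(sql: str) -> ParsedQuery:
    """Parse a query in the supported subset, rejecting anything else."""
    match = QUERY_PATTERN.match(sql)
    if match is None:
        raise ValueError("query is outside the supported SQL subset")
    where = None
    if match.group("filter_column") is not None:
        where = (
            match.group("filter_column"),
            match.group("op"),
            int(match.group("value")),
        )
    return ParsedQuery(
        match.group("agg").upper(),
        match.group("column"),
        match.group("table"),
        where,
    )
```

A regular expression only stretches this far; a growing subset would presumably call for a proper grammar, but the allowlist principle stays the same.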
Two tier approach

I envision a two tier approach to our implementation. We are confident that we can load data out row by row like we are doing it today for the sandbox. The first implementation of any major functions we want to support (like generating averages, or histograms) should be on this row-by-row data inside the cloak. This is what I call the first tier, or cloak-support. It isn’t the most efficient way of doing things, but it allows us to add support for new backends by doing nothing more than adding the ability to load data per user and reformatting it in a consistent way for the cloak.
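
To make the first tier concrete, here is a rough sketch under assumed names: a backend only has to hand over rows grouped per user, and the aggregate itself is computed inside the cloak. `Backend`, `InMemoryBackend` and `cloak_average` are hypothetical. Letting each user contribute a single value is an illustrative choice, and anonymization is left out entirely.

```python
from statistics import mean
from typing import Dict, Iterable, List, Protocol, Tuple

Row = Dict[str, float]


class Backend(Protocol):
    """The only thing a new backend has to provide: rows grouped per user."""

    def rows_per_user(self, table: str) -> Iterable[Tuple[str, List[Row]]]:
        ...


class InMemoryBackend:
    """Stand-in backend; a real adapter would read from an actual database."""

    def __init__(self, tables):
        self._tables = tables

    def rows_per_user(self, table):
        grouped = {}
        for user_id, row in self._tables[table]:
            grouped.setdefault(user_id, []).append(row)
        return grouped.items()


def cloak_average(backend: Backend, table: str, column: str) -> float:
    """Tier-one average, computed in the cloak on row-by-row data.

    Each user contributes one value (the mean of their own rows); this is
    an illustrative choice, not a description of the real aggregator.
    """
    per_user = [
        mean(row[column] for row in rows)
        for _, rows in backend.rows_per_user(table)
    ]
    return mean(per_user)
```

Supporting another database then means writing one more class with a `rows_per_user` method, while `cloak_average` stays untouched.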

The second tier, or native support, is offloading queries to the database servers themselves. This should only be done as we understand which queries and which backends are the most popular, and can be done step by step where it makes sense. Which databases and which queries we do this for depends on who our customers become, what their most pressing needs are, and who pays the most.

Again, I am very aware that this is a disruptive change. I am confident that the changes are for the better and in fact necessary. But we are reliant on your thoughts and input when deciding how best to do this.

https://trello.com/c/HRlUI8EJ/6715-query-changes

cristianberneanu commented 8 years ago

So we are going to completely remove the Lua sandbox from the cloak, right?

The query interface exposed to the analyst changes

Will the users input the query directly or will they have to assemble the query using a visual editor? Will we accept SQL queries directly from another application?

I envision a two tier approach to our implementation.

There are actually more options here:

We send the translated query directly to the database. Pros: no more task coordinator, very fast. Cons: no task progress report, database dependent.
We send the per-user translated query directly to the database. Pros: simpler task coordinator, fast. Cons: database dependent.
We read data row-by-row and we execute the query ourselves. Pros: database agnostic, flexible. Cons: slow, complex implementation.
We keep the current code and we translate SQL queries to Lua code (in air or cloak). Pros: database agnostic, flexible, less code to write. Cons: slow, complex implementation.

sebastian commented 8 years ago

So we are going to completely remove the Lua sandbox from the cloak, right?

Yes, the Lua sandbox is completely removed.

Will the users input the query directly or will they have to assemble the query using a visual editor? Will we accept SQL queries directly from another application?

I suggest accepting SQL (rather than dropdown/visual editor), and I would really like to get us to the point where the SQL comes from another tool. It allows the cloak to fit very transparently into existing systems, which really is where it should live.

It does require that our system's understanding of SQL is very good. I think most things in SQL can be done quite well in a cloak. The problem arises where the results aren't exactly what the user would expect due to the answers being a subset of the real data. We have to see where/if the abstraction breaks down.

1., 2., 3., 4.

I was actually thinking a combination of 1. and 3. As in we craft a single query that runs against the database, and puts as much work there as possible. But then for example the anonymization and our custom aggregators all happen in the cloak. Pros: very fast in most cases, mostly database agnostic, single implementation of anonymization functionality. Cons: we still need to move a significant amount of data – but at least only the data explicitly needed.

The benefit of 1. + 3. is that the core logic remains in the cloak and can therefore run against most backends, since the only requirement is that we get the data out. We can then step by step move the most data aggressive and commonly used functions into the database itself as custom aggregators.

We can still do progress reports in the phase where we compute across and aggregate the user data.
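
A rough sketch of the 1. + 3. combination, using SQLite purely as a stand-in backend: one query pushes the filtering and per-user pre-aggregation into the database, and the anonymizing step runs in the cloak on the much smaller result. The table layout, the function names and the Gaussian noise are all assumptions made for illustration; the noise in particular is a placeholder and not our actual anonymization.

```python
import random
import sqlite3


def fetch_per_user_totals(conn, min_amount):
    """Single query pushed to the database: filter and pre-aggregate per user."""
    cursor = conn.execute(
        "SELECT user_id, SUM(amount) FROM purchases "
        "WHERE amount >= ? GROUP BY user_id",
        (min_amount,),
    )
    return dict(cursor.fetchall())


def anonymized_user_count(per_user, rng, noise_sd=1.0):
    """Cloak-side step: count distinct users, then perturb the count.

    Gaussian noise is a placeholder here, not the real anonymization.
    """
    return max(0, round(len(per_user) + rng.gauss(0.0, noise_sd)))


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (user_id TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO purchases VALUES (?, ?)",
        [("u1", 5), ("u1", 20), ("u2", 50), ("u3", 2)],
    )
    totals = fetch_per_user_totals(conn, 5)
    conn.close()
    return totals, anonymized_user_count(totals, random.Random(0))
```

Only the rows that pass the filter leave the database, and they arrive already grouped per user, which is the "only the data explicitly needed" property mentioned above.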

cristianberneanu commented 8 years ago

I would really like to get us to the point where the SQL comes from another tool

In this case, is there any point in sending results to air anymore? It seems like the current design of processing results will break down.

sebastian commented 8 years ago

I would really like to get us to the point where the SQL comes from another tool

In this case, is there any point in sending results to air anymore? It seems like the current design of processing results will break down.

You mean whether the air is needed, or whether we can allow people to query the cloak directly? I think querying the cloak directly can probably be allowed. It has no implications on privacy. Allowing post-processing to happen on the cloak takes load off of our systems too.

Running everything through the air allows us to intercept queries and record what is being done to better improve our system, but the same can be achieved if people are directly querying the cloaks too.

Hehe, maybe this change will in the end result in there only being a unified "aircloak" :D

cristianberneanu commented 8 years ago

OK, so the 'cloak' in "It allows the cloak to fit very transparently into existing systems, which really is where it should live." actually refers to the entire system.

sebastian commented 8 years ago

OK, so the 'cloak' in "It allows the cloak to fit very transparently into existing systems, which really is where it should live." actually refers to the entire system.

Yes! An application just uses the Aircloak SQL endpoint and that's it.

sebastian commented 8 years ago

If we unify air and cloak in this process, and only deploy a single aircloak in front of the database, then we also get rid of the problems of having to support multiple distinct cloak versions in our air, since the air is tightly coupled and tied to the cloak.

obrok commented 8 years ago

We sort of lose the ability to connect multiple different databases on different sites in that case

sebastian commented 8 years ago

We sort of lose the ability to connect multiple different databases on different sites in that case

In terms of querying multiple databases nothing changes. Before, as also now, the same cloak instance would have to be connected to all the backends that should be queried. In the old setup no un-anonymized data ever hit the air, so the air couldn't combine results from distinct databases either.

I think this might in fact be less of an issue. Since we start supporting SQL queries, the query code is likely to live in the application that interprets the results, rather than in our interface. As a result, the need for a rich query interface in our tool is greatly diminished. If you as an analyst want to run the same query over multiple backends, you just point your local query client at the backends in question, and that's it.

obrok commented 8 years ago

What I meant is currently you can have a multiple-tenant air with every tenant having their own cloak for example, deployed in various physical locations.

sasa1977 commented 8 years ago

A nice benefit of the split system is that we are able to improve some parts of the system without deploying to clients. Examples include UI extensions and REST API. In fact, if we did query translation in the Air, we could also tweak the query language or support additional backends without needing to redeploy the cloak.

sebastian commented 8 years ago

What I meant is currently you can have a multiple-tenant air with every tenant having their own cloak for example, deployed in various physical locations.

True, but maybe what we really need is a zero-tenant air? As in, a minimal web interface for configuration and managing users is included in the cloak itself, and that's it...

A nice benefit of the split system is that we are able to improve some parts of the system without deploying to clients. Examples include UI extensions and REST API.

This is true! The question though is whether that is something we can still do later on top of having completely standalone cloaks? As in, the cloaks can be queried individually without an air, and then we can provide a richer experience through the insights platform that we can upgrade independently. Then the insights platform can become more of a traditional analytics platform for example. BUT if so, then that is definitely step 2 or step 3.

In fact, if we did query translation in the Air, we could also tweak the query language or support additional backends without needing to redeploy the cloak.

Yes, we could tweak the query language, but I am not sure how we would go about supporting additional backends?

sebastian commented 8 years ago

And think about all the problems that vanish if we combine them. There is no more issue about which entity knows what about the query execution state. We don't need to worry about whether or not an answer from the cloak reached the air, or whether a cloak did in fact receive a request from the air...

sasa1977 commented 8 years ago

And think about all the problems that vanish if we combine them. There is no more issue about which entity knows what about the query execution state. We don't need to worry about whether or not an answer from the cloak reached the air, or whether a cloak did in fact receive a request from the air...

This is an excellent point indeed! We don't have a distributed system anymore, so a lot of stuff can go away, especially in the Air. In fact, I wonder if we would need CoreOS at all?

sebastian commented 8 years ago

We don't have a distributed system anymore, so a lot of stuff can go away, especially in the Air.

👍

In fact, I wonder if we would need CoreOS at all?

:( good question. We certainly need CoreOS less. For Telefonica we need some central management platform where users can be managed and given access to individual datasets. But that's going to be a very reduced version of what we had planned the insights platform to look like! In fact it could probably be nothing more than some sort of proxy directly forwarding requests to the relevant backends...

sasa1977 commented 8 years ago

Yeah, I wonder if CoreOS is the best or simplest solution for the job. To be honest, I was never quite impressed with it. While it offers some nice goodies related to clustering, it also locks us into its technology stack. Moreover, it presents problems, because now we need to work with the Docker version supported by CoreOS.

I wonder if we should just use plain Docker to run our system. Then it's up to customers' admins to run the container somewhere.

Also, as I argued back when we discussed CoreOS, I think we should rely as much as possible on Erlang for clustering, rather than using the magical layer of CoreOS for that. We could still use etcd (or some alternative k-v) for strong consistency, though even that is supposedly solved with riak_ensemble.

But it seems to me we're currently not even aiming to cluster cloaks, and if Air is to be integrated in the cloak, then I see little need for CoreOS. Such a change would be a good opportunity to simplify the deployment :-)

sebastian commented 8 years ago

Then it's up to customers' admins to run the container somewhere.

For the cloak I certainly agree! There it should just be docker and that's it.

Such a change would be a good opportunity to simplify the deployment :-)

Yes – great chance to simplify even more! Agreed. Now, the thing is, we don't really need to deploy anything at the moment at all. Since with aircloak all in one component, we can just as well run everything locally for the time being...

cristianberneanu commented 8 years ago

I would like to raise here a tangential topic that we should keep in mind for the future:

Seems to me we have a tendency to invest a lot in the implementation (to the point of over-engineering) of systems with designs that are not locked down. This is the second time we will throw away a lot of code (a good thing) that we spent a lot of time writing (a not-so-good thing).

Having a fluid design that has to be market tested is normal in the early stages of a product. I suggest that in the future we should strive to rather fall on the side of doing the fast and dumb thing that gets us closer to having a functional prototype than to fall on the side of properly engineering a component.

sebastian commented 8 years ago

@cristianberneanu yes, I absolutely agree! There is so much we have to learn here still. Therefore, let's rather move faster and build something imperfect and learn. Throwing away experiments is significantly easier mentally, and way cheaper, than throwing away a perfectly built shack.

Let's aim to get a prototype SQL implementation up and running within the next two weeks, learn from it, expand it, and improve!

sasa1977 commented 8 years ago

Seems to me we have a tendency to invest a lot in the implementation (to the point of over-engineering) of systems with designs that are not locked down.

I wouldn't exactly say this was the case. We were committed for two years (and more before I arrived) to a fairly well specified goal. Things have indeed changed radically at the beginning of the year, but before that happened, the target was fairly stable. Also, a functional prototype already existed when I arrived, and was demoed quite extensively. It helped us showcase our work, which in turn allowed us to better understand the needs of potential customers (and perhaps it helped them to understand what they want).

I suggest that in the future we should strive to rather fall on the side of doing the fast and dumb thing that gets us closer to having a functional prototype

We should certainly do as little as needed, I basically agree with you. But as I said the other day, I don't think simplicity is as easy as producing less LOC and quickly hacking something. Quick and hacky decisions made early in the game can haunt us for a long time, and can be hard to change later. I've seen it happen too many times.

I recently saw a nice quote: Err on the side of doing fewer things, but do them well. And this is what I think we should strive for.

cristianberneanu commented 8 years ago

"We now know that the privacy of our current Lua sandbox can be broken quite easily."

I should have probably asked this sooner, but how confident are you about this statement?

Is the attack scenario something likely to be encountered in practice?

sebastian commented 8 years ago

I should have probably asked this sooner, but how confident are you about this statement: "We now know that the privacy of our current Lua sandbox can be broken quite easily."?

Is the attack scenario something likely to be encountered in practice?

Haha, good point :) Paul considers it low to medium effort to break. I.e. the Lua interface isn't something we could in good faith open to any real external third party. Especially since we have no clue how, or even if, it could be fixed (or rather, Paul has tried unsuccessfully to fix it for a long time).

Any use case where the data owner isn't also the analyst would be very questionable. And any and all "open access" systems using our interface would be off limits.

Having people use Lua temporarily, only to then shift them over to SQL, is disruptive and counterproductive.

yoid2000 commented 8 years ago

Hi all. I've been "off the grid" for a few days, but back now.

I want to mention that we did not make the decision to move to the new approach until after a substantial fast-prototyping effort (based on a substantial design that CNIL, the French data protection authority, has informally approved). You can and should play with the prototype. It is pretty bare bones, and at the moment works with only a subset of SQL, but its anonymity properties are already quite good. There is an introductory tutorial for it at https://github.com/Aircloak/newCloakPrototype/wiki/Analyst-Interface-to-the-Cloak, which runs on a synthetic database. Currently it works with three flavors of SQL: Postgres, MySQL, and SQL Server. You can find the code in the newCloakPrototype repository.

The syntax I use for the queries is a combination of SQL and an SQL-like query language called AQL (the A can stand for Anonymizing, Aggregation, or Aircloak, take your pick). The details of how to compose the queries are up for discussion...this is my best shot, but @sebastian has some different ideas.

I've also used the prototype for playing around with a real medical database. I need to update the documentation of that exercise, but I'll do that shortly.

By the way, in the next week or so I'll update the prototype so that virtually all restrictions on SQL are removed.

sebastian commented 8 years ago

Ok, we should avoid confusion here! We are already extensively using AQL for the dialect of SQL we are currently implementing, which is very distinct from the one you are using @yoid2000.

yoid2000 commented 8 years ago

Indeed. How about we refer to them as IAQL and SAQL, for "integrated" (what Seb has documented) and "segregated" (what is in the current prototype)?

obrok commented 8 years ago

I think we should call the production implementation AQL - that will be a customer-facing name as well. Maybe the prototype implementation could be called PAQL?

sebastian commented 8 years ago

AQL and PAQL have my vote 👍

yoid2000 commented 8 years ago

Yes, AQL for customers is good. We just need distinct internal names until we decide how we're going to implement it. I think iAQL and sAQL are good, since they are descriptive of the design philosophy underpinning the approaches. I prefer that we use iAQL and sAQL, rather than AQL and PAQL, because I don't want to suggest that sAQL (PAQL) is somehow a temporary thing and iAQL (AQL) is the expected final outcome. We and our customers need to evaluate the two.

Aircloak / aircloak

Query changes #133
