Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform
2 stars 0 forks source link

Support for `VIEW`s #468

Closed sebastian closed 7 years ago

sebastian commented 8 years ago

Views in the Aircloak sense or not materialised and do not at all live in the database.

We allow analysts to pre-construct select statements (with the same semantic as any sub-query we would allow), and save them as named views. These views are per user, can be referenced like any other table would be, and inserted into the query at runtime.

For example, the following AQL could be saved as critical_patients:

SELECT uid, length(firstname) as name_length, age
  FROM patients INNER JOIN medical ON patients.id = medical.patient_id
WHERE firstname LIKE '%dam'

and then used in another query like this:

SELECT name_length / age as math, count(*) FROM critical_patients GROUP BY math

Depends on the native subquery support issue.


https://trello.com/c/oUdK0l5K/7159-support-for-view-s

sebastian commented 8 years ago

Let's put this one on hold for a little while, @sasa1977. I think we have other more pressing issues first, to get ready to the point where we are happy to let people use our system, rather than adding more features.

I'll un-assign you for now. Hope you don't mind.

sebastian commented 8 years ago

Not important. I'll close this for now.

sebastian commented 7 years ago

In fact this has turned out to be more useful than I thought. When wanting to segment a user base it's incredibly convenient!

For example, take the Games and Players database as an example. Say we wanted to write queries that only deal with the "winners", then we could count them like this:

SELECT count(*) FROM (
  SELECT t1.player as uid, (wins - losses) as wins
  FROM (
    SELECT player, count(*) as wins
    FROM games
    WHERE outcome = 'W'
    GROUP BY player
  ) t1 INNER JOIN (
    SELECT player, count(*) as losses
    FROM games
    WHERE outcome = 'L'
    GROUP BY player
  ) t2 ON t1.player = t2.player
  WHERE wins >= 0 and wins < 100000000000
  GROUP BY uid
) winners

Being able to save the segmentation section as a view would be incredibly useful, and would allows us to write the query as:

SELECT count(*) FROM winners

We could furthermore then combine multiple views to segment even further:

SELECT count(*)
FROM 
  winners INNER JOIN men ON winners.uid = men.uid
  INNER JOIN own_a_car ON men.uid = own_a_car.uid

We could:

sebastian commented 7 years ago

Maybe it's best to create an own subpage for views?

Views are not shared across users for the time being.

sasa1977 commented 7 years ago

Some comments here:

when a query uses a view, we can replace the mention of the view with the stored SQL

We'll certainly do this, but I think the only place we can currently do it is in Cloak. If we want to do it in Air, we'll need to parse the query there.

cloak's don't need to know about the views at all

Given my previous comment, cloaks will need to know about views, because cloak is the only one who can resolve the view reference properly. Since views are tied to a particular user, it means that as the part of the query request, the air would have to send all user defined views to the cloak. Cloak would then need to compile them as well, and could use them as allowed table references.

prior to saving we should run the query (as a subquery to get the correct validations in the compiler) and check that it works

Running the query might take a long time. It would be better to expand the interface of the cloak so that air can ask it whether the given view definition is valid, without actually running the query.

All of this actually makes me wonder if we should consider a different approach. Currently, cloak accepts query strings. However, cloak was never intended to be used by humans. So an alternative would be to move lexer and parser to the air. I believe we have most of the necessary data in the air, which would allow us to properly compile the query there. The cloak would then accept a compiled query and execute it.

The story is not so simple however, since cloak would still need to verify some things. Whatever is forbidden because of privacy aspects, needs to be double checked by the cloak. Otherwise someone could circumvent air and abuse cloak to compromise privacy. A fairly simple solution would be to do another compiler pass in the cloak. In this pass, no transformation would happen, but the same set of checks would run, thus ensuring that the query is valid.

The benefits of this approach are:

The main downside is that this requires a solid amount of work. Another issue is that whenever some additional feature is added to the cloak (e.g. additional function, or a new keyword), we need to redeploy cloak as well. Consequently, this increases coupling between the cloak and the air. Air and all of its cloaks need to be on the same version, otherwise some strange effects might occur.

So the decision is not simple. It seems to me that the added coupling between air and cloak is a serious issue. Therefore, I'd still advise to keep query parsing/compiling in the cloak. However, we should keep this idea in the back of our mind. The proposed steps are then:

ghost commented 7 years ago

cloak's don't need to know about the views at all

Given my previous comment, cloaks will need to know about views, because cloak is the
only one who can resolve the view reference properly. Since views are tied to a particular
user, it means that as the part of the query request, the air would have to send all user
defined views to the cloak. Cloak would then need to compile them as well, and could use
them as allowed table references.

You are absolutely right. The cloak needs to inline the view, not air.

prior to saving we should run the query (as a subquery to get the correct validations  in the compiler) and check that it works

Running the query might take a long time. It would be better to expand the interface of
the cloak so that air can ask it whether the given view definition is valid, without actually
running the query.

I did in fact just mean the validation part. I.e. query compilation, not running.

So the decision is not simple. It seems to me that the added coupling between air and cloak  is a serious issue. Therefore, I'd still advise to keep query parsing/compiling in the
cloak. However, we should keep this idea in the back of our mind. The proposed steps are
then:

  • sending all user views together with the query string
  • resolving views in the cloak
  • support view validation request-response message, so that air can ask the cloak to
    validate the view before saving it

It certainly is not an easy tradeof.  We would either have to require the air and cloak run with compatible versions, or we would have to introduce a strict versioning scheme in the protocol, which would then require that the air is able to provide queries for older cloak’s too. Neither is nice.

I approve of your current plan of action (as listed above) as the right option for now.

Thanks, Sebastian

sasa1977 commented 7 years ago

add a button in the query interface allowing a query to be saved as a view

There is one subtle issue here. Queries written by analysts usually won't contain uid. However, our view needs to select the uid column. We can implicitly add this column during the compilation. However, it won't be clear to the analyst what is the name of the uid column.

One solution for this is to also include all views in SHOW TABLES and support SHOW COLUMNS for views. For example, let's say that the analyst issues a query SELECT name FROM users, and then saves this query as the view names. Invoking SHOW COLUMNS FROM names would return two columns: uid, and name. This seems like a fairly consistent user experience. What do you think?

sebastian commented 7 years ago

It needs to be made very clear that the semantics of views are the exact same as those of subqueries. Hence a uid-column has to be explicitly selected.

Since a view is based on a real table, the analyst knows which column is the UID column that needs to be selected. Hence we know it for views too. And as a result we can also make views based on views!

I would very much like views to appear when SHOW tables and SHOW columns FROM <view> are issued, but more than that I would also like them to be shown in the side-panel amongst the tables.