Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform

supporting ML scenarios through Aircloak #2481

Open sasa1977 opened 6 years ago

sasa1977 commented 6 years ago

This is a summary of my current thoughts on supporting machine learning through Aircloak. I should note that I'm not very familiar with ML, so I might be missing some important points. Feel free to correct me if I made some incorrect statements :-)

The overall goal is to improve ML scenarios on the data obtained through Aircloak. In this discussion I'll focus only on supervised ML, since that's all I've looked into so far.

The simplified high-level story of supervised learning is that the machine builds a prediction model from some input dataset. In the dataset, we have one or more input fields and one or more output fields. The machine crunches the dataset and spits out a model which can be used to predict the output values for any combination of input values. For example, in the input dataset we could have a bunch of boolean fields such as fever, headache, ..., and the output could be the diagnosis field (e.g. flu, diabetes, ...).
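As a toy illustration of this story (entirely made-up data, and a deliberately naive "model" that just memorizes the most common output per input combination — none of the generalization a real ML algorithm provides):

```python
from collections import Counter, defaultdict

# Toy training set: (fever, headache) -> diagnosis.
# All field names and values are invented for illustration.
rows = [
    ((True, True), "flu"),
    ((True, True), "flu"),
    ((True, False), "flu"),
    ((False, True), "migraine"),
    ((False, True), "migraine"),
]

def train(rows):
    """The simplest possible 'model': for each input combination,
    remember the most common output value seen in the dataset."""
    by_input = defaultdict(Counter)
    for inputs, output in rows:
        by_input[inputs][output] += 1
    return {inputs: counts.most_common(1)[0][0]
            for inputs, counts in by_input.items()}

model = train(rows)
print(model[(True, True)])   # flu
print(model[(False, True)])  # migraine
```

A real algorithm would also predict for input combinations it has never seen; this lookup table cannot, which is exactly what the model-building step adds.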

There are of course a bunch of algorithms for this, and there are various off-the-shelf solutions. However, the problem with such solutions is that we need to fetch all the data through Aircloak, and this could likely lead to everything being anonymized away. Considering the example above, let's say we have a table called `patients` and we do `select * from patients`. If there are some highly unique fields in the table (for example SSN), we won't get any meaningful results to build a reliable prediction model.

A simple fix is possible if an analyst is familiar with the data: the analyst could select only the relevant fields (such as the symptom fields and the diagnosis field). However, in many cases this might not be a viable option. Even when it is possible to determine the relevant input fields, we'd still lose some fidelity in the input set, because analysts need to select all the input fields at once. I'm currently not sure if this is necessarily a bad thing: from the standpoint of machine learning, we don't want outlier combinations to affect the model, so perhaps the fact that we're masking infrequent combinations is in fact a good thing.

Either way it's worth considering whether we can somehow improve the reliability of the prediction model. I'll discuss a couple of ideas here.

Provide our own external Aircloak-aware ML implementations

In this proposal, we implement ML algorithms ourselves, trying to take advantage of how data is best retrieved from Aircloak. From what I can tell, in most supervised ML implementations, the algorithm repeatedly tries to figure out which field leads to the "best" data split. For example, if we're trying to figure out the symptoms for a flu, the `fever` field might be a much better splitter than `toothache`.

In order to determine which field is the better splitter, we need to correlate each input field with the output field. Therefore, instead of doing `select fever, toothache, diagnosis`, we could issue two selects: `select fever, diagnosis` and `select toothache, diagnosis`, which might give us more data to work with.
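As a rough sketch of how those pairwise results could be scored (data and counts invented; the split quality measure here is Gini impurity, one common choice in decision-tree learning):

```python
from collections import Counter

# Hypothetical anonymized results of `select fever, diagnosis` and
# `select toothache, diagnosis`: lists of (field_value, diagnosis, count)
# rows, as a query through Aircloak might return them. Numbers are made up.
fever_counts = [(True, "flu", 80), (True, "other", 20),
                (False, "flu", 10), (False, "other", 90)]
toothache_counts = [(True, "flu", 45), (True, "other", 50),
                    (False, "flu", 45), (False, "other", 60)]

def gini(counts):
    """Gini impurity of a Counter of label -> count."""
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def split_impurity(rows):
    """Weighted Gini impurity after splitting on the field:
    lower means the field separates the diagnoses better."""
    groups = {}
    for value, label, count in rows:
        groups.setdefault(value, Counter())[label] += count
    total = sum(sum(g.values()) for g in groups.values())
    return sum(sum(g.values()) / total * gini(g) for g in groups.values())

# fever should come out as the better (lower-impurity) splitter
print(split_impurity(fever_counts) < split_impurity(toothache_counts))  # True
```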

On the flip side, I see a bunch of challenges here:

- Some algorithms, such as Tree bagging and Random forests, would need additional support from the cloak, because they need to repeatedly analyze the same sample of the input set.
- It's not clear to me whether this Aircloak-specific querying would actually confuse the algorithm and lead to a less reliable prediction model.
- The proposed implementation above would likely be quite slow, as we repeatedly have to refetch and anonymize the data, then ship it through Air, process the result, and then rinse & repeat.

So considering these issues, my current impression is that this is quite a huge undertaking, with a very uncertain outcome. While I have no doubt that we can make it work somehow, I'm worried about the usefulness of the produced solution. Given that Aircloak itself is slow, I have serious concerns about whether a solution that repeatedly issues a bunch of queries to Aircloak will be able to produce anything remotely usable within acceptable time.

Considering that we also need to reinvent the whole ML implementation from scratch, which would likely be less mature, less feature-rich, less reliable, and less performant (even without considering the Aircloak performance), I'm not convinced that this is the way to go.

Proxying off-the-shelf ML implementations in the Cloak

In this approach we would somehow allow analysts to produce the prediction model directly in the cloak. Hence, instead of providing an external tool, the analyst would issue a query such as `CREATE PREDICTION MODEL FROM ...`.

Internally, the cloak would fetch the raw data, and then feed it to some off-the-shelf solution. We would likely need to support different algorithms, and various options. The outcome of the action is a prediction model which can be fed with new inputs to obtain predictions.

The question here is whether we can safely return the prediction model to the analyst, i.e. whether the prediction model preserves the privacy guarantees. My current impression is that it doesn't, at least not for random forests, where each particular input combination is covered in some tree.

One way of dealing with this is to keep the model stored inside the cloak. This would mean that analysts can only obtain predictions by sending new inputs to the cloak. I'm not sure how useful this would be in practice though. More likely, analysts are accustomed to using their own tools, which are likely feature-rich beyond what we can achieve with a reasonable effort.

Another option is to try to sanitize the obtained prediction model before sending it back. The purpose of the sanitization is to remove all the branches which can compromise an individual (or a small group). I'm not completely sure whether this resolves all the privacy issues though.
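A minimal sketch of what such sanitization might look like, assuming a hypothetical tree representation where each node carries the count of individuals it covers (the format and the threshold are invented for illustration):

```python
# Hypothetical sketch of "sanitizing" a decision tree: prune any branch
# whose sample count falls below a threshold, replacing it with a
# censored leaf. Whether this suffices for anonymity is the open question.
THRESHOLD = 5  # made-up low-count cutoff

def prune(node, threshold=THRESHOLD):
    """Recursively replace low-count branches with a censored leaf."""
    if node["count"] < threshold:
        return {"leaf": "*", "count": node["count"]}
    if "children" in node:
        node = dict(node)
        node["children"] = [prune(c, threshold) for c in node["children"]]
    return node

tree = {"field": "fever", "count": 100, "children": [
    {"leaf": "flu", "count": 97},
    {"leaf": "rare-disease", "count": 3},   # could identify individuals
]}

print(prune(tree)["children"][1]["leaf"])  # *
```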

Maximizing the useful data returned from regular queries

As mentioned earlier, the root cause of our ML-related problem is the fact that with `select *` usually all (or most) of the data is anonymized away, so the result becomes useless for further training.

Instead of trying to provide an out-of-the-box ML solution, we could focus on returning as much of the useful data as possible. We already have a very basic hardcoded support for that, by premasking `user_id` fields, so that `select * from table` might still end up returning some meaningful data, even though every selected row is unique.

As an extension of this idea, the cloak would somehow figure out which fields should be premasked to increase the amount of non-masked data returned. The benefit is that analysts can issue their select from an arbitrary table (or join), get their data, and use the result with any desired tool for any purpose, such as supervised or unsupervised ML. In other words, by improving the usefulness of the returned set, we can solve many problems.
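A hedged sketch of the idea, with an invented `ssn` column and a deliberately simplified low-count filter (a fixed cutoff instead of the real noisy decision): masking the high-cardinality column lets whole buckets survive that would otherwise be censored away.

```python
from collections import Counter

LCF_THRESHOLD = 5  # made-up low-count filter threshold

# Invented rows: every row is unique because of the ssn column.
columns = ["ssn", "diagnosis", "fever"]
rows = [
    ("ssn-1", "flu", "yes"),
    ("ssn-2", "flu", "yes"),
    ("ssn-3", "flu", "yes"),
    ("ssn-4", "flu", "yes"),
    ("ssn-5", "flu", "yes"),
]

def surviving_rows(rows, masked):
    """Count rows that would pass the simplified low-count filter
    after replacing the masked columns with '*'."""
    buckets = Counter(
        tuple("*" if col in masked else val
              for col, val in zip(columns, row))
        for row in rows)
    return sum(c for c in buckets.values() if c >= LCF_THRESHOLD)

# With nothing masked, every row is unique and everything is censored;
# premasking ssn merges the rows into one bucket that survives.
print(surviving_rows(rows, masked=set()))    # 0
print(surviving_rows(rows, masked={"ssn"}))  # 5
```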

On the flip side, this will probably also require a sophisticated implementation (which might even end up using some ML techniques itself); it would also have its own performance penalty, and ultimately might not give optimal results.

That said, I feel that this approach has by far the best value-to-effort ratio compared to the previous two.

obrok commented 6 years ago

> One way of dealing with this is to keep the model stored inside the cloak. This would mean that analysts can only obtain predictions by sending new inputs to the cloak. I'm not sure how useful this would be in practice though. More likely, analysts are accustomed to using their own tools, which are likely feature-rich beyond what we can achieve with a reasonable effort.

> Another option is to try to sanitize the obtained prediction model before sending it back. The purpose of the sanitization is to remove all the branches which can compromise an individual (or a small group). I'm not completely sure whether this resolves all the privacy issues though.

My immediate thought is that this is impossible without a significant effort put into understanding these models, and maybe without developing a privacy-specific prediction model. The prediction model "compresses" the knowledge about the domain, so it's difficult, if not impossible, to tweak this compressed representation so that no privacy is leaked. Even if we only allow querying the model from the outside, that seems very easily exploitable: you train the model on only the small amount of data you want to attack, and when you query it, it will just replay the exact values.

> As an extension of this idea, the cloak would somehow figure out which fields should be premasked to increase the amount of non-masked data returned. The benefit is that analysts can issue their select from an arbitrary table (or join), get their data, and use the result with any desired tool for any purpose, such as supervised or unsupervised ML. In other words, by improving the usefulness of the returned set, we can solve many problems.

We could also give the analyst a suite of functions that would allow them to figure out which fields need to be masked, leaving the final decision in their hands, but speeding up the process.

sebastian commented 6 years ago

> Some algorithms, such as Tree bagging and Random forests, would need additional support from the cloak, because they need to repeatedly analyze the same sample of the input set.

Not necessarily true? Some layer could cache results for queries, but this doesn't have to be the cloak nor the air. It could be an external implementation altogether.
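A sketch of such an external caching layer (the client interface is invented; `run_query` just stands in for whatever actually issues queries against Air): repeated analyses of the same sample hit Aircloak only once per distinct query.

```python
# Sketch of an external caching layer: algorithms that repeatedly
# analyze the same sample (tree bagging, random forests) reuse the
# cached result instead of requerying Aircloak each time.
class CachingClient:
    def __init__(self, run_query):
        self.run_query = run_query  # stand-in for a real Air client
        self.cache = {}
        self.misses = 0

    def query(self, sql):
        if sql not in self.cache:
            self.misses += 1
            self.cache[sql] = self.run_query(sql)
        return self.cache[sql]

# Fake backend returning a fixed, invented result set.
client = CachingClient(lambda sql: [("fever", True, 42)])
for _ in range(3):
    client.query("select fever, diagnosis from patients")
print(client.misses)  # 1
```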

> It's not clear to me whether this Aircloak-specific querying would actually confuse the algorithm and lead to a less reliable prediction model.

This is something we need to analyze and understand.

> The proposed implementation above would likely be quite slow, as we repeatedly have to refetch and anonymize the data, then ship it through Air, process the result, and then rinse & repeat.

I don't think speed is necessarily the main issue here. If you get good results at a cost in speed with the benefit of being compliant, then it might be a decent trade-off. Building these models is slow to start with.

> Considering that we also need to reinvent the whole ML implementation from scratch

We don't want to re-invent anything! That's the crucial part. We take off-the-shelf algorithms (I think most are quite heavily studied and published) and adapt the data access part. But of course using off-the-shelf tools would be better.

> One way of dealing with this is to keep the model stored inside the cloak. This would mean that analysts can only obtain predictions by sending new inputs to the cloak

I don't actually think this addresses the privacy concerns. Yes, it makes it harder to see what the parameters are, but I think you could still glean them by tuning the input parameters you pass to the engine.

> Another option is to try to sanitize the obtained prediction model before sending it back. The purpose of the sanitization is to remove all the branches which can compromise an individual (or a small group). I'm not completely sure whether this resolves all the privacy issues though.

It probably could, but that requires an exceptional understanding of the internals of the models built by these off-the-shelf tools.


We are (Felix, Paul and I) playing with the idea of offering Principal Component Analysis (PCA) and other correlation mechanisms in the cloak. In general, having features that allow an analyst to understand which columns are relevant and which are not would allow them to select those that provide value rather than those that just add noise and distort the anonymized outputs. However, even so, you are going to get quite severe anonymization by requesting all columns at once, rather than iteratively increasing the number of columns you request.

sasa1977 commented 6 years ago

> We could also give the analyst a suite of functions that would allow them to figure out which fields need to be masked, leaving the final decision in their hands, but speeding up the process.

Yeah, this sounds like a very interesting idea.

> We are (Felix, Paul and I) playing with the idea of offering Principal Component Analysis (PCA) and other correlation mechanisms in the cloak. In general, having features that allow an analyst to understand which columns are relevant and which are not would allow them to select those that provide value rather than those that just add noise and distort the anonymized outputs.

This looks like a variation of what Pawel mentioned above, right?

> However, even so, you are going to get quite severe anonymization by requesting all columns at once, rather than iteratively increasing the number of columns you request.

We might also get severe anonymization by iteratively increasing the number of columns. Moreover, if we do get such severe anonymization, the question is whether anything can be reliably concluded from the related input fields in the first place.

> Some algorithms, such as Tree bagging and Random forests, would need additional support from the cloak, because they need to repeatedly analyze the same sample of the input set.

> Not necessarily true? Some layer could cache results for queries, but this doesn't have to be the cloak nor the air. It could be an external implementation altogether.

My understanding of these two algorithms is that for every tree you pick a sample and work on that same sample. Since the idea of the external implementation is to repeatedly issue queries while building the tree, you have two options:

  1. Fetch all the fields at once (which completely eliminates the need for our own solution)
  2. Support repeated requerying of the same sample, where the sampling is done with replacement.
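For reference, option 2 amounts to bootstrap sampling: each tree is trained on a resample (with replacement) of the same result set, which is why the same sample must be requeryable or cached. A sketch on invented data:

```python
import random

random.seed(0)  # reproducible for illustration
data = list(range(10))  # stand-in for a cached anonymized result set

def bootstrap(rows):
    """Sample with replacement, same size as the original —
    the resampling step of tree bagging / random forests."""
    return [random.choice(rows) for _ in rows]

# One resample per tree, all drawn from the same cached result set.
samples = [bootstrap(data) for _ in range(3)]
print(len(samples[0]))  # 10
```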

> I don't think speed is necessarily the main issue here. If you get good results at a cost in speed with the benefit of being compliant, then it might be a decent trade-off. Building these models is slow to start with.

And so is querying Aircloak. Combining the two in the "reimplemented external ML" approach, we'll likely end up with extreme slowness, since we have to repeatedly requery the cloak and anonymize the data. With random forests, where you need to create multiple trees, I fear this is going to be useless.

> We don't want to re-invent anything! That's the crucial part. We take off-the-shelf algorithms (I think most are quite heavily studied and published) and adapt the data access part. But of course using off-the-shelf tools would be better.

We're certainly reimplementing non-trivial algorithms. Moreover, my impression is that there isn't really one "canonical" algorithm, but rather a whole family of them (decision trees, regression trees, random forests, ...), each with a bunch of variations (including which subalgorithms are used), and we might end up needing to support more than one.

cristianberneanu commented 6 years ago

> Instead of trying to provide an out-of-the-box ML solution, we could focus on returning as much of the useful data as possible.

This, to me, sounds somewhat similar to what the new low-count filtering algorithm is doing (i.e. trying to minimize the amount of censored data). The current version does a linear pass through all the columns, in the order of selection, while here a version that tries all possible permutations would be needed. That won't be very fast, though, and the question remains how to evaluate the amount of data output by a permutation.

sebastian commented 6 years ago

> We are (Felix, Paul and I) playing with the idea of offering Principal Component Analysis (PCA) and other correlation mechanisms in the cloak. In general, having features that allow an analyst to understand which columns are relevant and which are not would allow them to select those that provide value rather than those that just add noise and distort the anonymized outputs.

> This looks like a variation of what Pawel mentioned above, right?

Related in one sense, but very different in another. The problem is that an analyst often has no clue which columns are important and which are not. As a result, the analyst cannot properly select manually which ones should be part of a query. Trying out all permutations manually is a pain (this is what a custom random forest implementation could do for the analyst), and selecting too many columns yields poor results.

We need something like a correlation analysis or dimension-reduction analysis which has the cloak determine which columns are correlated with some desired output column and then helps the analyst select only those.
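A rough sketch of such a ranking, using plain Pearson correlation on invented data (a real version would run inside the cloak on raw data, and PCA or other dimension-reduction methods would be considerably more involved):

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric columns."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented data: rank candidate columns by |correlation| with the target,
# so the analyst can select only the columns that carry signal.
target = [1, 2, 3, 4, 5]
columns = {
    "relevant":   [2, 4, 6, 8, 10],  # perfectly correlated
    "irrelevant": [5, 1, 4, 2, 3],   # mostly noise
}
ranked = sorted(columns, key=lambda c: -abs(pearson(columns[c], target)))
print(ranked[0])  # relevant
```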

sebastian commented 6 years ago

My concern with PCA straight off the bat is that you need to output the eigenvalues alongside the new dimensions which (if I understand it right) show how the transformation was done. I think those values would be sensitive to extreme values and would pose a privacy risk.

cristianberneanu commented 6 years ago

After thinking about it a bit more, we might not need to compute all permutations. In order to maximize the number of outputted values, it might be enough to just sort the columns by the number of distinct values present in them.
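A sketch of that heuristic on invented data: order the columns by ascending distinct-value count, so low-cardinality columns come first and fewer buckets fall under the low-count filter.

```python
# Invented rows and column names for illustration.
rows = [
    (True,  "abc1",    "1990-03-03"),
    (True,  "abc2",    "1980-02-02"),
    (False, "xyz000",  "1990-03-03"),
    (False, "xyz0101", "1970-01-01"),
]
names = ["flag", "code", "birthdate"]

def order_by_cardinality(rows, names):
    """Sort column names by the number of distinct values, ascending."""
    cols = list(zip(*rows))  # transpose rows into columns
    return sorted(names, key=lambda n: len(set(cols[names.index(n)])))

print(order_by_cardinality(rows, names))  # ['flag', 'birthdate', 'code']
```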

sebastian commented 6 years ago

> After thinking about it a bit more, we might not need to compute all permutations. In order to maximize the number of outputted values, it might be enough to just sort the columns by the number of distinct values present in them.

Oh, that's a neat idea. Can you give it a try?

cristianberneanu commented 6 years ago

> Oh, that's a neat idea. Can you give it a try?

First of all, we can't do this in the general case, as we need to keep the order of the columns as specified in the query (maybe only when `*` is specified?).

Second, I think my approach was somewhat too simplistic.

For example, numbers (especially reals) are problematic in the sense that they can have lots of different values, even if most of those are somewhat clustered. For those, we would need a way to automatically determine the best cluster size, and it's not clear to me how that is best done.

Other columns might be irrelevant, but because they have few distinct values they would be put first in the list and thus lead to extra fragmentation of the buckets.

Is this urgent? Do you want me to stop working on performance and play with this right now?

sebastian commented 6 years ago

> For example, numbers (especially reals) are problematic in the sense that they can have lots of different values, even if most of those are somewhat clustered. For those, we would need a way to automatically determine the best cluster size, and it's not clear to me how that is best done.

Let's not do automatic bucketization for the time being. If anything, it should be something the analyst explicitly asks for, rather than something we do by default. The fact that we do not distort dimensions is a big win for many people I talk to.

> Is this urgent? Do you want me to stop working on performance and play with this right now?

No, not urgent. Let's ignore it altogether for now.

cristianberneanu commented 6 years ago

> Let's not do automatic bucketization for the time being. If anything, it should be something the analyst explicitly asks for, rather than something we do by default. The fact that we do not distort dimensions is a big win for many people I talk to.

I wasn't suggesting that we do bucketization on the entire set of values, only on the low-count filtered buckets. Initially, we grouped all LCF buckets into one, global censored bucket. Now, we censor each column individually and group LCF buckets into multiple censored buckets.

We can take this even further and censor each digit/letter individually for the LCF buckets. Right now, the entire value is censored, so there would be an increase in the amount of data extracted (at the cost of more post-processing).

For example, if we have the following bucket list:

| column 1 | column 2 |
| -------- | -------- |
| true     | abc1     |
| true     | abc2     |
| true     | abc1122  |
| false    | xyz000   |
| false    | xyz0101  |
| false    | xyz1234  |

The current algorithm might return:

| column 1 | column 2 |
| -------- | -------- |
| true     | *        |
| false    | *        |

While the improved algorithm might return:

| column 1 | column 2 |
| -------- | -------- |
| true     | abc*     |
| false    | xyz*     |

The same would be true for numbers, where decimals would be dropped one by one until the bucket passes the LCF. This would not be optimal in all cases, but it might increase the amount of extracted data in some scenarios (at the cost of making the query slower).
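A sketch of the per-character censoring on the example values above (the threshold is invented, and a fixed cutoff stands in for the real, noisy LCF decision): trailing characters are dropped until the remaining prefix is shared by enough values.

```python
LCF_THRESHOLD = 3  # made-up low-count filter threshold

values = ["abc1", "abc2", "abc1122", "xyz000", "xyz0101", "xyz1234"]

def censor(values, threshold=LCF_THRESHOLD):
    """Drop trailing characters one by one until each value's remaining
    prefix is shared by at least `threshold` values, then append '*'."""
    out = []
    for v in values:
        prefix = v
        while prefix and sum(w.startswith(prefix) for w in values) < threshold:
            prefix = prefix[:-1]
        out.append(prefix + "*" if prefix != v else v)
    return out

print(sorted(set(censor(values))))  # ['abc*', 'xyz*']
```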

sebastian commented 6 years ago

True, we could make it give out more information step by step. That would be neat. However, let's focus on performance first for now.

sebastian commented 6 years ago

Step 1 on the road to greater ML support is PCA: https://github.com/Aircloak/aircloak/issues/2481#issuecomment-374240401