Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform.

Supporting PCA #2484

Open sebastian opened 6 years ago

sebastian commented 6 years ago

We want to support Principal Component Analysis (PCA) as an advanced analytics option.

For an intro to PCA and how it looks in R, it's worth looking at these three videos:

The way I envision the feature working:

```sql
SELECT
  user_id,
  pca([col1, col2, col3], [eigenvector1, ...]) AS pc1, -- eigenvalue: eigenvalue1
  pca([col1, col2, col3], [eigenvector2, ...]) AS pc2, -- eigenvalue: eigenvalue2
  ...
FROM original_input_table_or_source
```

Let's keep the scope limited for now.

Things that need to be done first:

In parallel, we can do the following:

Then we iterate from there and get fancy as we get smarter.

sebastian commented 6 years ago

Just as a note for the future: while normal PCA is for continuous variables, there are "equivalent" alternatives for data sets that include non-linear components or categorical values. See this stackexchange question.


Also: I think it would be worth looking for a pure C/C++ implementation that we can use from Elixir. That way we don't need to copy all the data into Elixir and then out to some external process; it's much better to use an implementation that can share the memory space of our Elixir process!

Some more potentially useful information:

sasa1977 commented 6 years ago

I looked a bit at how outliers play a role in PCA. It appears that PCA can be sensitive to extreme values.

Here's an example of data where 90% of the datapoints have highly correlated variables, and the remaining 10% are outliers whose variables are not correlated:

image

The two pairs of arrows represent PCA computed only with the correlated data (PC* cor) and with all the data, including outliers. We can see that with outliers, the principal components are swapped. I experimented with smaller ratios of outliers, and my impression is that below 5% the outliers don't affect the chosen principal components.
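
For reference, a minimal sketch of this kind of experiment (not the exact script behind the plots above; the data shapes, seed, and noise levels are made up):

```r
set.seed(1)
n <- 900
correlated <- data.frame(x = rnorm(n))
correlated$y <- correlated$x + rnorm(n, sd = 0.1)       # high correlation
outliers <- data.frame(x = rnorm(100), y = rnorm(100))  # uncorrelated, 10% of the data
all_data <- rbind(correlated, outliers)

prcomp(correlated, scale. = TRUE)$rotation  # PCs from the correlated data only
prcomp(all_data, scale. = TRUE)$rotation    # PCs from all data, including outliers
```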

I also did a quick-and-dirty experiment where I replaced each extreme value in the top and bottom 10-percentile blocks with a random value from the neighbouring 10-percentile block.
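
A rough sketch of that replacement step for a single column (illustrative only; the exact decile handling and random draws are assumptions):

```r
smooth_extremes <- function(v) {
  q <- quantile(v, probs = seq(0, 1, by = 0.1))  # deciles: "0%", "10%", ..., "100%"
  low  <- v < q["10%"]
  high <- v > q["90%"]
  # replace bottom-decile values with random draws from the 10-20% block,
  # and top-decile values with draws from the 80-90% block
  v[low]  <- runif(sum(low),  min = q["10%"], max = q["20%"])
  v[high] <- runif(sum(high), min = q["80%"], max = q["90%"])
  v
}
```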

This is what I've got with 1% outliers:

image

As can be seen, this "smooths out" the outliers, but doesn't remove them. The PCs seem to be properly chosen. However, the biplots will still differ. Here's an example where we have 100% correlation (disregarding the outliers):

image

With outliers replaced from the next/previous percentile block:

image

The main thing to see here is that the first diagram properly concludes that the variables are perfectly correlated, while in the second diagram the correlation is present, but not perfect.

It's also worth pointing out that, from the standpoint of PCA, an outlier doesn't have to contain extreme values. An outlier could also be a single datapoint whose values are all in range, but whose correlation is very different from the majority's.

My impression, based on these simple experiments and casual reading, is that PCA is highly sensitive to deviating datapoints, and that there's no one true way to deal with them. Some techniques I've seen mentioned are the Tukey method (see example here) and robust PCA.
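
As an illustration, one per-column reading of the Tukey method (fences at 1.5 * IQR; note that a per-dimension filter like this can't catch the correlation outliers described above):

```r
tukey_filter <- function(df, k = 1.5) {
  keep <- rep(TRUE, nrow(df))
  for (col in names(df)) {
    q <- quantile(df[[col]], c(0.25, 0.75))
    fence <- k * IQR(df[[col]])
    # drop the row if this column falls outside [Q1 - fence, Q3 + fence]
    keep <- keep & df[[col]] >= q[1] - fence & df[[col]] <= q[2] + fence
  }
  df[keep, ]
}
```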

sebastian commented 6 years ago

@yoid2000 please have a look here. As @sasa1977 points out above, PCAs aren't trivial by any means. The problem is that a value can be an outlier without being an outlier in any single dimension. @sasa1977 gave me an easy-to-visualize example offline: all datapoints have x = y, with values in the range from 0 to 100, plus one outlier with x = 10 and y = 30. Clearly that is an outlier, but it's nothing a naïve per-dimension approach to outlier detection would ever be able to detect.

One can try some homegrown solutions like:

but these types of ad-hoc approaches are bound to hide some problems as a result of us not fully understanding the implications...

Any ideas would be welcome!

sasa1977 commented 6 years ago

Here's a quick demo.

Data with 100 perfectly correlated datapoints and 1 outlier:

image

Biplot without outlier:

image

Biplot with outlier:

image

So the point is that in the first biplot, the vectors for var1 and var2 are the same, whereas in the second biplot they are different.

R code is available here.

sebastian commented 6 years ago

@Aircloak/developers: the rest of you, please feel free to chime in here too!

sasa1977 commented 6 years ago

I think that the core challenge here is figuring out how to compute covariance and correlation without doing count-based filtering, while retaining privacy guarantees.

Covariance between two columns is computed with the following formula:

cov(X,Y) = sum((Xi - mean(X)) * (Yi - mean(Y))) / (n-1)

where Xi and Yi are the i-th values, and n is the number of elements.

The formula for correlation is:

corr(X,Y) = cov(X,Y) / (stddev(X) * stddev(Y))
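
For concreteness, the same formulas written out in R and checked against the built-ins:

```r
# sample covariance and Pearson correlation, as defined above
my_cov <- function(x, y) sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
my_cor <- function(x, y) my_cov(x, y) / (sd(x) * sd(y))

x <- rnorm(100); y <- x + rnorm(100)
all.equal(my_cov(x, y), cov(x, y))  # TRUE
all.equal(my_cor(x, y), cor(x, y))  # TRUE
```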

For both quantities, the presence or absence of a single datapoint affects the value, which is the privacy risk. This is particularly problematic for outlier datapoints which deviate from the correlation of the majority: in such cases, a single datapoint can visibly change the result.

Some ideas Sebastian and I have been throwing around:

It should also be mentioned that the formulas above require loading the entire dataset into memory (or, alternatively, making two passes through all the rows). Since we have to compute covariance or correlation for all pairs of columns, this might present a problem for large datasets. One way to deal with it would be to sample the dataset. I'm mentioning this here because sampling might also be something that could help us introduce the controlled noise.
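
As a side note, there are single-pass update formulas that avoid both requirements. A minimal Welford-style sketch for one column pair (the chunked interface is an assumption about how we'd feed it data):

```r
streaming_cov <- function(chunks) {
  n <- 0; mean_x <- 0; mean_y <- 0; co_moment <- 0
  for (chunk in chunks) {               # each chunk is a data.frame with columns x and y
    for (i in seq_len(nrow(chunk))) {
      n <- n + 1
      dx <- chunk$x[i] - mean_x         # delta against the *old* mean of x
      mean_x <- mean_x + dx / n
      mean_y <- mean_y + (chunk$y[i] - mean_y) / n
      co_moment <- co_moment + dx * (chunk$y[i] - mean_y)  # uses the *new* mean of y
    }
  }
  co_moment / (n - 1)                   # sample covariance
}
```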

sebastian commented 6 years ago

Hm, I wonder what the result would be if we used our anonymization algorithms for the sum, mean and stddev? It would distort the covariance and correlation of course, but I wonder in what way...

It could be that doing a two-pass PCA is in fact enough...

sasa1977 commented 6 years ago

> I wonder what the result would be if we used our anonymization algorithms for the sum, mean and stddev

I'm not sure this will be enough by itself to protect privacy. In our current flow, we anonymize these functions after we've applied LCF. In the PCA scenario, we explicitly want to avoid LCF.

That said, my intuition is that we should probably remove extreme values, as they can significantly affect the result of stddev. But even with extremes removed, we're still left with the problem that a few outliers can affect the result.

Perhaps instead of standard LCF, we could consider some kind of classification in an n-dimensional space, and then remove clusters that are too small? However, I'm not sure that this would give us better results if the dataset contains a lot of uncorrelated dimensions.
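
A very rough sketch of that idea (k and min_size are placeholder parameters, and kmeans is just one possible classifier):

```r
cluster_filter <- function(df, k = 10, min_size = 5) {
  km <- kmeans(scale(df), centers = k)           # cluster in standardized n-dim space
  sizes <- table(km$cluster)                     # datapoints per cluster
  keep <- sizes[as.character(km$cluster)] >= min_size  # per-row: is my cluster big enough?
  df[keep, ]
}
```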

sebastian commented 6 years ago

Some more input from Carsten (SAP):

> Hi Sebastian,
>
> PCA is indeed very prone to observations in the tails of the distributions.
>
> For starters, I would drop the top and bottom 5% of observations, and then do PCA.
>
> Another possibility would be to decompose a robust correlation matrix. "Robust" means that extreme values do not distort such measures. There are some variants of robust correlation computation, e.g. http://www.stat.tugraz.at/AJS/ausg111+2/111+2Shevlyakov.pdf, equation (4), where the mean is substituted by the median.
>
> Hope this helps
> Best
> Carsten

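
A minimal sketch of the first suggestion, assuming the trimming is done per column and the whole row is dropped (the follow-up question below asks whether that's the intended reading):

```r
trim_tails <- function(df, p = 0.05) {
  keep <- rep(TRUE, nrow(df))
  for (col in names(df)) {
    q <- quantile(df[[col]], c(p, 1 - p))
    # drop the row if this column is in its own top or bottom 5%
    keep <- keep & df[[col]] >= q[1] & df[[col]] <= q[2]
  }
  df[keep, ]
}

# pca_input is a placeholder for the real dataset
# prcomp(trim_tails(pca_input), scale. = TRUE)
```
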
sebastian commented 6 years ago

I have sent Carsten a follow-up question to clarify the top and bottom removal: more concretely, whether he means doing this on each dimension individually, and if so, whether the whole observation (row) is removed as well, or whether it's a matter of replacing the extreme values with other representative values.

sebastian commented 6 years ago

Look here: https://github.com/Aircloak/statistical-science/blob/master/notebooks/principal%20components%20analysis.ipynb

Andreas (who is helping us with stats and machine learning) did some initial PCA work which looks very promising.

sebastian commented 5 years ago

Could an incremental PCA be used in order to reduce the memory pressure? Say, we stream the data through our system rather than loading the entire dataset.
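
One way this could work is to stream the data in chunks, accumulate only the statistics needed for the covariance matrix, and eigen-decompose at the end. A sketch (the chunked interface is an assumption; the naive sum-of-squares formula can lose precision on badly scaled data):

```r
incremental_pca <- function(chunks) {
  n <- 0; col_sums <- NULL; cross <- NULL
  for (chunk in chunks) {               # stream chunks instead of loading the whole table
    m <- as.matrix(chunk)
    n <- n + nrow(m)
    col_sums <- if (is.null(col_sums)) colSums(m) else col_sums + colSums(m)
    cross    <- if (is.null(cross)) crossprod(m) else cross + crossprod(m)
  }
  # covariance matrix from the accumulated sums: (X'X - s s' / n) / (n - 1);
  # memory use is O(columns^2), independent of the number of rows
  cov_mat <- (cross - outer(col_sums, col_sums) / n) / (n - 1)
  eigen(cov_mat)  # $vectors are the principal axes, $values their variances
}
```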