Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform.

Supporting PCA #2484

Open sebastian opened 6 years ago

sebastian commented 6 years ago

We want to support Principal Component Analysis (PCA) as an advanced analytics option.

For an intro to PCA and how it looks in R, it's worth looking at these three videos:

The way I envision the feature working:

```sql
SELECT
  user_id,
  pca([col1, col2, col3], [eigenvector1, ...]) AS pc1, -- eigenvalue: eigenvalue1
  pca([col1, col2, col3], [eigenvector2, ...]) AS pc2, -- eigenvalue: eigenvalue2
  ...
FROM original_input_table_or_source
```

Let's keep the scope limited for now.

Things that need to be done first:

In parallel, we can do the following:

Then we iterate from there and get fancy as we get smarter.

sebastian commented 6 years ago

Just as a note for the future: while normal PCA is for continuous variables, there are "equivalent" alternatives for data sets that include non-linear components or categorical values. See this stackexchange question.


Also: I think it would be worth looking for a pure C/C++ implementation that we can use from Elixir. That way we don't need to copy all the data into Elixir and then out to some external process; it's much better to use an implementation that can share the memory space of our Elixir process!

Some more potentially useful information:

sasa1977 commented 6 years ago

I looked a bit at how outliers play a role in PCA. It appears that PCA can be sensitive to extreme values.

Here's an example of data where 90% of the datapoints have highly correlated variables, and the remaining 10% are outliers whose variables are not correlated:

image

The two pairs of arrows represent PCA computed only with the correlated data (PC* cor) and with all the data, including outliers. We can see that with outliers, the principal components are swapped. I experimented with smaller ratios of outliers, and my impression is that below 5% the outliers don't affect the chosen principal components.
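
For reference, a minimal sketch of this kind of experiment (not the exact script behind the plots above; the data shapes, seed, and noise levels are made up):

```r
set.seed(1)
n <- 900
correlated <- data.frame(x = rnorm(n))
correlated$y <- correlated$x + rnorm(n, sd = 0.1)       # high correlation
outliers <- data.frame(x = rnorm(100), y = rnorm(100))  # uncorrelated, 10% of the data
all_data <- rbind(correlated, outliers)

prcomp(correlated, scale. = TRUE)$rotation  # PCs from the correlated data only
prcomp(all_data, scale. = TRUE)$rotation    # PCs from all data, including outliers
```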

I also did a quick-and-dirty experiment where I replaced each extreme value in the top and bottom 10-percentile blocks with a random value from the neighbouring 10-percentile block.
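
A rough sketch of that replacement step for a single column (illustrative only; the exact decile handling and random draws are assumptions):

```r
smooth_extremes <- function(v) {
  q <- quantile(v, probs = seq(0, 1, by = 0.1))  # deciles: "0%", "10%", ..., "100%"
  low  <- v < q["10%"]
  high <- v > q["90%"]
  # replace bottom-decile values with random draws from the 10-20% block,
  # and top-decile values with draws from the 80-90% block
  v[low]  <- runif(sum(low),  min = q["10%"], max = q["20%"])
  v[high] <- runif(sum(high), min = q["80%"], max = q["90%"])
  v
}
```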

This is what I've got with 1% outliers:

image

As can be seen, this "smooths out" the outliers, but doesn't remove them. The PCs seem to be properly chosen. However, the biplots will still differ. Here's an example where we have 100% correlation (disregarding the outliers):

image

With outliers replaced from the next/previous percentile block:

image

The main thing to see here is that the first diagram properly concludes that the variables are perfectly correlated, while in the second diagram the correlation is present, but not perfect.

It's also worth pointing out that, from the standpoint of PCA, an outlier doesn't have to contain extreme values. An outlier could also be a single datapoint whose values are all in range, but whose correlation is very different from the majority's.

My impression, based on these simple experiments and casual reading, is that PCA is highly sensitive to deviating datapoints, and that there's no one true way to deal with them. Some techniques I've seen mentioned are the Tukey method (see example here) and robust PCA.
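
As an illustration, one per-column reading of the Tukey method (fences at 1.5 * IQR; note that a per-dimension filter like this can't catch the correlation outliers described above):

```r
tukey_filter <- function(df, k = 1.5) {
  keep <- rep(TRUE, nrow(df))
  for (col in names(df)) {
    q <- quantile(df[[col]], c(0.25, 0.75))
    fence <- k * IQR(df[[col]])
    # drop the row if this column falls outside [Q1 - fence, Q3 + fence]
    keep <- keep & df[[col]] >= q[1] - fence & df[[col]] <= q[2] + fence
  }
  df[keep, ]
}
```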

sebastian commented 6 years ago

@yoid2000 please have a look here. As @sasa1977 points out above, PCAs aren't trivial by any means. The problem is that a value can be an outlier without being an outlier in any single dimension. @sasa1977 gave me an easy-to-visualize example offline: all datapoints have x = y, with values in the range from 0 to 100, plus one outlier with x = 10 and y = 30. Clearly that is an outlier, but it's nothing a naïve per-dimension approach to outlier detection would ever be able to detect.

One can try some homegrown solutions like:

but these types of ad-hoc approaches are bound to hide some problems as a result of us not fully understanding the implications...

Any ideas would be welcome!

sasa1977 commented 6 years ago

Here's a quick demo.

Data with 100 perfectly correlated datapoints and 1 outlier:

image

Biplot without outlier:

image

Biplot with outlier:

image

So the point is that in the first biplot, the vectors for var1 and var2 are the same, whereas in the second biplot they are different.

R code is available here.

sebastian commented 6 years ago

@Aircloak/developers: the rest of you, please feel free to chime in here too!

sasa1977 commented 6 years ago

I think that the core challenge here is figuring out how to compute covariance and correlation without doing count-based filtering, while retaining privacy guarantees.

Covariance between two columns is computed with the following formula:

cov(X,Y) = sum((Xi - mean(X)) * (Yi - mean(Y))) / (n-1)

where Xi and Yi are the i-th values, and n is the number of elements.

The formula for correlation is:

corr(X,Y) = cov(X,Y) / (stddev(X) * stddev(Y))
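
For concreteness, the same formulas written out in R and checked against the built-ins:

```r
# sample covariance and Pearson correlation, as defined above
my_cov <- function(x, y) sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
my_cor <- function(x, y) my_cov(x, y) / (sd(x) * sd(y))

x <- rnorm(100); y <- x + rnorm(100)
all.equal(my_cov(x, y), cov(x, y))  # TRUE
all.equal(my_cor(x, y), cor(x, y))  # TRUE
```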

For both quantities, the presence or absence of a single datapoint affects the value, which is the privacy risk. This is particularly problematic for outlier datapoints which deviate from the correlation of the majority: in such cases, a single datapoint can visibly change the result.

Some ideas Sebastian and I have been throwing around:

It should also be mentioned that the formulas above require loading the entire dataset into memory (or, alternatively, making two passes through all the rows). Since we have to compute covariance or correlation for all pairs of columns, this might present a problem for large datasets. One way to deal with it would be to sample the dataset. I'm mentioning this here because sampling might also be something that could help us introduce the controlled noise.
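
As a side note, there are single-pass update formulas that avoid both requirements. A minimal Welford-style sketch for one column pair (the chunked interface is an assumption about how we'd feed it data):

```r
streaming_cov <- function(chunks) {
  n <- 0; mean_x <- 0; mean_y <- 0; co_moment <- 0
  for (chunk in chunks) {               # each chunk is a data.frame with columns x and y
    for (i in seq_len(nrow(chunk))) {
      n <- n + 1
      dx <- chunk$x[i] - mean_x         # delta against the *old* mean of x
      mean_x <- mean_x + dx / n
      mean_y <- mean_y + (chunk$y[i] - mean_y) / n
      co_moment <- co_moment + dx * (chunk$y[i] - mean_y)  # uses the *new* mean of y
    }
  }
  co_moment / (n - 1)                   # sample covariance
}
```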

sebastian commented 6 years ago

Hm, I wonder what the result would be if we used our anonymization algorithms for the sum, mean and stddev? It would distort the covariance and correlation of course, but I wonder in what way...

It could be that doing a two-pass PCA is in fact enough...

sasa1977 commented 6 years ago

> I wonder what the result would be if we used our anonymization algorithms for the sum, mean and stddev

I'm not sure this will be enough by itself to protect privacy. In our current flow, we anonymize these functions after we've applied LCF. In the PCA scenario, we explicitly want to avoid LCF.

That said, my intuition is that we should probably remove extreme values, as they can significantly affect the result of stddev. But even with extremes removed, we're still left with the problem that a few outliers can affect the result.

Perhaps instead of standard LCF, we could consider some kind of classification in an n-dimensional space, and then remove clusters that are too small? However, I'm not sure that this would give us better results if the dataset contains a lot of uncorrelated dimensions.
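
A very rough sketch of that idea (k and min_size are placeholder parameters, and kmeans is just one possible classifier):

```r
cluster_filter <- function(df, k = 10, min_size = 5) {
  km <- kmeans(scale(df), centers = k)           # cluster in standardized n-dim space
  sizes <- table(km$cluster)                     # datapoints per cluster
  keep <- sizes[as.character(km$cluster)] >= min_size  # per-row: is my cluster big enough?
  df[keep, ]
}
```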

sebastian commented 6 years ago

Some more input from Carsten (SAP):

> Hi Sebastian,
>
> PCA is indeed very prone to observations in the tails of the distributions.
>
> For starters, I would drop the top and bottom 5% of observations, and then do PCA.
>
> Another possibility would be to decompose a robust correlation matrix. "Robust" means that extreme values do not distort such measures. There are some variants of robust correlation computation, e.g. http://www.stat.tugraz.at/AJS/ausg111+2/111+2Shevlyakov.pdf, equation (4), where the mean is substituted by the median.
>
> Hope this helps
> Best
> Carsten

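
A minimal sketch of the first suggestion, assuming the trimming is done per column and the whole row is dropped (the follow-up question below asks whether that's the intended reading):

```r
trim_tails <- function(df, p = 0.05) {
  keep <- rep(TRUE, nrow(df))
  for (col in names(df)) {
    q <- quantile(df[[col]], c(p, 1 - p))
    # drop the row if this column is in its own top or bottom 5%
    keep <- keep & df[[col]] >= q[1] & df[[col]] <= q[2]
  }
  df[keep, ]
}

# pca_input is a placeholder for the real dataset
# prcomp(trim_tails(pca_input), scale. = TRUE)
```
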
sebastian commented 6 years ago

I have sent Carsten a follow-up question to clarify the top and bottom removal: more concretely, whether he means doing this on each dimension individually, and if so, whether the whole observation (row) is removed as well, or whether it's a matter of replacing the extreme values with other representative values.

sebastian commented 6 years ago

Look here: https://github.com/Aircloak/statistical-science/blob/master/notebooks/principal%20components%20analysis.ipynb

Andreas (who is helping us with stats and machine learning) did some initial PCA work which looks very promising.

sebastian commented 5 years ago

Could an incremental PCA be used in order to reduce the memory pressure? Say, we stream the data through our system rather than loading the entire dataset.
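
One way this could work is to stream the data in chunks, accumulate only the statistics needed for the covariance matrix, and eigen-decompose at the end. A sketch (the chunked interface is an assumption; the naive sum-of-squares formula can lose precision on badly scaled data):

```r
incremental_pca <- function(chunks) {
  n <- 0; col_sums <- NULL; cross <- NULL
  for (chunk in chunks) {               # stream chunks instead of loading the whole table
    m <- as.matrix(chunk)
    n <- n + nrow(m)
    col_sums <- if (is.null(col_sums)) colSums(m) else col_sums + colSums(m)
    cross    <- if (is.null(cross)) crossprod(m) else cross + crossprod(m)
  }
  # covariance matrix from the accumulated sums: (X'X - s s' / n) / (n - 1);
  # memory use is O(columns^2), independent of the number of rows
  cov_mat <- (cross - outer(col_sums, col_sums) / n) / (n - 1)
  eigen(cov_mat)  # $vectors are the principal axes, $values their variances
}
```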