erdogant / pca

pca: A Python Package for Principal Component Analysis.
https://erdogant.github.io/pca
MIT License

Transform with update_outlier_params=False will still change Hotelling T2 outlier results on the fit data #54


lambdatascience commented 5 months ago

I have found edge cases where transforming new, unseen data changes the results in the 'outliers' dataframe for the original data used in fit, even with update_outlier_params=False. This applies specifically to the Hotelling T2 statistic.

Digging into it, the cause is that hotellingsT2(), called by compute_outliers() from transform(), uses all rows of the PC dataframe.

The hotellingsT2() function uses all rows of the PC dataframe to compute the outliers for the new data. The per-row quantities are unaffected: y_score doesn't change (since the mean and variance are locked), and neither do y_proba or even the Pcomb values for the original rows.
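To illustrate the part that stays fixed, here is a minimal sketch of those per-row stages as I understand them (locked mean/variance, chi-square p-value per PC, Fisher combination per row); it mimics the pipeline described above rather than the library's exact code, and the numbers are made up:

import numpy as np
from scipy import stats

# Locked parameters from fit time (hypothetical values)
mean, var = 5.0, 4.0

# Two original rows, two principal components
rows = np.array([[6.0, 3.0], [9.5, 1.0]])

# Per-PC score and raw p-value; each depends only on its own row
y_score = (rows - mean) ** 2 / var
y_proba = 1 - stats.chi2.cdf(y_score, df=1)

# Fisher-combined p-value per row: also row-local, so appending new
# rows cannot change these values for the original rows
pcomb = [stats.combine_pvalues(p, method='fisher')[1] for p in y_proba]
print(pcomb)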

But the Pcorr values produced by multitest_correction() are directly affected by using more rows than before, because multiple-testing corrections such as FDR rank and rescale the p-values across the whole batch, so the corrected value assigned to an original row depends on how many rows are corrected together. And it is this column that is compared against alpha to determine the y_bool column in results['outliers'].

So, in short, fitting data and then transforming more data with update_outlier_params=False can change the y_proba and y_bool of the original fit data in certain cases.
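Here is a standalone demonstration of that mechanism, assuming the correction behaves like statsmodels' fdr_bh (I believe that is pca's default multiple-testing method, but treat it as an assumption; the p-values are made up):

import numpy as np
from statsmodels.stats.multitest import multipletests

# Made-up combined p-values for five "original" rows
p_orig = np.array([0.001, 0.02, 0.04, 0.30, 0.70])

# Correcting only the original rows: two of them pass alpha=0.05
reject_a, pcorr_a, _, _ = multipletests(p_orig, alpha=0.05, method='fdr_bh')

# Append four unremarkable "new" rows and correct the whole batch again
p_all = np.concatenate([p_orig, [0.50, 0.60, 0.80, 0.90]])
reject_b, pcorr_b, _, _ = multipletests(p_all, alpha=0.05, method='fdr_bh')

# FDR ranks p-values across the whole batch, so the corrected values
# for the original rows shift and one of the two flags flips off
print(pcorr_a, reject_a)
print(pcorr_b[:5], reject_b[:5])

The same raw p-values come out with different corrected values once more rows are in the batch, which is exactly what flips y_bool for the original data.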

I experimented and created a simple dummy-data example that replicates this. To be fair, I'm not sure it's a huge concern, but I figure the expectation is that the outlier results for previously fit data won't change when update_outlier_params=False. And it showed up in the usage I'm building.

This example changes the number of Hotelling T2 outliers (as determined by y_bool) in the original fit data from 1 to 0.

import numpy as np
import pandas as pd

from pca import pca

# Create dataset
np.random.seed(42)
X_orig = pd.DataFrame(np.random.randint(low=1, high=10, size=(10000, 10)))
# Insert Outliers
X_orig.iloc[500:510, 8:] = 15

# PCA Training
model = pca(n_components=5, alpha=0.05, n_std=3, normalize=True, random_state=42)
results = model.fit_transform(X=X_orig)

outliers_original = model.results['outliers'].copy()  # copy, so the later transform can't mutate this snapshot

# Create New Data
X_new = pd.DataFrame(np.random.randint(low=1, high=10, size=(1000, 10)))

# Transform New Data
model.transform(X=X_new, update_outlier_params=False)
outliers_new = model.results['outliers']

# Compare Original Points Outlier Results Before and After Transform
n_orig = len(X_orig)  # number of rows used in fit
print("Before:", outliers_original['y_bool'].value_counts())
print("After:", outliers_new.iloc[:n_orig]['y_bool'].value_counts())

I'm not sure what the fix is from a statistics standpoint, whether it's running the multiple-testing correction differently, checking for changes, etc. But I wanted to raise the question.

I understand that it inherently makes sense for y_proba to change for the previous data once more data is added, so it seems more a philosophical problem than a statistical one. But as someone tracking outliers as more and more data is transformed, it showed up.
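In the meantime, one possible user-side workaround (just a sketch, not a statistical fix) is to snapshot the outlier table right after fit and append only the rows for newly transformed data:

import pandas as pd

# Snapshot the outlier table right after fit
outliers_fit = model.results['outliers'].copy()

# Transform new data as usual
model.transform(X=X_new, update_outlier_params=False)

# Keep the fit-time results for the original rows and append only the
# rows that correspond to the newly transformed data
new_rows = model.results['outliers'].iloc[len(outliers_fit):]
outliers_tracked = pd.concat([outliers_fit, new_rows])

This keeps the original flags stable for tracking purposes, though the corrected p-values for the new rows are still computed jointly with the old ones, so it sidesteps the statistical question rather than solving it.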

erdogant commented 4 months ago

Thank you for observing and mentioning this. I need to chew on this a bit.