I have found edge cases where transforming new, unseen data changes the results in the 'outliers' dataframe for the original data used in fit, even with update_outlier_params=False. This applies specifically to the Hotelling T2 statistic.
Digging into it, the cause is that hotellingsT2(), called by compute_outliers() from transform(), uses all rows of the PC dataframe.
The hotellingsT2() function uses all rows of the PC dataframe to compute the outliers in the new data. The results for the original rows don't change for the _yscore calculation (since the mean and variance are locked), nor for the raw probabilities or the Pcomb variables.
But the calculation of Pcorr via multitest_correction() is directly affected by using more rows than before, and it is this column that is compared against alpha to determine the _ybool column in results['outliers'].
So, in short, fitting data and then transforming new data with update_outlier_params=False can change the _yproba and _ybool of the originally fitted data in certain cases.
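The dependence of the correction on the number of tests is easy to demonstrate outside the library. Below is a minimal sketch using statsmodels' multipletests; fdr_bh is an assumption on my part, as pca's multitest_correction() may use a different method, but the effect is the same for any n-dependent procedure: the same raw p-value is significant when corrected alone, yet not when corrected alongside extra tests.

```python
# Minimal demonstration that a multiple-testing correction depends on how
# many p-values are corrected together. The fdr_bh method is an assumption;
# pca's multitest_correction() may use a different one.
import numpy as np
from statsmodels.stats.multitest import multipletests

p_orig = np.array([0.01, 0.30, 0.50, 0.80])   # raw p-values of "fitted" rows

# Correct the original p-values alone (the fit-time situation)
rej_alone, p_alone, _, _ = multipletests(p_orig, alpha=0.05, method="fdr_bh")

# Correct the same p-values together with four new rows (post-transform)
p_new = np.array([0.60, 0.70, 0.85, 0.90])
rej_joint, p_joint, _, _ = multipletests(np.concatenate([p_orig, p_new]),
                                         alpha=0.05, method="fdr_bh")

print(p_alone[0], rej_alone[0])   # 0.04, True  -> flagged as outlier
print(p_joint[0], rej_joint[0])   # 0.08, False -> no longer flagged
```

The raw p-value of the first row never changed; only the number of rows fed into the correction did, which is exactly what happens to the fitted rows when transform() appends new data.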
I experimented and created a simple dummy-data example that replicates this. To be fair, I'm not sure this is a huge concern, but I figure the expectation is that the outlier params of previously fitted data won't change when update_outlier_params=False. It also showed up in the usage I'm building.
This example changes the number of Hotelling T2 outliers (as determined by _ybool) in the original fit data from 1 to 0.
import numpy as np
import pandas as pd
from pca import pca
# Create dataset
np.random.seed(42)
X_orig = pd.DataFrame(np.random.randint(low=1, high=10, size=(10000, 10)))
# Insert Outliers
X_orig.iloc[500:510, 8:] = 15
# PCA Training
model = pca(n_components=5, alpha=0.05, n_std=3, normalize=True, random_state=42)
results = model.fit_transform(X=X_orig)
outliers_original = model.results['outliers']
# Create New Data
X_new = pd.DataFrame(np.random.randint(low=1, high=10, size=(1000, 10)))
# Transform New Data
model.transform(X=X_new, update_outlier_params=False)
outliers_new = model.results['outliers']
# Compare Original Points Outlier Results Before and After Transform
print("Before:", outliers_original['y_bool'].value_counts())
print("After:", outliers_new.iloc[:len(X_orig)]['y_bool'].value_counts())
I'm not sure what the fix is from a statistics standpoint, whether it's running the multitest correction differently or checking for changes, but I wanted to raise the question.
I understand that it inherently makes sense for _yproba to change for the previous data once more data is added, so this is perhaps more a philosophical problem than a statistical one. But for someone tracking outliers as more and more data is transformed, it shows up.
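In the meantime, a user-side stop-gap (not a library fix) is to snapshot results['outliers'] before calling transform() and restore the fitted rows afterwards. A sketch, assuming the fitted rows occupy the first positions of the combined table and both frames share the same columns; freeze_original_outliers is a hypothetical helper, not part of pca:

```python
import pandas as pd

def freeze_original_outliers(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Restore the originally fitted rows of the post-transform outlier table
    from a snapshot taken before transform(), keeping the new rows' results.

    Assumes the fitted rows come first in `after` (positionally) and that
    both frames share the same columns.
    """
    out = after.copy()
    n = len(before)
    for col in before.columns:
        out.iloc[:n, out.columns.get_loc(col)] = before[col].to_numpy()
    return out

# Dummy illustration: row 0's y_bool flips in the combined table, and the
# helper restores the snapshot value for the fitted rows only.
snap = pd.DataFrame({"y_proba": [0.04, 0.60], "y_bool": [True, False]})
combined = pd.DataFrame({"y_proba": [0.08, 0.90, 0.30], "y_bool": [False, False, True]})
fixed = freeze_original_outliers(snap, combined)
print(fixed)
```

This only papers over the reported values, of course; the new rows are still corrected jointly with the old ones, so it doesn't resolve the underlying statistical question.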