erdogant / pca

pca: A Python Package for Principal Component Analysis.
https://erdogant.github.io/pca
MIT License

Enable collecting the T2 and SPE parameters for reuse in future monitoring (quality control chart context) #16

Closed: hovinh closed this 3 years ago

hovinh commented 3 years ago

This PR is related to Issue #15. Problem statement: use the pca package as a monitoring method, in the form of a quality control chart.

Changes I have made:

Code to test the new change:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pca import pca

np.random.seed(42)
# Generate a synthetic dataset and split it into train and test sets
n_total, train_ratio = 10000, 0.8
n_features = 10
my_array = np.random.randint(low=1, high=10, size=(n_total, n_features))
features = [f'f{i}' for i in range(1, n_features+1, 1)]
X = pd.DataFrame(my_array, columns=features)
X_train = X.sample(frac=train_ratio)
X_test = X.drop(X_train.index)

# Training
model = pca(n_components=5, alpha=0.5, n_std=3, onehot=False, normalize=True, random_state=42)
results, param_dict = model.fit_transform(X=X_train[features], row_labels=None, col_labels=None, verbose=3)
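# Control-chart statistics on the log-transformed training scores
T2_train = np.log(results['outliers']['y_score'])
T2_mu, T2_sigma = T2_train.agg(['mean', 'std'])
# Upper control limit: mean + 3 standard deviations
T2_limit = T2_mu + 3 * T2_sigma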

# Inference
PC_test = model.transform(X=X_test[features], row_labels=None, col_labels=None, verbose=3)
PC_test = np.array(PC_test)
scores, _ = model.compute_outliers(PC=PC_test, n_std=3, param_dict=param_dict, verbose=3) 
T2_test = np.log(scores['y_score'])

# Plot: training scores in black, test scores in blue
plt.figure(figsize=(14, 4))
plt.axhline(T2_mu, color='blue')  # center line
plt.axhline(T2_limit, color='red', linestyle='dashed')  # upper control limit
n_train, n_test = T2_train.shape[0], T2_test.shape[0]
plt.scatter(np.arange(n_train), T2_train, c='black', s=100, alpha=0.5)
plt.scatter(np.arange(n_train, n_train + n_test), T2_test, c='blue', s=100, alpha=0.5)
plt.show()

[image: control chart produced by the code above]
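
The control-limit step in the script above can be tried without the pca package. The following is a minimal sketch, assuming the scores are already log-transformed; the helper names `fit_control_limits` and `flag_out_of_control` are made up for illustration, and the data is synthetic rather than real output of the package.

```python
import numpy as np


def fit_control_limits(train_scores, n_std=3.0):
    """Return (center, upper_limit) estimated from in-control training scores."""
    train_scores = np.asarray(train_scores, dtype=float)
    center = train_scores.mean()
    sigma = train_scores.std(ddof=1)  # sample std, the same default pandas uses
    return center, center + n_std * sigma


def flag_out_of_control(scores, upper_limit):
    """Boolean mask of samples whose score exceeds the upper control limit."""
    return np.asarray(scores, dtype=float) > upper_limit


# Synthetic stand-in for log-transformed scores (an assumption, not pca output).
rng = np.random.default_rng(42)
train_scores = rng.normal(loc=2.0, scale=0.5, size=8000)
# Five in-control test points plus one clearly shifted point at the end.
test_scores = np.append(rng.normal(loc=2.0, scale=0.5, size=5), 10.0)

center, upper_limit = fit_control_limits(train_scores, n_std=3)
alarms = flag_out_of_control(test_scores, upper_limit)
print(f"center={center:.3f}, upper limit={upper_limit:.3f}, alarms={int(alarms.sum())}")
```

The limits are estimated once from the training scores and then held fixed for new data, which is the reason the fitted parameters need to be reusable at inference time.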

erdogant commented 3 years ago

Yes, great! I made some minor changes to simplify the usage. The "param_dict" is now named "outliers_params" and is stored in the model object itself, so you no longer need to pass it for outlier detection.

# Training
model = pca(n_components=5, alpha=0.5, n_std=3, normalize=True, random_state=42)
results = model.fit_transform(X=X_train[features])

# Inference: map new data into the fitted PC space.
PC_test = model.transform(X=X_test[features])
# Compute new outliers
scores, _ = model.compute_outliers(PC=PC_test, n_std=3, verbose=3)
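
The idea of keeping the fitted outlier parameters on the object can be illustrated with a toy class. This is only a sketch under stated assumptions: `ScoreMonitor` and its methods are hypothetical and are not the pca package's implementation; only the attribute name `outliers_params` is borrowed from the comment above, and the scoring formula (squared standardized distance summed over columns) is a stand-in.

```python
import numpy as np


class ScoreMonitor:
    """Hypothetical illustration; not the pca package's implementation."""

    def __init__(self):
        self.outliers_params = None  # set by fit(), reused by score()

    def fit(self, X):
        """Estimate per-column mean and variance from training data and store them."""
        X = np.asarray(X, dtype=float)
        self.outliers_params = (X.mean(axis=0), X.var(axis=0))
        return self.score(X)

    def score(self, X):
        """Score rows using the stored parameters; nothing is re-estimated here."""
        if self.outliers_params is None:
            raise RuntimeError("call fit() before score()")
        mean, var = self.outliers_params
        X = np.asarray(X, dtype=float)
        return (((X - mean) ** 2) / var).sum(axis=1)


rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
X_new = rng.normal(size=(10, 3))

monitor = ScoreMonitor()
train_scores = monitor.fit(X_train)
new_scores = monitor.score(X_new)  # no parameter argument needed
print(train_scores.shape, new_scores.shape)
```

Because the parameters live on the object, the caller cannot accidentally score new data against statistics recomputed from that same new data.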