erdogant / pca

pca: A Python Package for Principal Component Analysis.
MIT License

Enable collecting the T2 and SPE parameters for reuse in future monitoring (quality control chart context) #16

Closed hovinh closed 3 years ago

hovinh commented 3 years ago

This PR is related to Issue #15. Problem statement: use the pca package as a monitoring method, in the form of a quality control chart.

Changes I have made:

Code to test out the new change:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pca import pca

# Generate a synthetic dataset and split it into train and test sets
n_total, train_ratio = 10000, 0.8
n_features = 10
my_array = np.random.randint(low=1, high=10, size=(n_total, n_features))
features = [f'f{i}' for i in range(1, n_features+1, 1)]
X = pd.DataFrame(my_array, columns=features)
X_train = X.sample(frac=train_ratio)
X_test = X.drop(X_train.index)

# Training
model = pca(n_components=5, alpha=0.5, n_std=3, onehot=False, normalize=True, random_state=42)
results, param_dict = model.fit_transform(X=X_train[features], row_labels=None, col_labels=None, verbose=3)
T2_train = np.log(results['outliers']['y_score'])
T2_mu, T2_sigma = T2_train.agg(['mean', 'std'])
T2_limit = T2_mu + T2_sigma*3

# Inference
PC_test = model.transform(X=X_test[features], row_labels=None, col_labels=None, verbose=3)
PC_test = np.array(PC_test)
scores, _ = model.compute_outliers(PC=PC_test, n_std=3, param_dict=param_dict, verbose=3) 
T2_test = np.log(scores['y_score'])

# Plot: training scores (black) and test scores (blue) against the control limit
plt.figure(figsize=(14, 4))
plt.axhline(T2_mu, color='blue')
plt.axhline(T2_limit, color='red', linestyle='dashed')
n_train, n_test = T2_train.shape[0], T2_test.shape[0]
plt.scatter(np.arange(n_train), T2_train, c='black', s=100, alpha=0.5)
plt.scatter(np.arange(n_train, n_train + n_test), T2_test, c='blue', s=100, alpha=0.5)
plt.show()
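The chart only shows the limit visually. To flag out-of-control samples numerically, the same mean + 3 sigma rule can be applied to the scores directly. The sketch below is a rough illustration, not part of the pca package: `control_limit` and `flag_out_of_control` are hypothetical helper names, and the scores are synthetic stand-ins for the log-scaled T2 values.

```python
import numpy as np

def control_limit(train_scores, n_sigma=3.0):
    """Return (center, upper_limit) estimated from the training scores only."""
    train_scores = np.asarray(train_scores, dtype=float)
    center = train_scores.mean()
    spread = train_scores.std(ddof=1)  # sample std, the same default pandas uses
    return center, center + n_sigma * spread

def flag_out_of_control(scores, upper_limit):
    """Boolean mask of the samples that exceed the upper control limit."""
    return np.asarray(scores, dtype=float) > upper_limit

# Synthetic stand-ins for log-scaled T2 scores (assumption: not real pca output).
rng = np.random.default_rng(0)
t2_train = rng.normal(loc=2.0, scale=0.5, size=1000)
# One excursion is injected at the end of the test scores on purpose.
t2_test = np.append(rng.normal(loc=2.0, scale=0.5, size=200), 10.0)

center, limit = control_limit(t2_train)
alarms = flag_out_of_control(t2_test, limit)
print(f"center={center:.3f}, limit={limit:.3f}, alarms={int(alarms.sum())}")
```

The limit is derived from the training scores alone, so new batches can be checked against it without refitting.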


erdogant commented 3 years ago

Yes, great! I made some minor changes to simplify the usage. The "param_dict" is now named "outliers_params" and is stored in the object itself, so you no longer need to pass it for outlier detection.

# Training
model = pca(n_components=5, alpha=0.5, n_std=3, normalize=True, random_state=42)
results = model.fit_transform(X=X_train[features])

# Inference: map the new data into the fitted PC space.
PC_test = model.transform(X=X_test[features])
# Compute new outliers
scores, _ = model.compute_outliers(PC=PC_test, n_std=3, verbose=3)
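To illustrate the design choice of keeping the parameters on the object, here is a toy sketch in plain NumPy. It is not the pca package's implementation: `T2Monitor` is a hypothetical class, and the score is a simplified Hotelling-style statistic (sum of squared, variance-scaled PC scores) chosen only to show training-time parameters being reused at inference.

```python
import numpy as np

class T2Monitor:
    """Toy monitor that stores its outlier parameters on the object itself."""

    def __init__(self, n_components):
        self.n_components = n_components
        self.outliers_params = None  # filled in by fit()

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        # Principal axes from the SVD of the centered training data.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        pcs = self.transform(X)
        # Estimate the score statistics once, at training time.
        self.outliers_params = {
            "mean": pcs.mean(axis=0),
            "var": pcs.var(axis=0, ddof=1),
        }
        return self

    def transform(self, X):
        """Project data onto the principal axes learned in fit()."""
        return (np.asarray(X, dtype=float) - self.mean_) @ self.components_.T

    def compute_outliers(self, PC):
        """Score PCs with the stored parameters; nothing is re-estimated."""
        if self.outliers_params is None:
            raise RuntimeError("call fit() before compute_outliers()")
        p = self.outliers_params
        return (((np.asarray(PC) - p["mean"]) ** 2) / p["var"]).sum(axis=1)

# Synthetic data (assumption: stands in for X_train / X_test above).
rng = np.random.default_rng(42)
X_train = rng.normal(size=(800, 10))
X_test = rng.normal(size=(200, 10))

monitor = T2Monitor(n_components=5).fit(X_train)
scores_test = monitor.compute_outliers(monitor.transform(X_test))
print(scores_test.shape)
```

Because the parameters live on the fitted object, the caller no longer has to pass a separate dictionary at inference time.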