RixRasa / Genomics

Train-Test Contamination when performing PCA #1

Closed by vladimirkovacevic 2 months ago

vladimirkovacevic commented 3 months ago

When you build the PCA components from the entire dataset, you contaminate the test set.

DDeki commented 2 months ago

Hello, thank you for pointing out that building the PCA components from the entire dataset contaminates the test set. We have implemented the change as requested. In each iteration of the k-fold cross-validation, PCA is now fitted on the training set only. The fitted PCA model is then used to transform both the training set and the test set.

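from sklearn.decomposition import PCA

# Fit PCA on the training fold only, then apply the same projection to the test fold.
pca = PCA(n_components=50, svd_solver='arpack')
adata_train.obsm['X_pca'] = pca.fit_transform(adata_train.X)
adata_test.obsm['X_pca'] = pca.transform(adata_test.X)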

This way the test set never influences the principal components, so the evaluation on each fold is not affected by this source of leakage.
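
For readers who want to see the whole loop, the per-fold procedure described above might look like the following minimal sketch. It assumes scikit-learn's `KFold` for the splits and a plain NumPy matrix standing in for `adata.X`. The helper name `pca_per_fold`, the fold count, and the random data are illustrative and not taken from the repository.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold


def pca_per_fold(X, n_components=50, n_splits=5, seed=0):
    """Yield (train_idx, test_idx, Z_train, Z_test) for each fold.

    Hypothetical helper: PCA is refitted on the training rows of every
    fold, and the test rows are only projected with that fitted model.
    """
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kfold.split(X):
        # A fresh PCA per fold, so nothing carries over between folds.
        pca = PCA(n_components=n_components, svd_solver="arpack")
        Z_train = pca.fit_transform(X[train_idx])  # fit on training rows only
        Z_test = pca.transform(X[test_idx])        # project test rows, no refit
        yield train_idx, test_idx, Z_train, Z_test


# Synthetic stand-in for an expression matrix (assumption: dense, 200 x 100).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
folds = list(pca_per_fold(X, n_components=10, n_splits=5))
```

Each tuple in `folds` holds the embeddings for one split, which can then be passed to whatever model is being cross-validated. Note that the `arpack` solver requires `n_components` to be strictly smaller than both the number of training rows and the number of features.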