fredhallgren / inkpca

Incremental kernel PCA
Apache License 2.0

Incremental KPCA for Python 3.6 #3

Open indrajitsg opened 5 years ago

indrajitsg commented 5 years ago

Hi, I am trying to use this in Python 3.6 but it is failing in places. Just wondering if you are planning to release a new version of incremental KPCA that will work with Python 3.6 and above.

fredhallgren commented 5 years ago

Thank you for your interest in the package. I will attempt to adapt it to Python 3.

indrajitsg commented 5 years ago

Hi Fredrik - thanks a lot for your response. I have made some changes to the code to make it run in Python 3.6. However, after running the incremental KPCA on a dataset, I am still uncertain how to extract the components. This is what I am doing:

import numpy as np
from modules.incremental_kpca import IncrKPCA, nystrom_approximation
from modules.kernels import kernel_matrix, rbf, adjust_K, median_distance
from numpy import dot, diag

# Generate random array
np.random.seed(12345)
B = np.random.rand(20, 5)

# Run incremental kpca with rbf kernel
sigma = median_distance(B, n=20)
kernel = lambda x, y: rbf(x, y, sigma)
mmax = 5
m0 = 3

inc = IncrKPCA(B, m0, mmax, adjust=True, kernel=kernel)

I am not sure how to proceed after this.

Thanks.

fredhallgren commented 5 years ago

You can loop over "inc" like so

for i, eigenvalues, eigenvectors in inc:
    pass

and it will return the eigenvalues and eigenvectors in sequence. You could also do

i, eigenvalues, eigenvectors = inc.next()
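
For example, to run the whole incremental decomposition and keep just the final eigenpairs (a sketch, continuing from the inc object constructed in your snippet above):

# Consume the iterator; each step yields the updated eigendecomposition,
# and after the loop the variables hold the final one.
for i, eigenvalues, eigenvectors in inc:
    pass

print(eigenvalues)   # eigenvalues after the last incremental update
print(eigenvectors)  # matching eigenvectors (one per column)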

indrajitsg commented 5 years ago

Hi Fredrik,

I am trying to extract the 5 eigenvalues and 5 eigenvectors from the matrix (B has rank 5). My understanding is that I can take the dot product of the data matrix B with the eigenvectors, which will give me all the components.

So to get the 5 eigenvectors I run the function with m0=4 and mmax=5. Is this the correct approach?

import numpy as np
from modules.incremental_kpca import IncrKPCA, nystrom_approximation
from modules.kernels import kernel_matrix, rbf, adjust_K, median_distance
from sklearn.decomposition import KernelPCA
import matplotlib.pyplot as plt

# Generate random array
np.random.seed(12345)
B = np.random.rand(20, 5)
B

# Run incremental kpca with rbf kernel
sigma = median_distance(B, n=20)
kernel = lambda x, y: rbf(x, y, sigma)
mmax = 5
m0 = 4

inc = IncrKPCA(B, m0, mmax, adjust=True, kernel=kernel)

for i, eigenvalues, eigenvectors in inc:
    print(i)
    print(eigenvalues)
    print(eigenvectors)

The output came out as:

4
[-1.51421695e-12  2.47813156e-01  5.90773766e-01  6.32180703e-01
  1.08606855e+00]
[[-0.4472136   0.51423986 -0.33638641  0.49491613  0.42125951]
 [-0.4472136  -0.56514951  0.54885696  0.32265923  0.27432298]
 [-0.4472136  -0.14015943 -0.25523994  0.25450759 -0.80649476]
 [-0.4472136   0.53042823  0.53106017 -0.46023249 -0.15750251]
 [-0.4472136  -0.33935915 -0.48829079 -0.61185046  0.26841478]]

If I now take the dot product of B with the transpose of the eigenvectors, this will give me the kernel principal components:

B.dot(eigenvectors.T)

fredhallgren commented 5 years ago

The above looks correct for extracting the eigenvalues and eigenvectors.

The eigenvectors of the kernel matrix are proportional to the principal scores of the feature-space data. Depending on the kernel used, the principal components themselves might not be easily accessible, and may be infinite-dimensional (which, for example, is the case with the radial basis function/Gaussian kernel).
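
Concretely, under the usual kernel PCA convention (a sketch, not specific to this package), the scores of the points used in the decomposition are obtained by scaling eigenvector k by the square root of eigenvalue k:

import numpy as np

# Principal scores of the points, up to sign; the clip guards
# against tiny negative eigenvalues produced by rounding.
scores = eigenvectors * np.sqrt(np.clip(eigenvalues, 0, None))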

JrOcira commented 5 years ago

Dear Mr Hallgren, Greetings! I am currently working on incremental kernel PCA with an application; your paper has been very useful and I would like to use your algorithm. I followed your examples and also the conversations you had on the issue "Incremental KPCA for Python 3.6". I was able to run the code on my data, and by setting mmax and m0 to 3 and 2 respectively I am able to get a 3 by 3 eigenvector matrix and a 3 by 1 vector of eigenvalues.

However, my data consists of 400 individuals with 3 variables. If you run the standard R function, you get an eigenvector matrix of dimension total number of individuals by total number of variables (400 rows by 3 columns). I would like to get these eigenvectors for all the individuals in my dataset. My question is: how can I get this matrix using your algorithm? Thanks for your time.

fredhallgren commented 5 years ago

Hi JrOcira,

Glad that you found it useful :) The argument m0 is the initial size of the kernel matrix, built from the first data points; the algorithm then recalculates the eigenvectors as each subsequent data point is added incrementally, until mmax data points have been used. So if you specify mmax to be 400 you should get all the eigenvectors in the end.
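
For example (a sketch reusing the variable names from the snippets above; data here stands for your 400 by 3 matrix):

inc = IncrKPCA(data, 4, 400, adjust=True, kernel=kernel)  # m0 = 4, mmax = 400
for i, eigenvalues, eigenvectors in inc:
    pass  # the final iteration has used all 400 points
# eigenvectors should then contain one eigenvector per data point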

JrOcira commented 5 years ago

Dear Mr Hallgren, Thanks for the reply. I did specify the mmax size as you had suggested. However, I have been receiving an error:

File "D:\JUNIOR\PIPELINE_TRIAL\DEMO1\IKPCA\inkpca\incremental_kpca.py", line 107, in next return next(self)

File "D:\JUNIOR\PIPELINE_TRIAL\DEMO1\IKPCA\inkpca\incremental_kpca.py", line 93, in next rc = self.update_eig()

File "D:\JUNIOR\PIPELINE_TRIAL\DEMO1\IKPCA\inkpca\incremental_kpca.py", line 126, in update_eig col = self.idx[i]

IndexError: index 5 is out of bounds for axis 0 with size 5

############################## Here is the sample code I ran:

# (imports as in the snippets above)
import numpy as np
from incremental_kpca import IncrKPCA
from kernels import rbf, median_distance

infile = open(r'D:\JUNIOR\PIPELINE_TRIAL\DEMO1\IKPCA\data\ACSL6_kernel_projection.txt')
data = []
for line in infile.readlines():
    line = line.split('\t')
    data.append(line)
data = np.asarray(data, dtype=np.float64)

sigma = median_distance(data)
kernel = lambda x, y: rbf(x, y, sigma)
mmax = 360
m0 = 4

inc = IncrKPCA(data, m0, mmax, adjust=False, kernel=kernel)

for g, eigenvalues, eigenvectors in inc:
    print(g)
    print(eigenvectors)
    print(eigenvalues)
##############################

I am wondering what I might have done inappropriately or what I can change in those files. Thanks for your time.

Kind Regards

inesani commented 3 years ago

Hi @cfjhallgren, I am very interested in using your implementation for my research.

I have been trying to understand the output I get by following the guidance you gave to @indrajitsg, using the same data. Unfortunately, I am not getting the results I expected when comparing with sklearn. The data has rank 5, and I use m0=4 and mmax=5. My code is:

import numpy as np
from incremental_kpca import IncrKPCA, nystrom_approximation
from kernels import kernel_matrix, rbf, adjust_K, median_distance
from sklearn.decomposition import PCA

# Generate random array
np.random.seed(12345)
X = np.random.rand(20, 5)
mean = np.mean(X, axis = 0)

# Run incremental kpca with rbf kernel
sigma = median_distance(X, n=20)
kernel = lambda x, y: rbf(x, y, sigma)
mmax = 5
m0 = 4

inc = IncrKPCA(X, m0, mmax, adjust=True, kernel=kernel)

for i, eigenvalues, eigenvectors in inc:
    print(i)
    print(eigenvalues)
    print(eigenvectors)

print('')
print('INKPCA Principal Components:')
print(X.dot(eigenvectors.T))

# Just in case: check equality after mean subtraction -> not equal
# principalComponents_homemade = np.dot(X - np.mean(X, axis = 0), eigenvectors.T)
# print(principalComponents_homemade)

# Comparing with Sklearn
pca = PCA(n_components=5)

print('')
print('SKLEARN Feature Vector:')
print(pca.fit(X).components_)

print('')
print('SKLEARN Principal Components:')
principalComponents = pca.fit_transform(X)
print(principalComponents)

The principal components I get using inkpca and sklearn are very different. Can you provide an explanation or an example of how to use this? That would be of much help. Also, where do I state the number of components I want? I am assuming it is mmax, but that is only because it is the only value that allows me to perform the dot product between X and the eigenvectors obtained without a dimension error. Thank you.

fredhallgren commented 3 years ago

Thank you for your interest in the package! It seems above that the RBF kernel is used in the incremental algorithm; if you use the linear kernel you should be able to get the same result! (If I remember correctly you need to normalize by the eigenvalues.)
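
As a self-contained check of that relation (a plain numpy sketch, independent of the package): for a linear kernel on centred data, scaling the kernel-matrix eigenvectors by the square roots of the eigenvalues reproduces the ordinary PCA scores up to the sign of each column.

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(12345)
X = np.random.rand(20, 5)
Xc = X - X.mean(axis=0)                        # centre the data

K = Xc @ Xc.T                                  # linear-kernel matrix
eigenvalues, eigenvectors = np.linalg.eigh(K)  # ascending eigenvalues

# Take the top 5 eigenpairs in descending order and scale each
# eigenvector by the square root of its eigenvalue
top = np.argsort(eigenvalues)[::-1][:5]
scores = eigenvectors[:, top] * np.sqrt(np.clip(eigenvalues[top], 0, None))

# Agrees with sklearn's PCA scores up to per-column signs
sk = PCA(n_components=5).fit_transform(X)
print(np.allclose(np.abs(scores), np.abs(sk), atol=1e-8))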

All the principal components are returned so you can pick out the ones you want afterwards!

Regards Fred

inesani commented 3 years ago

Thank you so much for your quick reply @cfjhallgren; your package would really help me with my project, I am almost there!

I have done as you said and used a linear kernel like this:

def linear(x, y):
    """
    Linear kernel function
    """
    return np.dot(x, y.T)

But I don't know what exactly you mean by "normalize by the eigenvalues". I've divided each eigenvector by its eigenvalue, but I don't think this is it because the results are still way different from sklearn's. My eigenvalues are also different from the ones sklearn produces. You can see the comparison below:

import numpy as np
from incremental_kpca import IncrKPCA, nystrom_approximation
from kernels import kernel_matrix, rbf, adjust_K, median_distance, linear
from sklearn.decomposition import PCA

# Generate random array
np.random.seed(12345)
X = np.random.rand(20, 5)
mean = np.mean(X, axis = 0)
#X = X - mean

# Run incremental kpca with linear kernel
sigma = median_distance(X, n=20)  # not needed for the linear kernel
kernel = lambda x, y: linear(x, y)

mmax = 5
m0 = 4

inc = IncrKPCA(X, m0, mmax, adjust=True, kernel=kernel)

for i, eigenvalues, eigenvectors in inc:
    print(i)
    print('INKPCA Eigenvalues:')
    print(eigenvalues)
    print('')
    print('INKPCA Eigenvectors:')
    print(eigenvectors)

print('')
print('INKPCA Normalized Eigenvectors:')
eigenvectors = eigenvectors / eigenvalues[:,None] # Not sure of this!
# eigenvectors = eigenvectors.T / eigenvalues[:,None] # Also tried
print(eigenvectors)

print('')
print('INKPCA Principal Components:')
print(X.dot(eigenvectors.T))

# Comparing with Sklearn
pca = PCA(n_components=5)

print('')
print('SKLEARN Singular Values:')
print(pca.fit(X).singular_values_)

print('')
print('SKLEARN Feature Vector:')
print(pca.fit(X).components_)

print('')
print('SKLEARN Principal Components:')
print(pca.fit_transform(X))

Also, I am confused by the mmax argument. In my example my matrix has 5 columns and rank 5, so I understand that I can get at most 5 eigenvectors and 5 eigenvalues, but if I increase mmax it produces more than 5 eigenvector-eigenvalue pairs. Is this normal?
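
To illustrate (a plain numpy sketch): a kernel matrix built from m points is m by m and therefore has m eigenpairs, even though with a linear kernel only rank(X) of the eigenvalues are nonzero.

import numpy as np

np.random.seed(12345)
X = np.random.rand(20, 5)
K = X @ X.T                   # linear kernel over all 20 points
w = np.linalg.eigvalsh(K)     # 20 eigenvalues in total
print((w > 1e-10).sum())      # only 5 are numerically nonzero: rank(X)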

damrob commented 3 years ago

Hi @cfjhallgren, I am a PhD student currently trying to use your package but I am having the same problem as @inesani.

I was wondering if by "normalize by the eigenvalues" you meant using the function rank_one_eigenvectors from the file eigen_update.py. From what I understand it looks like normalization, but I'm not sure how to use it.

Is it possible for you to provide a reproducible example of this behavior based on the example provided by @inesani?

Thank you very much for the package and for your answers.