Distributed version of algorithm

mhoward2718 commented 4 years ago

Hello. Thanks for your interesting algorithm and repository.

A nice advantage of many incremental methods is that they are also easily parallelized. Do you think your algorithm could be extended to a version that:

Accepts a set of X files instead of a single file, where each file represents a subset of rows the original matrix
Runs X processes, one on each file, with each process locally performing an incremental update for each row it reads
After all of the files have been processed (or repeatedly after a given number of rows), combines the results of each process together

From a implementation perspective a package like Ray or Dask would probably be easy to use

From a theory perspective we only have to figure out how to combine the results of the separate processes, and how frequently, right? Or is it not that simple?

mhoward2718 commented 4 years ago

This paper describes a distributed PCA algorithm containing two main steps. Maybe a similar technique could be adopted for distributed, incremental KPCA. In the first step each node locally computes the SVD of its data partition, which I think could easily be replaced by your incremental method.

http://www.cs.cmu.edu/~ninamf/papers/distributedPCAandCoresets.pdf

fredhallgren commented 4 years ago

That's an interesting idea! Sorry for the delay in getting back.

I did not think so much about parallelization yet. Would be great to use parallel computation and speed things up. Yup if there is a parallel algorithm for PCA or the eigendecomposition it should be straight-forward to do a parallel incremental algorithm! The matrix multiplications could possibly also be parallelized.

fredhallgren / inkpca

Distributed version of algorithm #6