darylnak / SEE-Lab

Research work done for the UCSD SEE lab
http://seelab.ucsd.edu/

Optimize Performance of MMD #2

Open thomas9t opened 4 years ago

thomas9t commented 4 years ago

Hi @darylnak,

I took a look over your implementation of the MMD - I think this is slow because it's doing a lot of looping in Python. In general, the more code you can push into built-in NumPy/SciPy/SciKitLearn functions, the faster it will be. The expensive part of evaluating the kernel is computing the matrix of pairwise distances (i.e., the ||x - y|| part of the kernel). Both SciPy and SciKitLearn have built-in functions to do this - I'd recommend checking those out. Note as well that SciKitLearn has several built-in kernel functions which you are welcome to use. It's good practice to implement this stuff on your own to see how it works, but for "production" code it's generally better to use a built-in library, as these are more heavily optimized.
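For instance, a minimal sketch of those two built-ins (scipy.spatial.distance.cdist and sklearn.metrics.pairwise.euclidean_distances; the shapes here are illustrative):

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import euclidean_distances

X = np.random.rand(100, 5)   # 100 points in 5 dimensions
Y = np.random.rand(80, 5)

D_scipy = cdist(X, Y)                  # (100, 80) matrix of ||x - y||
D_sklearn = euclidean_distances(X, Y)  # same quantity via SciKitLearn
assert np.allclose(D_scipy, D_sklearn)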

A couple other comments:

  1. In this line, np.sum(np.square(x - y)), I would recommend using np.linalg.norm instead. All the NumPy functions are just wrappers around C code which does the heavy lifting - there's a (small) overhead to nesting NumPy calls, since a new Python object must be created to store the result of np.square(). Run many thousands of times, these little things add up (see the micro-example after this list).
  2. In general it's good to avoid the eval() function in Python. It's less readable and has security vulnerabilities in some settings, so it's better to avoid it where possible. It looks like you're trying to make your code agnostic to the particular kernel used, which is great! To do this, you can pass a function as an argument to your mmd function. For example:
import numpy as np

def linear_kernel(X, Y):
    # matrix of pairwise inner products
    return X @ Y.T

def compute_mmd(X, Y, kernel):
    Kxx = kernel(X, X)
    Kxy = kernel(X, Y)
    Kyy = kernel(Y, Y)
    # biased estimate of the squared MMD (one standard choice)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

X = np.random.rand(10, 2)
Y = np.random.rand(10, 2)
mmd = compute_mmd(X, Y, linear_kernel)
  3. I'm not familiar with this tqdm function for creating iterators - can you explain what it does?
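Micro-example for point 1 (the vector length is arbitrary): both forms compute the squared Euclidean distance, but the nested version allocates an extra intermediate array.

import numpy as np

x, y = np.random.rand(100), np.random.rand(100)
d2_nested = np.sum(np.square(x - y))   # nested calls, one extra temporary
d2_norm = np.linalg.norm(x - y) ** 2   # single C-level reduction
assert np.isclose(d2_nested, d2_norm)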
thomas9t commented 4 years ago

If you're interested in the gory details of how NumPy and Python actually work under the hood, I'd recommend checking out: NumPy and Python. Both Python and NumPy are exceptionally well documented, and it's fun digging around in the internals to see how stuff actually works.

darylnak commented 4 years ago

@thomas9t, thanks for this information!

I took a look at the SciKitLearn documentation and rewrote the mmd() function using the pairwise_kernels function (documentation) -- this code has been pushed. However, it seems to be crashing my kernel on seiz_Kxx = pairwise_kernels(seizures, seizures, metric='rbf'). I suspect this is still the wrong approach? Should I be looking at this? If so, I'm not sure how to use the kernel function.

tqdm is a library which provides a loading bar. It takes an iterable, e.g. range(30), and displays a progress bar as you iterate through it.
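A minimal example (assuming tqdm is installed):

from tqdm import tqdm

total = 0
for i in tqdm(range(30)):   # prints a progress bar as the loop runs
    total += i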

Also, thanks for letting me know about the bad practice with eval(). I saw some code which used a variable as a function call but couldn't remember the exact syntax. I think your suggestion is what I was looking for.

thomas9t commented 4 years ago

I think pairwise_kernels looks like the correct function to use - I've always just used the low-level kernel functions directly (e.g. sklearn.metrics.pairwise.rbf_kernel and friends), but it's nice that pairwise_kernels lets you pass a string. What exactly is the error you're getting?
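For reference, the two routes should agree - a quick sanity check (shapes illustrative):

import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels, rbf_kernel

X = np.random.rand(5, 3)
K_string = pairwise_kernels(X, X, metric="rbf")  # string dispatch
K_direct = rbf_kernel(X, X)                      # low-level function
assert np.allclose(K_string, K_direct)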

tqdm is a library which provides a loading bar

Cool!

darylnak commented 4 years ago

@thomas9t Jupyter displays the following a few seconds after executing seiz_Kxx = pairwise_kernels(seizures, seizures, metric='rbf'):

[Screenshot: Jupyter kernel crash message (Screen Shot 2020-03-24 at 2.50.25 PM)]

UPDATE 1:

I ran conda update --all and now the code progresses to nseiz_Kxx = pairwise_kernels(non_seizures, non_seizures, metric='rbf'), but with the same issue. I am still wondering whether I am performing the computation efficiently. It might be running out of system resources, but I am not sure. Will continue to investigate.

UPDATE 2: Passing sklearn.metrics.pairwise.rbf_kernel as the metric, i.e. seiz_Kxx = pairwise_kernels(seizures, seizures, metric=sklearn.metrics.pairwise.rbf_kernel), produces the following error:

ValueError: Expected 2D array, got 1D array instead: array=[ 2.03410517e-01 4.51378040e-01 1.06251635e-01 6.95496387e-02 1.96414478e-02 7.51336002e-03 2.27239199e-01 8.19023483e-02 1.77374634e-01 4.95086474e-02 -1.53311966e+00 2.09439866e+00 3.85736441e+03 5.96475750e+03]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
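(For reference, this error just means the pairwise functions expect 2D arrays - a single sample or single feature has to be reshaped first. A generic illustration:)

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

v = np.random.rand(14)                              # 1D: one sample, 14 features
K = rbf_kernel(v.reshape(1, -1), v.reshape(1, -1))  # K has shape (1, 1)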

UPDATE 3: I was using sklearn.metrics.pairwise.rbf_kernel incorrectly. I tried calling it as nseiz_Kxx = sklearn.metrics.pairwise.rbf_kernel(non_seizures, non_seizures) but am getting the same error.

UPDATE 4: It seems nseiz_Kxx = sklearn.metrics.pairwise.rbf_kernel(non_seizures, non_seizures) is trying to allocate 6.18 TiB for an array with shape (921819, 921819) and data type float64, causing the issue. How should I handle this?
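(For reference: 921,819² entries × 8 bytes per float64 ≈ 6.8 × 10¹² bytes ≈ 6.18 TiB, so the full kernel matrix cannot be materialized in memory.)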

darylnak commented 4 years ago

I have reduced the computation time for processing non_seizures to ~3.5 hours using the following technique.

import numpy as np
import sklearn.metrics.pairwise
from tqdm import tqdm

nseiz_Kxx = 0

# split original array into 1000 sub-arrays
batches = np.array_split(non_seizures, 1000)

for batch in tqdm(batches):
    # kernel values between this batch and the full data set
    dot_prods = sklearn.metrics.pairwise.rbf_kernel(batch, non_seizures)
    nseiz_Kxx += np.sum(dot_prods)
thomas9t commented 4 years ago

@darylnak - All of these SciKitLearn functions compute the "kernel matrix" of the data. Say you have a matrix X with N rows and a matrix Y with M rows; then the kernel matrix is the N x M matrix K where K[i,j] = kernel(X[i,:], Y[j,:]), so for the "non-seizure" data this matrix will be very large. Splitting it up into batches is a nice idea! You could also parallelize over the batches to process several at once (check out multiprocessing.Pool - a sketch is below) and speed things up even more.
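A minimal sketch of that idea (illustrative only - the placeholder data, batch count, and pool size are assumptions, not your actual setup):

import numpy as np
from multiprocessing import Pool
from sklearn.metrics.pairwise import rbf_kernel

# Placeholder standing in for the real non_seizures array; the fixed
# seed keeps it identical in every worker process.
rng = np.random.RandomState(0)
non_seizures = rng.rand(10000, 14)

def batch_kernel_sum(batch):
    # sum of RBF kernel values between one batch and the full data set
    return rbf_kernel(batch, non_seizures).sum()

if __name__ == "__main__":
    batches = np.array_split(non_seizures, 100)
    with Pool(processes=4) as pool:
        nseiz_Kxx = sum(pool.map(batch_kernel_sum, batches))
    print(nseiz_Kxx)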

I would also recommend looking into sampling the non-seizure data. However, we need to be careful about how we do this. A big risk in these kinds of applications is false positives (that is, saying something is a seizure when it actually was not). This is because there are lots of things your brain does which are "atypical" but not seizures (e.g. stressful activities like solving math problems), so we need to be careful to include these "atypical" activities as well. I would recommend using a clustering algorithm (for instance k-means or a GMM) to find some clusters of "typical" brain activity. Then you can assign each point a probability of being sampled which is proportional to its distance from a cluster center. In other words, you will be more likely to sample distant points than close ones. This should help make sure that the set you sample is representative of many modalities in brain activity.
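A rough sketch of that sampling scheme (the data, cluster count, and sample size here are placeholders):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(5000, 14)  # placeholder for the non-seizure features

kmeans = KMeans(n_clusters=10, random_state=0).fit(X)
# distance from each point to its assigned cluster center
dists = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# sampling probability proportional to distance, so "atypical" points
# far from any cluster of typical activity are favored
probs = dists / dists.sum()
idx = np.random.choice(X.shape[0], size=1000, replace=False, p=probs)
X_sampled = X[idx]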

darylnak commented 4 years ago

@thomas9t Thanks for this - I was thinking of using the multiprocessing library but didn't know if it was necessary. I have been experimenting with it and am having minor difficulties with implementation and debugging. I have never worked with parallelization before, so it may take me some time. I will update you when I have more.

I would also recommend looking into sampling the non-seizure data.

Will look more into this once I get multiprocessing working.

thomas9t commented 4 years ago

Cool - I'd recommend starting by just randomly sampling the non-seizure observations until you have a small-ish data set. Multiprocessing is useful, but it can be a bit tricky to get working so I usually try to get bugs ironed out on a simple version of code and then gradually add complexity. If you want to try it out just to learn about it though, then by all means go for it. You can sample by doing:

import numpy as np

X = np.random.rand(1000, 3)
# choose 300 distinct row indices uniformly at random
sample = np.random.choice(X.shape[0], size=(300,), replace=False)
X_sampled = X[sample, :]
darylnak commented 4 years ago

I have been doing more work on multiprocessing, but I can't get it to compute the rbf_kernel between the batches and non_seizures - even for something trivial like a matrix multiply. I created a failing example and posted a question on Stack Overflow about it here. Take a look if you're curious.

I'll continue to work more on multiprocessing, but if I can't get it working, I'll start sampling non_seizure and use multiprocessing on that. Either way, I need to figure it out because the remainder of the MMD computation will REALLY benefit.

Will update you when I have more.

darylnak commented 4 years ago

@thomas9t - I managed to get multiprocessing working in code, but it's ~4x slower than processing the data serially. I'm not comfortable with these results, so I'm going to conduct more testing.

thomas9t commented 4 years ago

Okay @darylnak, sounds good. One reason multiprocessing might not be working as well is that the underlying SciKitLearn code to compute the kernel matrix is also multithreaded, so if you launch several parallel jobs those threads will be in contention. Usually, NumPy/SciPy defaults to P = min(nproc, 8) threads, where nproc is the number of processors on your computer. So you can use up to floor(nproc / P) multiprocessing jobs without causing resource contention.
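One common way around this (an assumption about your setup, not something I've verified on your machine) is to cap the BLAS/OpenMP thread pools before NumPy is first imported, then do the parallelism yourself with multiprocessing:

import os
# these must be set before NumPy/SciKitLearn are first imported
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np  # each multiprocessing worker now stays single-threaded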

To improve performance, I'd definitely recommend looking into sampling your data before multiprocessing.

darylnak commented 4 years ago

@thomas9t just wanted to update you on what I've done so far. Since there is a lot of non-seizure data, I am using k-means clustering to separate the data into 10 clusters. I chose this value using the elbow method, which I found here (a sketch is below). All I have left is to process each window of the subject data in the MMD calculation, but I am having a hard time accomplishing this. Specifically, the computation is slow, and I cannot get multiprocessing to speed up this particular part of the process because it hangs if I call the sklearn kernel function before it.
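A sketch of the elbow method as I'm using it (illustrative - the placeholder data and range of k are not the real values):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(2000, 14)  # placeholder for the non-seizure features

ks = range(1, 16)
inertias = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")   # look for the "bend" in the curve
plt.xlabel("number of clusters k")
plt.ylabel("inertia (within-cluster sum of squares)")
plt.show()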

the underlying SciKitLearn code to compute the kernel matrix is also multithreaded

Is this the reason why my code hangs when I try to use multiprocessing after calling the sklearn kernel function? If so, I don't understand why the kernel function seems to complete but then causes the code which uses multiprocessing to hang (very frustrating).

thomas9t commented 4 years ago

Hey @darylnak - I've been doing some more thinking about the project and I have some ideas that I think will help speed things up a bit and likely produce better results (at least based on some preliminary work I did). Do you have some time to talk about this briefly on Monday or Tuesday next week? You might want to skim this paper too just to get the gist of it.

darylnak commented 4 years ago

@thomas9t Tuesday is best for me. I am free all day except between 5-6:20pm. What time works best for you?

thomas9t commented 4 years ago

Sorry for the slow reply - could you do 11:00 AM? We can use my zoom: https://ucsd.zoom.us/j/3186707802

darylnak commented 4 years ago

Can do!

darylnak commented 4 years ago

@thomas9t - I was looking through the code and a few questions came to mind.

>>> import numpy as np
>>> t = [[1,1,1,1],[2,2,2,2],[3,3,3,3],[4,4,4,4]]
>>> t
[[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3], [4, 4, 4, 4]]
>>> t = np.vstack(t)
>>> t
array([[1, 1, 1, 1],
       [2, 2, 2, 2],
       [3, 3, 3, 3],
       [4, 4, 4, 4]])
>>> N = 4
>>> t[:N][N:]
array([], shape=(0, 4), dtype=int64)
>>> t[:N][N:].sum()
0