PyMDE getting stuck on "Computing quadratic initialization"

fsaghir commented 3 years ago

Hi.

I have had some on and off luck with using pymde. I am trying the embedding on a dataset of images and the way I have been working on my experiment is as follows:

Extract images from folders and save them as a numpy array
Convert images using a CNN-AutoEncoder (CAE) model to 1x1x8 code
Use pymde to calculate the embeddings
Use HDBSCAN to cluster the embeddings

The experiment works great when i use images from 2 folders (2list.npy). But when I increase the number of images (3list.npy, 10list.npy), the code gets stuck at "Computing quadratic initialization".

In the google drive link below, you will find the following:

Jupyter Notebook
requirements.txt
Numpy array files
CAE model

The code in the notebook is self explanatory. To duplicate the issue, first use "2list.npy", which should run successfully. Then try "3list.npy" and "10list.npy", which should get stuck at "Computing quadratic initialization".

https://drive.google.com/file/d/17kktp6W1Lq_7PjxCiEQBEAk_u-kbDLEN/view?usp=sharing

akshayka commented 3 years ago

Hey, thanks for trying out PyMDE!

Computing the quadratic initialization can take some time if the number of items is large.

I assume you're using the preserve_neighbors function. Can you try the following? Pass in init='random' to the function. That will bypass the quadratic initialization computation.

So, something like

pymde.preserve_neighbors(data, init='random')

The init strategy is explained in the docs here: https://pymde.org/api/index.html#pymde.preserve_neighbors

Let me know if that works?

fsaghir commented 3 years ago

Thanks! That did help. However, I did notice that at times the quadratic initialization for a large dataset would run very quickly (this happens totally at random and I am not able to reproduce this consistently) .

Additionally, the files I have loaded do not have much difference in number of data rows. "2list.npy" has ~25K rows and the quadratic initialization with this dataset runs almost immediately after running the embedding function with preserve_neighbors. But when I do the same with "3list.npy", which has ~38K rows, it takes anywhere from 30-40 minutes to compute the quadratic initialization. My assumption was that for additional 13K rows the code would relatively run faster.

Does duplicate items have an impact on computing quadratic initialization?

akshayka commented 3 years ago

Wow, 30-40 minutes is way longer than I would expect! Computing a quadratic embedding on MNIST (70k items) takes just a few seconds. Thanks for letting me know.

Generally you shouldn't have duplicate items when using PyMD, but I wouldn't expect the presence of duplicates to increase runtime.

I'll take a look at your data later this week.

akshayka commented 3 years ago

@fsaghir,

Apologies for the delay, I was out on vacation last week.

I just took a look at your notebook. I'm actually unable to run it. When attempting to load the encoder, the notebook throws the following error:

OSError: SavedModel file does not exist at:

I changed the relative path to a full path, and got the same error.

I did this in a fresh virtual environment. In fact I was unable to even install the requirements. pip install -r requirements.txt threw an error,

ERROR: Cannot install matplotlib==3.2.2 and matplotlib==3.4.1 because these package versions have conflicting dependencies.

Deleting the matplotlib requirements from the file and trying again led to another error (ERROR: Could not find a version that satisfies the requirement mkl-service==2.3.0), so at that point I gave up on installing the exact requirements.

If you give me the output of the encoder for 2list/3list, (so I don't have to run the tensorflow model), that will make is much easier for me to help/debug this issue. Would that be possible?

fsaghir commented 3 years ago

@akshayka,

This time apologies from my side for a late response. Here is the new link with the requested encoded files:

https://drive.google.com/file/d/1nNjBfK5q6_sZrGj3IHnlku93SPwKEJCw/view?usp=sharing

fsaghir commented 3 years ago

Closed the issue by mistake. Have reopened it.

akshayka commented 3 years ago

@fsaghir, thanks for providing the encoded data.

The data matrices (3list_encoded.py, 10list_encoded.py) have many thousands of duplicated rows (the first has only 510 unique rows, the second has 950 unique rows). This appears to yield very ill-conditioned quadratic MDE problems, which in turn makes computing the spectral initialization via an eigensolver very slow.

The default solver is scipy.eigsh.

There are a few things I can try.

Aggressively use a small number of max iterations for the spectral initialization; the initialization only needs to be okay, ie no need to solve the quadratic problem exactly.
Use LOBPCG by default, instead of eigsh, again with a small number of max iterations.
Use PyMDE's generic L-BFGS algorithm to approximately compute the initialization. As a bonus, this would let users use a GPU to compute the quadratic initialization, which might help for very large problems.

I'm not yet sure what the trade-offs between these options are; I'll evaluate them across several different embedding problems.

For your particular application: your embeddings could be computed much more quickly if you only passed in the unique rows of your data matrices. You can then back out the embeddings for the unique rows to embeddings for the original collection of items (which appears to contain duplicates).

cvxgrp / pymde

PyMDE getting stuck on "Computing quadratic initialization" #34