cvxgrp / pymde

Minimum-distortion embedding with PyTorch
https://pymde.org
Apache License 2.0
538 stars 27 forks

PyMDE getting stuck on "Computing quadratic initialization" #34

Closed fsaghir closed 3 years ago

fsaghir commented 3 years ago

Hi.

I have had some on-and-off luck with PyMDE. I am trying to embed a dataset of images, and the way I have been running my experiment is as follows:

The experiment works great when I use images from 2 folders (2list.npy). But when I increase the number of images (3list.npy, 10list.npy), the code gets stuck at "Computing quadratic initialization".

In the google drive link below, you will find the following:

The code in the notebook is self explanatory. To duplicate the issue, first use "2list.npy", which should run successfully. Then try "3list.npy" and "10list.npy", which should get stuck at "Computing quadratic initialization".

https://drive.google.com/file/d/17kktp6W1Lq_7PjxCiEQBEAk_u-kbDLEN/view?usp=sharing

akshayka commented 3 years ago

Hey, thanks for trying out PyMDE!

Computing the quadratic initialization can take some time if the number of items is large.

I assume you're using the preserve_neighbors function. Can you try the following? Pass in init='random' to the function. That will bypass the quadratic initialization computation.

So, something like

pymde.preserve_neighbors(data, init='random')

The init strategy is explained in the docs here: https://pymde.org/api/index.html#pymde.preserve_neighbors

Let me know if that works!

fsaghir commented 3 years ago

Thanks! That did help. However, I noticed that at times the quadratic initialization for a large dataset would run very quickly (this happens seemingly at random, and I have not been able to reproduce it consistently).

Additionally, the files I have loaded do not differ much in their number of data rows. "2list.npy" has ~25K rows, and the quadratic initialization with this dataset runs almost immediately after calling the embedding function with preserve_neighbors. But when I do the same with "3list.npy", which has ~38K rows, it takes anywhere from 30-40 minutes to compute the quadratic initialization. My assumption was that an additional 13K rows would not slow the computation down this much.
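One way to see whether a data matrix contains duplicates is to count its unique rows with NumPy. A generic sketch (the array here is synthetic; substitute the arrays loaded from 2list.npy / 3list.npy):

```python
import numpy as np

# Stand-in for encoded image features loaded from a .npy file:
# 50 rows drawn from only 10 distinct rows, so duplicates are guaranteed.
rng = np.random.default_rng(1)
base = rng.standard_normal((10, 4))
X = base[rng.integers(0, 10, size=50)]

# Count distinct rows; a large gap vs. X.shape[0] signals duplication.
n_unique = np.unique(X, axis=0).shape[0]
print(f"{X.shape[0]} rows, {n_unique} unique")
```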

Do duplicate items have an impact on computing the quadratic initialization?

akshayka commented 3 years ago

Wow, 30-40 minutes is way longer than I would expect! Computing a quadratic embedding on MNIST (70k items) takes just a few seconds. Thanks for letting me know.

Generally you shouldn't have duplicate items when using PyMDE, but I wouldn't expect the presence of duplicates to increase the runtime.

I'll take a look at your data later this week.

akshayka commented 3 years ago

@fsaghir,

Apologies for the delay, I was out on vacation last week.

I just took a look at your notebook. I'm actually unable to run it. When attempting to load the encoder, the notebook throws the following error:

OSError: SavedModel file does not exist at:

I changed the relative path to a full path, and got the same error.

I did this in a fresh virtual environment. In fact I was unable to even install the requirements. pip install -r requirements.txt threw an error,

ERROR: Cannot install matplotlib==3.2.2 and matplotlib==3.4.1 because these package versions have conflicting dependencies.                                              

Deleting the matplotlib requirements from the file and trying again led to another error (ERROR: Could not find a version that satisfies the requirement mkl-service==2.3.0), so at that point I gave up on installing the exact requirements.

If you give me the output of the encoder for 2list/3list (so I don't have to run the TensorFlow model), that will make it much easier for me to help debug this issue. Would that be possible?

fsaghir commented 3 years ago

@akshayka,

This time, apologies from my side for the late response. Here is the new link with the requested encoded files:

https://drive.google.com/file/d/1nNjBfK5q6_sZrGj3IHnlku93SPwKEJCw/view?usp=sharing

fsaghir commented 3 years ago

Closed the issue by mistake. Have reopened it.

akshayka commented 3 years ago

@fsaghir, thanks for providing the encoded data.

The data matrices (3list_encoded.npy, 10list_encoded.npy) have many thousands of duplicated rows (the first has only 510 unique rows, the second 950). This appears to yield very ill-conditioned quadratic MDE problems, which in turn makes computing the spectral initialization via an eigensolver very slow.

The default eigensolver is scipy.sparse.linalg.eigsh.
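For context, the spectral initialization boils down to asking an eigensolver for the bottom eigenvectors of a graph Laplacian, and that call is what slows down on ill-conditioned problems. A minimal, self-contained sketch on a toy 5-node path graph (illustrative only, not PyMDE's actual code):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# Adjacency matrix of a 5-node path graph, and its Laplacian L = D - A.
A = sp.diags([1.0, 1.0], offsets=[-1, 1], shape=(5, 5), format="csr")
L = sp.diags(np.ravel(A.sum(axis=1))) - A

# The eigenvectors for the smallest eigenvalues give the spectral layout.
# On ill-conditioned problems, this is the call that can take a long time.
vals, vecs = eigsh(L, k=2, which="SM")
```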

There are a few things I can try.

I'm not yet sure what the trade-offs between these options are; I'll evaluate them across several different embedding problems.

For your particular application: your embeddings could be computed much more quickly if you passed in only the unique rows of your data matrices. You can then map the embeddings of the unique rows back to the original collection of items (which appears to contain duplicates).
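The dedup-and-map-back step can be sketched with NumPy's np.unique. A toy example (the pymde call is left as a commented placeholder, since it depends on the real data):

```python
import numpy as np

# Toy stand-in for an encoded data matrix with many duplicate rows:
# 100 items drawn from only 4 distinct rows.
rng = np.random.default_rng(0)
base = rng.standard_normal((4, 3))
X = base[rng.integers(0, 4, size=100)]

# Embed only the unique rows...
uniq, inverse = np.unique(X, axis=0, return_inverse=True)
# embedding_uniq = pymde.preserve_neighbors(torch.tensor(uniq)).embed()
embedding_uniq = uniq[:, :2]  # placeholder for the computed embedding

# ...then index with `inverse` so every original item, including each
# duplicate, gets the embedding of its unique row.
embedding_full = embedding_uniq[inverse.ravel()]
```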