eamid / trimap

TriMap: Large-scale Dimensionality Reduction Using Triplets
Apache License 2.0

Illegal instruction error #9

Closed: timjim333 closed 4 years ago

timjim333 commented 4 years ago

Hi, I just installed TriMap and its dependencies (from conda). I ran the NIST digits demo script below, but it crashes with an "Illegal instruction" error:

import trimap
from sklearn.datasets import load_digits
digits = load_digits()
embedding = trimap.TRIMAP().fit_transform(digits.data)

Can you give me some advice on how to debug this? I attached my conda environment (conda_env.txt), if that helps. I'm excited to give this a go!

Many thanks and kind regards, Tim

eamid commented 4 years ago

Hi,

This might be happening because of scikit-learn: https://stackoverflow.com/questions/30440426/why-does-scikit-learn-cause-core-dumped

Can you please try running sklearn.decomposition.TruncatedSVD on your data and see if you get the same error?

Thanks, Ehsan

timjim333 commented 4 years ago

Hi Ehsan,

Thanks for your reply. I gave the following a go, but it seems to work fine:

from sklearn.datasets import load_digits
import sklearn.decomposition 

digits = load_digits()
svd = sklearn.decomposition.TruncatedSVD()
svd.fit(digits.data)

and it returns:

TruncatedSVD(algorithm='randomized', n_components=2, n_iter=5,
             random_state=None, tol=0.0)

So that didn't seem to cause the crash. Do you have any other suspicions?

Thanks! Tim

eamid commented 4 years ago

Is svd.fit_transform also working fine?

Based on the output, my suspicion is that the error is happening at the initialization where PCA(n_components=n_dims).fit_transform(X) is being called.
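In case it helps isolate the problem, that initialization step can be run on its own (a minimal sketch; n_dims=2 is assumed here, matching the default output dimension):

```python
# If the crash happens during TriMap's PCA initialization, the same call
# should fail in isolation when run directly on the digits data.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
init = PCA(n_components=2).fit_transform(digits.data)
print(init.shape)  # (1797, 2)
```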

timjim333 commented 4 years ago

Hi Ehsan,

Thanks for the thoughts. On running svd.fit_transform(digits.data) I get the following output without a crash, so maybe it wasn't this throwing the error.

array([[45.86127719, -1.19239772],
       [55.52967927,  7.86195715],
       [55.8278837 ,  6.91464166],
       ...,
       [65.52698526, 10.65790398],
       [58.60616587, -4.9121261 ],
       [64.44823101, -0.45623615]])

I've tried to step through the code using a breakpoint() and it seems that the AnnoyIndex(dim, metric=distance) call (line 300) inside generate_triplets is what triggers the crash. I'm not familiar with this library, so I'm not sure why this might be! Do you have any ideas?

Thanks again, Tim

timjim333 commented 4 years ago

Hi Ehsan,

Sorry for all the trouble! It turns out that the annoy library was broken, and they have since released a fix.

After doing a conda update on my environment, which pulled a fixed annoy, it runs fine!

Thanks a lot for the support! I look forward to testing it on real data.

Take care, Tim

eamid commented 4 years ago

Hi Tim,

Thank you for the update! Hope you find TriMap useful :)

Thanks, Ehsan

timjim333 commented 4 years ago

Hi Ehsan,

Sorry, a quick question (not related to the bug). Can I check: is it a problem to have violated triplets / a non-zero loss? When I ran the above code with the digits sample dataset, I saw this in the output.

running TriMap with dbd
Iteration:  100, Loss: 50.786, Violated triplets: 0.0514
Iteration:  200, Loss: 47.647, Violated triplets: 0.0482
Iteration:  300, Loss: 45.711, Violated triplets: 0.0463
Iteration:  400, Loss: 44.470, Violated triplets: 0.0450
Elapsed time: 0:00:02.810388

Many thanks and kind regards, Tim

eamid commented 4 years ago

Hi Tim,

Yes, violated triplets are expected: you cannot satisfy in 2D all the triplets sampled in the higher-dimensional space, because the low-dimensional space has fewer degrees of freedom. You can view the fraction of violated triplets as a measure of goodness of fit (lower is better); conceptually, it is similar to the classification error on the training set. In my experience, the percentage of violated triplets is usually high when the dataset is extremely noisy. In this dataset, only 0.045% of the triplets are violated.
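To make the idea concrete, here is a rough sketch of estimating a violated-triplet fraction by hand on synthetic data (uniform random triplets, not TriMap's own nearest-neighbour-weighted sampling):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))   # stand-in for the high-dimensional data
Y = X[:, :2]                     # stand-in for a 2-D embedding

# Sample random triplets (i, j, k) of distinct points and check whether
# the "j is closer to i than k" ordering in X is preserved in Y.
idx = rng.integers(0, len(X), size=(20000, 3))
idx = idx[(idx[:, 0] != idx[:, 1]) & (idx[:, 0] != idx[:, 2]) & (idx[:, 1] != idx[:, 2])]
i, j, k = idx.T
order_hi = np.linalg.norm(X[i] - X[j], axis=1) < np.linalg.norm(X[i] - X[k], axis=1)
order_lo = np.linalg.norm(Y[i] - Y[j], axis=1) < np.linalg.norm(Y[i] - Y[k], axis=1)
print("violated fraction:", np.mean(order_hi != order_lo))
```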

Please let me know if you have further questions.

Thanks, Ehsan

timjim333 commented 4 years ago

Hi Ehsan,

Thanks very much for your explanation! I see, that makes a lot of sense. Do you have any rules of thumb for an acceptable upper bound on that percentage?

Kind regards, Tim

eamid commented 4 years ago

Hi Tim,

The number really depends on the dataset (and how many noisy/outlier points it has). For instance, on the TV News dataset (which has a bunch of outliers), the percentage of violated triplets is around 0.24%. I cannot say for sure what the upper bound might be, but generally the number is much less than 1%.

Best, Ehsan

timjim333 commented 4 years ago

Hi Ehsan,

Many thanks. I appreciate your time and tips! I'll try it out with my dataset and see what I get. I was looking at your plots in your paper - I was wondering, based on the 2D output in your embedding array, how do you then classify the groups (in order to separate them into the different categories)? Apologies if this is a simple question!

Kind regards, Tim

eamid commented 4 years ago

Hi Tim,

In general, the label information is not used for finding the embedding; the labels are used only for plotting. In other words, you can use the labels to color the data points in a scatter plot.

You can also use the labels to define quality measures. For instance, assume that the labeling of the data is the optimal way of clustering the points. Then, by clustering the points in the low-dimensional embedding and comparing the clusters to the labels (i.e. how many pairs of points have identical labels and fall in the same cluster; see the Rand index), you can define some type of score. However, the quality of these types of measures depends on how accurate the labeling is and how well the classes are separated in the first place. In the paper, we only use the label information for coloring, and use nearest-neighbor accuracy and the global score as the main quality measures.
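As an illustration of the clustering-based score idea, here is a sketch using scikit-learn's adjusted Rand index, with a PCA embedding standing in for TriMap output:

```python
# Cluster the low-dimensional points and compare the clusters against the
# known labels with the adjusted Rand index (1.0 = perfect agreement).
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

digits = load_digits()
embedding = PCA(n_components=2).fit_transform(digits.data)  # stand-in embedding
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(embedding)
print("ARI:", adjusted_rand_score(digits.target, clusters))
```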

Hope this answers your question.

Best, Ehsan

timjim333 commented 4 years ago

Hi Ehsan,

Sorry, maybe I need to play with it a bit more to understand what you mean. Where might I find the label information? As far as I can tell, the embedding is an n×2 array, so I was not sure how to identify which groups of data belonged together!

Thanks again, Tim

eamid commented 4 years ago

Hi Tim,

You can access the labels in this particular dataset using the digits.target key. In order to visualize this example, you can do

import trimap
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

digits = load_digits()
labels = digits.target

embedding = trimap.TRIMAP().fit_transform(digits.data)
plt.scatter(embedding[:,0], embedding[:,1], s=0.1, c=labels)
plt.show()

For other datasets, you will need to figure out how to access the labels. Please let me know if you have any questions.

Best, Ehsan

timjim333 commented 4 years ago

Dear Ehsan,

Thank you very much! It has clicked into place now. Sorry, that was a daft question. I didn't realise that the labels were stored as an n×1 array in a separate attribute, and that the embedding does not rearrange the order of the points.

Thanks again for your help! Take care, Tim

eamid commented 4 years ago

Hi Tim,

No problem, glad that I could help :)

Thank you, Ehsan

timjim333 commented 4 years ago

Hi Ehsan,

Thanks for the help so far (and beautifully coded scripts). I thought you might be well placed to answer a question I've been mulling over, if you have a spare moment! Is there a big difference between this type of method (such as the ones you compare in your paper) and self-organising maps? Both SOMs and TriMap seem able to map global feature spaces accurately, so is there a particular advantage or disadvantage that would lead you to choose one over the other? (In my case, I want to represent my multi-objective optimisation results more insightfully.)

Many thanks and kind regards, Tim

eamid commented 4 years ago

Hi Tim,

This is an excellent question. We did not compare to SOMs in the paper because I could not find an implementation that could scale to the size of the datasets we tried. SOM performs reasonably well at preserving the global structure of the data, but locally the results are not too great. On the smaller datasets where I could try SOM, TriMap provided much better results.

I do not have a good explanation for the strong global performance of SOM (personally, I have not used SOMs a lot). It might be related to the way SOM finds the embedding: the neighborhood structure is maintained by a neighborhood function while, globally, the points are allowed to move freely.

If you find a better explanation for this please let me know. I am also curious.

Thanks, Ehsan

SamGG commented 4 years ago

Hi, good discussion. I don't have an answer either. I just want to point you to two implementations that were not considered in your repository's figures and that scale to millions of points. Best. https://github.com/LCSB-BioCore/GigaSOM.jl https://github.com/omiq-ai/Multicore-opt-SNE

eamid commented 4 years ago

Thanks for the pointers @SamGG! These look great, I will definitely give them a shot.

I am currently working on a JAX implementation of TriMap (which will hopefully run even faster). I will consider comparing to these works in the future.

Thank you, Ehsan

timjim333 commented 4 years ago

Thanks for the thoughts @eamid and @SamGG. I'll see if I can spot any difference between the two approaches on my optimisation dataset. Thanks again, and I look forward to hearing about any findings if you come across anything!