YingfanWang / PaCMAP

PaCMAP: Large-scale Dimension Reduction Technique Preserving Both Global and Local Structure
Apache License 2.0

Storing PaCMAP on DB? #34

Open guilherme-marchezini opened 2 years ago

guilherme-marchezini commented 2 years ago

Hello. I'm trying to store a PaCMAP model in a database for later transformations. I tried pickling it, but the tree is an annoy.annoy object, which cannot be pickled. I also tried saving the annoy.annoy object with embedding.tree.save('./annoy_object.ann'). That works, but I cannot load it back, since creating a PaCMAP instance does not initialize the annoy.annoy tree. Is there a way to save and load a PaCMAP object or its tree? My main objective is to send it to a DB so that I can transform new incoming data in my clustering pipeline.

Thanks for your attention.

hyhuang00 commented 2 years ago

Have you tried loading the annoy instance directly? It could be done with something like this:

import pacmap

embedding = pacmap.PaCMAP()  # initialize/load the saved pacmap instance
embedding.tree = load_annoy_tree()  # your function that loads the annoy instance

guilherme-marchezini commented 2 years ago

Hello! I tried what you suggested, and also filled in the other attributes that the method requires in order to run:

from annoy import AnnoyIndex

u = AnnoyIndex(0)  # index created with dimension 0
u.load('test.ann')
embedding.tree = u
embedding.xmin = emb_model.xmin
embedding.xmax = emb_model.xmax
embedding.xmean = emb_model.xmean
embedding.tsvd_transformer = emb_model.tsvd_transformer
embedding.pair_FP = emb_model.pair_FP
embedding.pair_MN = emb_model.pair_MN
embedding.pair_neighbors = emb_model.pair_neighbors
embedding.n_neighbors = emb_model.n_neighbors
embedding.transform(feature_matrix_c)

But I still get:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/tmp/ipykernel_26194/902635302.py in <module>
----> 1 embedding.transform(feature_matrix_c)

/opt/conda/lib/python3.9/site-packages/pacmap/pacmap.py in transform(self, X, basis, init, save_pairs)
    932                                      self.apply_pca, self.verbose)
    933         # Sample pairs
--> 934         self.pair_XP = generate_extra_pair_basis(basis, X,
    935                                                  self.n_neighbors,
    936                                                  self.tree,

/opt/conda/lib/python3.9/site-packages/pacmap/pacmap.py in generate_extra_pair_basis(basis, X, n_neighbors, tree, distance, verbose)
    417 
    418     for i in range(npr):
--> 419         nbrs[i, :], knn_distances[i, :] = tree.get_nns_by_vector(
    420             X[i, :], n_neighbors_extra, include_distances=True)
    421 

IndexError: Vector has wrong length (expected 0, got 17)

hyhuang00 commented 2 years ago

The problem seems to be in how you initialize the AnnoyIndex. Your data appears to have 17 dimensions, so to load the annoy index you should initialize it with u = AnnoyIndex(17) instead of u = AnnoyIndex(0).
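
Since an Annoy index has to be created with the same dimension it was built with before load() is called, one way to avoid guessing that number is to write it to a small sidecar file when saving. The following is only a rough sketch of that idea, not something PaCMAP provides: the helper names save_index_with_meta and load_index_with_meta, the index_factory parameter, and the ".meta.json" suffix are all made up for illustration. It also assumes the metric should match the one the tree was built with.

```python
import json
from pathlib import Path


def save_index_with_meta(tree, path, dim, metric):
    """Save an Annoy-style index plus the metadata needed to reload it.

    `dim` and `metric` must be the values the tree was built with;
    they are supplied by the caller, not read from the tree.
    """
    tree.save(str(path))
    meta = {"dim": int(dim), "metric": metric}
    Path(str(path) + ".meta.json").write_text(json.dumps(meta))


def load_index_with_meta(path, index_factory=None):
    """Recreate the index with the stored dimension, then load the file."""
    meta = json.loads(Path(str(path) + ".meta.json").read_text())
    if index_factory is None:
        # Imported lazily so the helper can be exercised without annoy.
        from annoy import AnnoyIndex
        index_factory = AnnoyIndex
    tree = index_factory(meta["dim"], meta["metric"])
    tree.load(str(path))
    return tree
```

With helpers like these, the loading side would become embedding.tree = load_index_with_meta('test.ann'), and the dimension no longer has to be typed in by hand.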

guilherme-marchezini commented 2 years ago

For some reason I cannot load the saved index with AnnoyIndex(17). I have to load it with AnnoyIndex(18), but then the transform function crashes. I don't know whether this is a PaCMAP problem or an annoy problem. Either way, it would be nice to have a PaCMAP function that correctly saves and loads its models.

hyhuang00 commented 2 years ago

I see. We will work on that feature.
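
In the meantime, a possible workaround for the original goal of keeping the model in a DB is to store the saved index file and the remaining attributes as two BLOBs. This is a minimal sketch using SQLite from the standard library, not a PaCMAP feature: the table name pacmap_models and the helpers store_model and fetch_model are hypothetical, and attrs is assumed to be a plain dict of the picklable attributes listed earlier in this thread (xmin, xmax, xmean, and so on). Annoy loads from a file path, so the index bytes are written back to disk on retrieval.

```python
import pickle
import sqlite3
from pathlib import Path


def store_model(conn, name, index_path, attrs):
    """Store the saved index file and the picklable attributes as BLOBs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pacmap_models ("
        "name TEXT PRIMARY KEY, index_blob BLOB, attrs_blob BLOB)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO pacmap_models VALUES (?, ?, ?)",
        (name, Path(index_path).read_bytes(), pickle.dumps(attrs)),
    )
    conn.commit()


def fetch_model(conn, name, index_path):
    """Write the index bytes back to index_path and return the attributes."""
    row = conn.execute(
        "SELECT index_blob, attrs_blob FROM pacmap_models WHERE name = ?",
        (name,),
    ).fetchone()
    if row is None:
        raise KeyError(name)
    Path(index_path).write_bytes(row[0])
    # Only unpickle data that you stored yourself.
    return pickle.loads(row[1])
```

After fetch_model returns, the index file at index_path can be loaded into a fresh AnnoyIndex and the attributes reassigned to a new PaCMAP instance, as in the snippet posted above.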