amzn / pecos

PECOS - Prediction for Enormous and Correlated Spaces
https://libpecos.org/
Apache License 2.0
509 stars 105 forks

Trying to understand Cluster chain shape using hybrid indexer #249

Open preetbawa opened 1 year ago

preetbawa commented 1 year ago

Description

I am using HybridIndexer to do label indexing, using PIFA embeddings for labels built by combining the one-hot-encoded label matrix with the input feature matrix. I ran the following code:

Build a hybrid index using the trie and clustering approaches together, with max trie depth of 3 and max_leaf_size of 100

from pecos.utils.cluster_util import ClusterChain

Hierarchical clustering chain as a list of sparse matrices, one matrix per depth representing the clusters at that depth

cluster_chain: ClusterChain = HybridIndexer.gen(
    feat_mat=label_features,
    label_strs=labels_sorted,
    depth=3,
    max_leaf_size=100,
    seed=0,
    max_iter=40,
    spherical_clustering=True,
)

and then ran this to inspect the shapes:

Explore Shape of Cluster Chain

for i in range(len(cluster_chain)):
    print(type(cluster_chain[i]))
    print(f"shape of current matrix {cluster_chain[i].shape}")

RESULTS:
shape of current matrix (24, 1)
<class 'scipy.sparse.csr.csr_matrix'>
shape of current matrix (184, 24)
<class 'scipy.sparse.csr.csr_matrix'>
shape of current matrix (701, 184)
<class 'scipy.sparse.csc.csc_matrix'>
shape of current matrix (2076, 701)

It makes sense that there are 4 matrices, since the trie depth is 3. But does that mean the hierarchical clustering part has only one level? Also, the last matrix has shape (2076, 701): does that mean there are 701 clusters at the leaf level? My sorted unique labels number only 2076, so why are so many clusters created at the leaf level, or am I misunderstanding this piece? And why is the hierarchical part of the tree flat, with only one level of depth? I would have assumed it would branch out after the trie. Or is it that with hybrid indexing you get only one level of clustering after the trie is built, with those clusters forming the last depth level? That might still make sense, but I am not sure why as many as 701 clusters are built for only ~2k labels.

nishant2yadav commented 1 year ago

Hi @preetbawa, cluster_chain contains a list of cluster matrices encoding the connections between nodes at level i and level i+1. Let's say the root node is level 0. In your example:

a) There are 24 nodes at level 1, encoded in a 24x1 matrix.

b) There are 184 nodes at level 2, and the parent-child relations between the 24 level-1 nodes and the 184 level-2 nodes are encoded in the 184x24 sparse matrix. Each row contains exactly one non-zero entry, corresponding to the parent node: if the (i, j) entry is 1, then node i at level 2 has node j at level 1 as its parent.

c) Similarly, there are 701 nodes at level 3 and finally 2076 nodes at level 4. Level 4 is actually just every label in a singleton cluster, so you can treat level 3 as the final level.
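The parent-child encoding above can be sketched on a toy chain (made-up shapes and names, not the user's actual index), using scipy sparse matrices as ClusterChain does:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3-level chain for illustration: root -> 2 nodes -> 4 leaves.
C1 = csr_matrix(np.array([[1], [1]]))       # 2 x 1: both level-1 nodes hang off the root
C2 = csr_matrix(np.array([[1, 0],           # 4 x 2: leaf i's row has a single 1
                          [1, 0],           #        in its parent's column
                          [0, 1],
                          [0, 1]]))
chain = [C1, C2]

# The parent of leaf 2 is the level-1 node whose column holds the non-zero entry.
parent_of_leaf_2 = C2[2].nonzero()[1][0]
print(parent_of_leaf_2)  # 1

# Every row has exactly one non-zero entry: exactly one parent per node.
print((C2.getnnz(axis=1) == 1).all())  # True
```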

Why are so many clusters created at leaf level?

A lot of these 701 clusters might be singleton clusters. The way cluster_chain is set up, if a certain branch in the tree-based index ends up with just 1 node at level k while the tree depth is d, then this singleton branch is extended to depth d by creating (d - k) dummy internal nodes and chaining them together. This maintains the invariant that all leaf nodes are present at the same depth in the tree.

You can check how many of these 701 clusters are singletons by looking at the columns of the 2076 x 701 matrix. The sum of the jth column is the number of child nodes in cluster j.
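On a toy matrix standing in for the real 2076 x 701 one, that check looks like:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy last-level matrix: column j's sum is the number of labels in cluster j,
# so a column sum of 1 marks a singleton cluster.
last = csr_matrix(np.array([[1, 0, 0],
                            [0, 1, 0],
                            [0, 1, 0],
                            [0, 0, 1]]))
col_sizes = np.asarray(last.sum(axis=0)).ravel()
n_singletons = int((col_sizes == 1).sum())
print(n_singletons)  # 2 of the 3 clusters hold a single label
```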

The depth parameter controls the depth of the trie in the hybrid index (3 in your case). Beyond that depth, HybridIndexer applies hierarchical clustering until each leaf node has at most max_leaf_size datapoints. These singleton clusters are probably created because the index builds a trie over just those 2076 datapoints, which can produce many singleton clusters (this should correspond to the number of unique length-3 prefixes in your datapoints). And after the trie is built, hierarchical clustering is not invoked in this example because all clusters probably already contain fewer than max_leaf_size=100 datapoints.
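A rough way to sanity-check that guess is to count unique length-3 prefixes among the label strings; with a depth-3 trie over characters, each distinct prefix becomes its own branch. A minimal sketch with made-up labels:

```python
# Hypothetical label strings; the real ones would be labels_sorted.
labels = ["apple", "apply", "apricot", "banana", "band"]

# Each distinct length-3 prefix becomes one branch of a depth-3 trie.
prefixes = {s[:3] for s in labels}
print(len(prefixes))  # 3 distinct prefixes: 'app', 'apr', 'ban'
```

If this count is close to 701 for your data, it confirms the trie, not the clustering, is producing the leaf clusters.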

preetbawa commented 1 year ago

Thanks Nishant, I did get the cluster chain part but wanted some confirmation, which you provided in good detail. Here are two questions, if you can please answer: a) Where you mention "node", I am assuming that's a cluster, right? So a cluster at depth i vs. a cluster at depth j, with each cluster holding one or more labels?

b) How can we fix the issue we are facing where each cluster contains only one label? Can we reduce the depth of the trie index part to, say, 1 or 2? Would that help?

nishant2yadav commented 1 year ago

a) Where you mention "node", I am assuming that's a cluster, right? So a cluster at depth i vs. a cluster at depth j, with each cluster holding one or more labels?

Yes, sorry for using nodes and clusters interchangeably. An internal node in the tree corresponds to the cluster containing all leaf nodes (actual datapoints) in that internal node's subtree.

b) How can we fix the issue we are facing where each cluster contains only one label? Can we reduce the depth of the trie index part to, say, 1 or 2? Would that help?

Reducing the trie depth to 1 or 2 can help, but you can also post-process to explicitly merge those singleton branches. Alternatively, you can just use hierarchical-clustering-based indices instead of the hybrid trie + hierarchical clustering indices. Hierarchical-clustering-based indices have a fixed branching factor and can offer reasonable performance when combined with position-weighted tf-idf vectorization. You can also try config-11 from Table 2 of the paper, which respects the structure of the trie while also ensuring a fixed branching factor. It also depends on the size of your label set and the exact distribution of labels: if you have only ~2000 labels or datapoints, then just using hierarchical clustering should work fine, and building a trie-based index will not be as helpful!
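One possible post-processing step (an assumption on my part, not a PECOS API): since each chain matrix is a parent map, multiplying adjacent matrices C_{i+1} @ C_i composes the two maps and removes a level, which collapses the dummy singleton levels described earlier. A toy sketch:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical chain: root -> 2 nodes -> 4 leaves -> dummy singleton level.
C1 = csr_matrix(np.array([[1], [1]]))                         # 2 x 1
C2 = csr_matrix(np.array([[1, 0], [1, 0], [0, 1], [0, 1]]))   # 4 x 2
C3 = csr_matrix(np.eye(4, dtype=int))                         # 4 x 4 identity:
                                                              # each node chained to a dummy parent
# Composing the dummy level into its parent map removes it from the chain.
merged = C3 @ C2                                              # 4 x 2
chain = [C1, merged]
print([m.shape for m in chain])  # [(2, 1), (4, 2)]
```

Here the identity matrix is exactly the dummy-chaining case, so composing it away recovers the original two-level structure without losing any leaf-to-ancestor assignment.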

preetbawa commented 1 year ago

Thanks Nishant, that helps.

preetbawa commented 1 year ago

One quick question: for this issue [https://github.com/amzn/pecos/issues/247] we are unable to save the model to disk. I checked internally, and both the weights matrix and the cluster matrix are csr matrices, so I am not sure why we get an OS error "operation not supported". The folks who wrote that code haven't responded in two weeks; is there another way we can get some help on this issue? It is blocking us from loading the model in an application and doing real-time inference.

thanks