amzn / pecos

PECOS - Prediction for Enormous and Correlated Spaces
https://libpecos.org/
Apache License 2.0

Trying to use HybridIndexer for Label Indexing, run into issue where TrieWrapper has no attribute '_sorted' #244

Open preetbawa opened 1 year ago

preetbawa commented 1 year ago

Description

Trying to leverage the XLinear model for an autocomplete suggestion model for our use case. Trie plus hierarchical clustering makes sense here, so we are using the HybridIndexer method, and it runs into an error while building the clusters.

How to Reproduce?

For reasons of compliance, I can't put the data here, but the idea is to create a pandas DataFrame with 3 columns:

a) prev_query, prefix, and next_query (next_query is the label) - it is easy to create a dummy pandas DataFrame with this data (see the sketch below).

b) Here search_session_training_set_sorted is the pandas DataFrame with "previous_query", "prefix", and "next_query" columns, sorted by the label column "next_query".
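A minimal sketch of such a dummy DataFrame; the column names follow this thread, but the values are invented purely for illustration:

```python
import pandas as pd

# Hypothetical dummy data with the three columns used in this thread.
# The actual values are made up and do not come from the real dataset.
search_session_training_set_sorted = (
    pd.DataFrame(
        {
            "previous_query": ["red shoes", "red shoes", "laptop bag", "laptop bag"],
            "prefix": ["red sn", "red sne", "lapt", "laptop b"],
            "next_query": ["red sneakers", "red sneakers", "laptop backpack", "laptop bag leather"],
        }
    )
    .sort_values("next_query")  # sorted by the label column, as described above
    .reset_index(drop=True)
)
```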

Build a one-hot encoded label matrix wrapped in a scipy CSR matrix.

```python
label_y_ohe_matrix = csr_matrix(pd.get_dummies(search_session_training_set_sorted["next_query"]).values).astype(np.float32)
```

Build a unique, sorted set of label strings for the Trie part of the indexing.

```python
labels_unique = set(search_session_training_set_sorted["next_query"].values.flatten())
labels_unique_sorted = sorted(labels_unique)
```

Build a position-weighted, char-level tf-idf vectorizer for the prefix and get the tf-idf vectors for each prefix.

```python
input_x_prefix_list = search_session_training_set_sorted["prefix"].tolist()

tf_idf_prefix_vectorizer = PositionProductTfidf(analyzer="char", ngram_range=(1, 2), dtype=np.float32, strip_accents="unicode")
input_prefix_matrix = tf_idf_prefix_vectorizer.fit_transform(input_x_prefix_list)
```

Build a word-level tf-idf vectorizer for previous_query and get back tf-idf vectors for all previous_query terms.

```python
input_prev_query_list = search_session_training_set_sorted["previous_query"].tolist()
tf_idf_prev_query_vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 1), dtype=np.float32, strip_accents="unicode")

input_prev_query_matrix = tf_idf_prev_query_vectorizer.fit_transform(input_prev_query_list)
```

Horizontally stack the previous query and prefix features into one input feature matrix (CSR format).

```python
input_feature_matrix = normalize(smat.hstack([input_prev_query_matrix, input_prefix_matrix]), "l2", axis=1)
```

Build label features using the PIFA embedding method.

```python
label_features = csr_matrix(
    LabelEmbeddingFactory.create(label_y_ohe_matrix, input_feature_matrix, method="pifa"),
    dtype=sp.float32,
)
```

Do label indexing using the HybridIndexer strategy.

```python
cluster_matrix = HybridIndexer.gen(
    feat_mat=label_features,
    label_strs=labels_unique_sorted,
    depth=2,
    max_leaf_size=100,
    seed=0,
    max_iter=20,
    spherical_clustering=True,
)
```

This last command generates the error below:

```
07/11/2023 16:54:16 - INFO - py4j.java_gateway - Received command c on object id p1
07/11/2023 16:54:16 - INFO - main - Starting Hybrid-Trie Indexing
07/11/2023 16:54:16 - INFO - main - Added all labels to trie. Now building trie till depth = 2
```

```
in build_cluster_chain(self, depth)
     79     def build_cluster_chain(self, depth):
     80
---> 81         cluster_chain = self._build_sparse_cluster_chain_helper(depth=depth)
     82
     83         assert len(cluster_chain) == depth + 1

in _build_sparse_cluster_chain_helper(self, depth)
    162         par_child_smat = smat.coo_matrix(np.ones((self.n_children, 1)))
    163
--> 164         for child_char, child_trie in self.get_children():
    165             child_cluster_chain = child_trie._build_sparse_cluster_chain_helper(depth=depth - 1)
    166             all_cluster_chains += [child_cluster_chain]

in get_children(self)
     29             child_trie._root = child_root
     30             assert isinstance(child_trie._root, pygtrie._Node)
---> 31             child_trie._sorted = self._sorted
     32             yield child_char, child_trie
     33         elif isinstance(self._root.children, pygtrie._OneChild):

AttributeError: 'TrieWrapper' object has no attribute '_sorted'
```

Environment

- Operating system: Databricks Cluster Version 11.3 LTS ML
- Python version: 3.9
- PECOS version: mainline branch
preetbawa commented 1 year ago

I would appreciate feedback on this matter as we are blocked from using HybridIndexer. From what I can tell this attribute is not really used - this code in pecos is in the examples path examples/qp2q/models/indices.py.

nishant2yadav commented 1 year ago

The bug is a result of a pygtrie version mismatch. This code uses version 2.4.2 (as indicated in the requirements file), but newer versions of pygtrie (from 2.5.0 onwards) introduced a small change in the base Trie class of the pygtrie package. In version 2.4.2, the Trie class has an attribute _sorted (this _sorted variable controls whether the Trie children nodes are iterated in a sorted order or not). In version 2.5.0, this has been replaced with self._iteritems, which points to a function that returns a sorted/unsorted list of items.

So there can be two solutions (a sketch of option 2 follows below):

  1. Use pygtrie version 2.4.2.
  2. Replace child_trie._sorted = self._sorted with child_trie.enable_sorting(self._iteritems is self._ITERITEMS_CALLBACKS[1]) in line 43 and line 50 of https://github.com/amzn/pecos/blob/mainline/examples/qp2q/models/indices.py.
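A minimal sketch of what option 2 looks like inside get_children; the surrounding lines are taken from the traceback above, and the exact line numbers in indices.py may have drifted:

```python
child_trie._root = child_root
assert isinstance(child_trie._root, pygtrie._Node)
# pygtrie 2.4.2 (old):
#   child_trie._sorted = self._sorted
# pygtrie >= 2.5.0 (the replacement suggested above):
child_trie.enable_sorting(self._iteritems is self._ITERITEMS_CALLBACKS[1])
yield child_char, child_trie
```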

Hope this will help resolve the issue!

preetbawa commented 1 year ago

> The bug is a result of pygtrie version mismatch. This code used version 2.4.2 (as indicated in the requirements file) but the newer version of pygtrie (from 2.5.0 onwards) introduced a small change in the base Trie class in pygtrie package. In version 2.4.2, Trie class has an attribute _sorted (This _sorted variable controls whether the Trie children nodes are iterated in a sorted order or not.) In version 2.5.0, this has been replaced with self._iteritems which points to a function that returns a sorted/unsorted list of items.
>
> So there can be two solutions:
>
>   1. Use pygtrie version 2.4.2.
>   2. Replace child_trie._sorted = self._sorted with child_trie.enable_sorting(self._iteritems is self._ITERITEMS_CALLBACKS[1]) in line 43 and line 50 of https://github.com/amzn/pecos/blob/mainline/examples/qp2q/models/indices.py.
>
> Hope this will help resolve the issue!

Thanks Nitin for your response. What I did to bypass this before your response was the following:

Add an __init__ method in TrieWrapper:

```python
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self._sorted = None
```

and then wherever child_trie._sorted was being assigned in the code I just hardcoded it to True. I am curious what the impact of traversing children in sorted order or not is, especially since we are building an autocomplete solution as well.

preetbawa commented 1 year ago

Irrespective of hardcoding _sorted, I will try out your suggestion. Thanks so much.

preetbawa commented 1 year ago

Nitin, I have another question. Once we build clusters using Hybrid Indexing, how can I visualize those hierarchical clusters? I want to see which label embeddings are in the same cluster. Also, do the label strs go into those clusters as well? How can I compare what set of labels end up in the same cluster?

preetbawa commented 1 year ago

Second question: I am trying to follow the example in code path examples/qp2q/models/pecosq2q.py.

I am not sure why this is being done in this if-else logic.

I initially build a one-hot encoding of the labels and then convert it to a csr matrix, which is Y here for us. Then I am trying to use the PIFA embedding with the input feature matrix as X. Do I need to do this part: "y[y > 0] = 1."?

lines 514-519:

```python
if self.weighted_pifa:
    label_features = LabelEmbeddingFactory.pifa(X=X, Y=y)
    y[y > 0] = 1
else:
    y[y > 0] = 1.
    label_features = LabelEmbeddingFactory.pifa(X=X, Y=y)
```

thanks

nishant2yadav commented 1 year ago

> i am curious what's impact of traversing children in sorted order or not , especially we are building autocomplete solution as well

I think the default value of the _sorted variable is False in v2.4.2. Setting _sorted to True or False should not make a difference in this code because the query strings are sorted (see line 200) before being inserted into the trie, so the sorted and unsorted orders of child nodes should be the same.
I think it is important to have a consistent order for iterating over trie nodes so that the columns of the final cluster matrix correspond to the right query.
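A quick standalone sanity check of this point using plain pygtrie; the example keys are made up and not from the thread:

```python
import pygtrie

# Hypothetical keys, inserted in sorted order as the indexing code does.
queries = sorted(["red sneakers", "red shoes", "laptop bag", "laptop backpack"])

trie = pygtrie.CharTrie()
for q in queries:
    trie[q] = True

# Default iteration follows the child insertion order; since the keys were
# inserted in sorted order, enabling sorting yields the same sequence.
unsorted_order = list(trie.keys())
trie.enable_sorting(True)
sorted_order = list(trie.keys())
assert unsorted_order == sorted_order
```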

> how can i compare what set of labels end up in same cluster

See Line 26 for more details. The d-th cluster matrix has shape n_{d+1} x n_{d}. If the (i, j)-entry is non-zero, node i at level d+1 is a child node of node j at level d. Each row contains exactly one non-zero entry, as each node has exactly one parent. So, to find which nodes are in cluster j at a given level, look at all rows that have non-zero entries in column j of the corresponding cluster matrix.
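A small sketch of how this lookup could be done in code. It assumes the returned cluster chain behaves like a list of scipy sparse matrices structured as described above, and that label indices follow the same sorted order used to build the one-hot matrix; both assumptions should be verified against your actual objects, and the helper names here are hypothetical:

```python
import scipy.sparse as smat

def members_of_cluster(cluster_chain, level, j):
    """Row indices (nodes at level+1) whose parent is cluster j at `level`."""
    C = smat.csc_matrix(cluster_chain[level])  # (n_{level+1} x n_{level})
    return C[:, j].nonzero()[0]

def labels_in_leaf_cluster(cluster_chain, labels_sorted, j):
    """Label strings that fall in leaf cluster j, assuming rows of the last
    matrix are labels in the same sorted order used for the one-hot matrix."""
    rows = members_of_cluster(cluster_chain, len(cluster_chain) - 1, j)
    return [labels_sorted[i] for i in rows]
```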

> line 514 - 519

y is a count vector storing the number of times a label occurred with datapoint x. For weighted_pifa, feature vectors are computed using a count-weighted aggregation, and once the label features are computed, all the count information is overwritten and y just contains 0/1. If an unweighted average is required, the count information is removed from y by converting it to a 0/1 vector before computing label_features. See the paper for more details on the PIFA method for computing label_features.

If y is already a 0/1 vector, then this if-else will not make any difference in your use-case.
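For reference, a tiny standalone example of that binarization step on a scipy CSR count matrix; the values are made up, not from the thread:

```python
import numpy as np
import scipy.sparse as smat

# A made-up (datapoint x label) count matrix: entry = number of co-occurrences.
y = smat.csr_matrix(np.array([[0, 2, 0],
                              [3, 0, 1]], dtype=np.float32))

# Binarize: any positive count becomes 1, matching the `y[y > 0] = 1.` line above.
y[y > 0] = 1.0

print(y.toarray())
# [[0. 1. 0.]
#  [1. 0. 1.]]
```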

preetbawa commented 1 year ago

One more question about saving the model - I was able to train the model successfully (at least there were no shape errors or other errors).

This is the code snippet I used:

```python
from pecos.xmc.xlinear.model import XLinearModel

xlinear_model = XLinearModel.train(
    input_feature_matrix,
    labels_y_ohe_matrix,
    cluster_matrix,
    threads=16,
    Cp=1.0,
    Cn=1.0,
    threshold=0.1,
)

xlinear_model.save("/dbfs/FileStore/pzn_ai/contextualized_autocomplete/model/")
```

But when I try to save the model to disk as shown above, I get the following error:

```
INFO - pecos.xmc.base - Training Layer 0 of 3 Layers in HierarchicalMLModel, neg_mining=tfn..
07/17/2023 17:22:53 - INFO - py4j.java_gateway - Received command c on object id p0
07/17/2023 17:22:53 - INFO - pecos.xmc.base - Training Layer 1 of 3 Layers in HierarchicalMLModel, neg_mining=tfn..
07/17/2023 17:22:53 - INFO - pecos.xmc.base - Training Layer 2 of 3 Layers in HierarchicalMLModel, neg_mining=tfn..
07/17/2023 17:22:54 - INFO - py4j.java_gateway - Received command c on object id p0

OSError: [Errno 95] Operation not supported
```

Stack trace:

```
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/utils/smat_util.py in save_matrix(tgt, mat)
     95     elif isinstance(mat, smat.spmatrix):
---> 96         smat.save_npz(tgt_file, mat, compressed=False)
     97     else:

/databricks/python/lib/python3.9/site-packages/scipy/sparse/_matrix_io.py in save_npz(file, matrix, compressed)
     71     else:
---> 72         np.savez(file, **arrays_dict)
     73

<__array_function__ internals> in savez(*args, **kwargs)

/databricks/python/lib/python3.9/site-packages/numpy/lib/npyio.py in savez(file, *args, **kwds)
    616     """
--> 617     _savez(file, args, kwds, False)
    618

/databricks/python/lib/python3.9/site-packages/numpy/lib/npyio.py in _savez(file, args, kwds, compress, allow_pickle, pickle_kwargs)
    719         with zipf.open(fname, 'w', force_zip64=True) as fid:
--> 720             format.write_array(fid, val,
    721                                allow_pickle=allow_pickle,

/usr/lib/python3.9/zipfile.py in close(self)
   1169             self._fileobj.write(self._zinfo.FileHeader(self._zip64))
-> 1170             self._fileobj.seek(self._zipfile.start_dir)
   1171

OSError: [Errno 95] Operation not supported

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
OSError: [Errno 95] Operation not supported

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)

in ()
     13 )
     14
---> 15 xlinear_model.save("/dbfs/FileStore/pzn_ai/contextualized_autocomplete/model/")

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/xmc/xlinear/model.py in save(self, model_folder)
    101         with open(f"{model_folder}/param.json", "w", encoding="utf-8") as fout:
    102             fout.write(json.dumps(param, indent=True))
--> 103         self.model.save(path.join(model_folder, "ranker"))
    104
    105     @classmethod

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/xmc/base.py in save(self, folder)
   1317         for d in range(self.depth):
   1318             local_folder = f"{folder}/{d}.model"
-> 1319             self.model_chain[d].save(local_folder)
   1320
   1321     @classmethod

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/xmc/base.py in save(self, folder)
    789         with open("{}/param.json".format(folder), "w") as f:
    790             f.write(json.dumps(param, indent=True))
--> 791         smat_util.save_matrix("{}/W.npz".format(folder), self.W)
    792         smat_util.save_matrix("{}/C.npz".format(folder), self.C)
    793

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/utils/smat_util.py in save_matrix(tgt, mat)
     96         smat.save_npz(tgt_file, mat, compressed=False)
     97     else:
---> 98         raise NotImplementedError("Save not implemented for matrix type {}".format(type(mat)))
     99
    100

OSError: [Errno 95] Operation not supported
```
preetbawa commented 1 year ago

I wonder if the above error is again related to some versioning problem with using Databricks, a different Python version, etc.
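One thing that might be worth ruling out, as a hedged guess rather than a confirmed diagnosis: Errno 95 on /dbfs paths can come from the FUSE mount not supporting the seeks that numpy's zip-based .npz writer performs. A minimal workaround sketch under that assumption, saving to local disk first and then copying (the local path is hypothetical, and shutil is only one possible copy mechanism):

```python
import shutil

# Assumption: the OSError stems from the /dbfs FUSE mount rejecting seeks
# during np.savez, so save to the driver's local disk first, then copy.
local_dir = "/local_disk0/tmp/contextualized_autocomplete_model"  # hypothetical path
dbfs_dir = "/dbfs/FileStore/pzn_ai/contextualized_autocomplete/model"

xlinear_model.save(local_dir)
shutil.copytree(local_dir, dbfs_dir, dirs_exist_ok=True)
```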

nishant2yadav commented 1 year ago

Does it say what the type of the matrix is? Perhaps @rofuyu or @OctoberChang might be able to help with this, as it looks like an issue with core pecos functionality?

preetbawa commented 1 year ago

Let me check why I don't see the error where it shows 'this type not supported' and describes the type; the code is there with raise NotImplementedError, but it doesn't show up in the logging in Databricks.

preetbawa commented 1 year ago

The matrices W and C under the model_chain element are both sparse csr matrices, so why is it having issues? Is some other member causing the problem?

preetbawa commented 1 year ago

@rofuyu or @OctoberChang, can you please shed light on this issue? It's blocking us from saving the model to disk.

nishant2yadav commented 12 months ago

@preetbawa , I know that you checked that the matrices W and C are sparse_csr matrices but can you share more details about the exact exception being raised here? What exactly does the exception message from Line 98 say? This error would not be raised if the matrix being saved was a scipy sparse_csr matrix.