TutteInstitute / vectorizers

Vectorizers for a range of different data types
BSD 3-Clause "New" or "Revised" License

Intuition about WassersteinVectorizer? #114

Closed JaimeArboleda closed 1 year ago

JaimeArboleda commented 1 year ago

Hello! My work involves similarity search and anomaly detection with multiformat data (that is, we have datasets with numerical columns, categorical columns and textual columns). Both problems can be tackled via suitable embeddings. We have been trying several approaches, one of them using BERT-like transformers after tokenizing everything in some way (categorical columns were treated as words, and numerical columns were binned with the bins treated as categories). While the approaches were promising, transformers are very slow for real-time inference...

And that's when I discovered all your incredible work with UMAP and vectorizers! Since then, I have been trying to migrate my whole pipeline to your libraries. However, I wish I could understand better how WassersteinVectorizer works. I have not found any documentation, and I don't yet understand the relationship between the Wasserstein distance and how it can be used for creating embeddings from bag-of-words objects (where each word has an embedding vector representing it). Is there any intuition, paper or YouTube video that I could follow?

Thanks in advance and thanks for sharing your amazing work with the community. :)

lmcinnes commented 1 year ago

So before I start I should note that the WassersteinVectorizer is not going to be super fast for inference (it needs to solve an optimal transport problem). It may be fast enough for your needs, but it is worth being aware of.

The WassersteinVectorizer is based on Linear Optimal Transport (LOT) and essentially implements LOT combined with dimension reduction (since LOT can produce very high dimensional vectors).

How does LOT work? Let's back up a little and just consider using Wasserstein distance between bags of words with associated word vectors. This is often called Word Mover's Distance and makes a lot of sense -- we want to match up words from one bag with words from another in a way that involves the least amount of "movement/work" through word-vector space, and the total amount of movement required is the distance between the bags of words. If we assume this is a good distance, then the question becomes "can we use a vectorization to approximate this?", and it turns out that the answer is yes.
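
To make that concrete, here is a rough sketch of that computation using the POT library and a couple of toy bags of word vectors (this is just an illustration of the idea, not code from the vectorizers package):

```python
import numpy as np
import ot  # POT: Python Optimal Transport

# Toy word vectors for the words of two small bags (illustrative only)
bag_a_vectors = np.array([[0.0, 1.0], [1.0, 0.0]])              # e.g. "frozen", "shrimp"
bag_b_vectors = np.array([[0.1, 0.9], [0.9, 0.2], [0.5, 0.5]])  # e.g. "fresh", "prawns", "seafood"

# How much "mass" each word carries within its bag (e.g. normalized counts)
bag_a_weights = np.array([0.5, 0.5])
bag_b_weights = np.array([0.4, 0.4, 0.2])

# Cost of moving mass between words = distance in word-vector space
cost = ot.dist(bag_a_vectors, bag_b_vectors, metric="euclidean")

# Optimal transport plan: the cheapest way to match up the two bags
plan = ot.emd(bag_a_weights, bag_b_weights, cost)

# Word Mover's Distance between the two bags = total work done by the plan
wmd = np.sum(plan * cost)
print(wmd)
```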

So our goal is to produce vectors such that the distances between the vectors approximates, or correlates with, what we would get as Wasserstein distances between the bags of words. The intuition you want starts by observing that there is an entire space of bags of word-vectors with Wasserstein distance between them, and in fact that space turns out to be a manifold with lots of nice manifold properties. The property we are lacking is that of being a nice linear vector space. On the other hand the tangent space to the manifold at a point is a linear vector space, and is the optimal linear approximation of the manifold at that point. LOT simply picks a point on the manifold and then projects all the data (bags of word-vectors under Wasserstein distance) onto the tangent space of the manifold at that point. The result is a vector space where Euclidean distances approximate Wasserstein distances between the bags-of-words we started with. The approximation is best when we are close to the reference point we took the tangent plane at, so choosing that well based on the data we have will matter, but that's the gist of it.
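
Stripped of all the refinements the actual vectorizer applies, the core of LOT looks something like the following sketch (again using POT, with a made-up toy reference distribution; the square-root weighting at the end is there so that plain Euclidean distances between the outputs line up with Wasserstein distances):

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def lot_embedding(doc_vectors, doc_weights, ref_vectors, ref_weights):
    """Tangent-space (LOT) coordinates of one bag of word vectors,
    taken at a fixed reference distribution."""
    cost = ot.dist(ref_vectors, doc_vectors, metric="sqeuclidean")
    plan = ot.emd(ref_weights, doc_weights, cost)  # optimal transport plan
    # Barycentric projection: where, on average, each reference point's mass lands
    barycentric = (plan @ doc_vectors) / ref_weights[:, None]
    # Displacement of each reference point, weighted so that Euclidean distance
    # between embeddings approximates Wasserstein distance between the bags
    return ((barycentric - ref_vectors) * np.sqrt(ref_weights)[:, None]).ravel()

# A toy reference distribution (in practice you choose this based on the data)
ref_vectors = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
ref_weights = np.full(3, 1.0 / 3.0)

# Two bags of word vectors embedded against the same reference
emb1 = lot_embedding(np.array([[0.1, 0.2], [0.9, 0.8]]), np.array([0.5, 0.5]),
                     ref_vectors, ref_weights)
emb2 = lot_embedding(np.array([[0.0, 1.0], [1.0, 0.0]]), np.array([0.5, 0.5]),
                     ref_vectors, ref_weights)
print(np.linalg.norm(emb1 - emb2))  # ~ Wasserstein distance between the two bags
```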

The WassersteinVectorizer computes a LOT representation and then does dimension reduction on the result to get the dimensionality down to something manageable, and tries to do smart things in picking a point on the Wasserstein manifold etc. etc. LOT also assumes everything is Euclidean, so we also have handling of vector spaces that use cosine distance (since that's pretty common in word-vectors and other embeddings) and outputs a vector space that uses cosine distance as well. In practice we find this all works surprisingly well.
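
In terms of code the usage pattern is roughly the following; the word vectors get passed in alongside the bag-of-words matrix (check the docstrings for the full set of options and exact parameter names):

```python
import numpy as np
import scipy.sparse
from vectorizers import WassersteinVectorizer

# A stand-in bag-of-words matrix (n_docs x n_vocab) -- counts, tf-idf, or
# information weighted counts all make sense here -- plus matching word vectors
bag_of_words = scipy.sparse.csr_matrix(np.random.poisson(0.05, size=(100, 1000)))
word_vectors = np.random.normal(size=(1000, 150))

doc_vectors = WassersteinVectorizer().fit_transform(bag_of_words, vectors=word_vectors)
```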

A final note: If you have large bags of words (i.e. many words in each bag; for our purposes "many" is anything much over 256) then the SinkhornVectorizer is a faster option. It effectively does exactly the same thing as described above but uses Sinkhorn distance instead of Wasserstein distance.
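
If you want a feel for how the two distances differ, POT makes it easy to compute both on a toy problem:

```python
import numpy as np
import ot

a = np.array([0.5, 0.5])             # weights for one bag
b = np.array([0.4, 0.4, 0.2])        # weights for another
cost = np.array([[0.1, 1.0, 0.5],
                 [1.0, 0.2, 0.5]])   # pairwise distances between their word vectors

exact = ot.emd2(a, b, cost)                      # exact Wasserstein cost
regularized = ot.sinkhorn2(a, b, cost, reg=0.1)  # entropy-regularized Sinkhorn cost
print(exact, regularized)
```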

JaimeArboleda commented 1 year ago

Wow, that was an awesome explanation!!! I think I got the main idea clearly (btw, I think Word Mover's Distance is not mentioned in the documentation, and it would be nice to put it somewhere because it's the key term to search for when looking for an explanation).

I was thinking about using ApproximateWassersteinVectorizer, which is faster (I saw this approach used here). But anyway, I wanted to use something that I could more or less understand, so really, really, thanks a lot for your excellent introduction.

I have a couple of questions, but feel free to answer or not:

  1. I guess using tf-idf or information_weighting is essential, in order to discard unimportant words when computing the distance between bags of words, right?
  2. What do you think is the best approach for categorical and numerical values?

Regarding the second question, my naive idea was binning the numerical columns and then treating both numerical and categorical columns as "special words" (i.e., tokens different from any token that could appear in the text). For example, imagine that I had information about purchases, including a goods description, weight, price and country of origin. Well, a document like

{
  "good_description": "frozen shrimps",
  "weight": 40,
  "price": 12,
  "country_origin": "IN"
}

would first be represented as a single text:

"frozen shrimp weight_3 price_4 country_IN"

where every "special word" is treated as a single token.

From this representation, I'd build a token co-occurrence matrix (using the whole document as context) and apply SVD to get word vectors, following the intuition behind GloVe and word2vec (similar words must appear in similar contexts).
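
Just to show what I mean, here is a rough sketch of that pipeline (the binning is hand-waved and the bin labels are made up):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Each record flattened to a single "text", special tokens included
docs = [
    "frozen shrimp weight_3 price_4 country_IN",
    "cellulose from pine trees weight_5 price_6 country_CN",
]

# Document-term counts; with the whole document as the context window,
# the token co-occurrence matrix is just X.T @ X
counts = CountVectorizer(lowercase=False, token_pattern=r"\S+")
X = counts.fit_transform(docs)
cooccurrence = (X.T @ X).astype(float)

# SVD of the co-occurrence matrix gives the word vectors
word_vectors = TruncatedSVD(n_components=2).fit_transform(cooccurrence)
```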

And with word vectors and Word Mover's Distance I could have embeddings of whole documents. Now, my question is: do you think it makes sense to compare the "special words" (like weight_3; the ones that come from cat/num columns) with every other word when matching the words of two documents for LOT, or only among the words in the same category?

My intuition is that those special words should be compared only among themselves (they always exist in every document). For example, imagine that I wanted to compare the previous document with this one:

{
  "good_description": "cellulose from pine trees",
  "weight": 400,
  "price": 72,
  "country_origin": "CN"
}

If, for some weird reason, with vanilla LOT, country_IN ended up paired with cellulose instead of country_CN, wouldn't that be a worse comparison (a worse distance) than forcing the pairing (country_IN, country_CN), and likewise for the rest of the "special words" coming from cat/num columns? In other words, wouldn't it make more sense to solve the LOT problem only for the words in the free-text column?

One way of achieving this, while using your API instead of creating a custom class of my own, could be adding an extra dimension for each cat/num column. For example, if word2vec provided 100 dimensions, for country_of_origin I could add an extra 101st dimension and fill it with 0s for every non-country word and with, say, 99999 for each country_XX word. Then country_IN would almost surely be matched with country_CN, because any other pairing would be heavily penalized by the vectors.
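
In code, the trick I'm imagining is just something like this (toy indices, and the penalty value is arbitrary):

```python
import numpy as np

# word_vectors: the (n_vocab, 100) vectors from the SVD step (random stand-in here)
n_vocab, dim = 1000, 100
word_vectors = np.random.normal(size=(n_vocab, dim))

# Suppose these vocabulary indices correspond to the country_XX tokens
country_token_idx = np.array([3, 17, 42])

# One extra dimension: huge for country tokens, 0 for everything else, so that
# transporting mass from a country token to anything but another country token
# becomes prohibitively expensive
extra = np.zeros((n_vocab, 1))
extra[country_token_idx] = 99999.0
augmented_word_vectors = np.hstack([word_vectors, extra])
```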

Do you think this kind of approach makes sense, or am I overcomplicating things?

lmcinnes commented 1 year ago

I think your approach makes sense, though it is perhaps overcomplicating things. If you really don't want your words to mix with your other data, I think the simplest thing to do would be to use the Wasserstein vectorizer on the bag-of-words data, then separately deal with the other data, vectorizing it appropriately to get vectors of a similar dimension, and then just concatenate the two sets of vectors. It's hard to know beforehand how the whole thing will work, especially for the task you are actually applying it to. It may be worth just trying some of these options (starting with the simplest approach you can manage) and seeing what sorts of results you get.
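
As a sketch of what I mean (the names are made up, and how you scale the two blocks relative to each other is something to experiment with):

```python
import numpy as np

# text_vectors: output of the WassersteinVectorizer on the bag-of-words part
# tabular_vectors: any reasonable vectorization of the cat/num columns,
# reduced to a comparable number of dimensions
text_vectors = np.random.normal(size=(100, 64))      # stand-in
tabular_vectors = np.random.normal(size=(100, 64))   # stand-in

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Normalize each block so neither dominates, then just concatenate
combined = np.hstack([l2_normalize(text_vectors), l2_normalize(tabular_vectors)])
```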

JaimeArboleda commented 1 year ago

That's a very sound idea, thanks a lot again.