Closed: a-agmon closed this issue 3 years ago
I noticed the same issue. I tried treating this as one column with a complex
modifier, which makes the result different, but I'm not sure why. Waiting for an answer on this thread.
@a-agmon @kmichael08
Great question, thanks for asking!
Cleora does not work like Matrix Factorization and similar models. The problem you've noticed is actually a geometry-preserving feature of Cleora ;-)
Notice that your graph is bipartite - and this property is captured in the embeddings you get as a result (which can be seen on your 2D projection). Users can easily be similar to each other - if they interact with similar items. Items can easily be similar to each other - if they interact with similar users. But for an item to be similar to a user is hard, because (while they may be directly linked by an edge), their Nth degree neighborhood landscapes are entirely different. In every iteration a node's embedding is replaced by an L2 normalized average embedding of its neighbors, so you can probably imagine what happens in the case of a bipartite graph - a space swapping effect.
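The space-swapping effect described above can be seen in a toy simulation. This is a hypothetical sketch, not Cleora's actual implementation: it just applies the stated update rule (each node's embedding becomes the L2-normalized average of its neighbors' embeddings) to a tiny dense bipartite adjacency matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bipartite adjacency: 3 users x 2 items (1 = interaction).
adj = np.array([[1, 0],
                [1, 1],
                [0, 1]], dtype=float)

dim = 4
users = rng.normal(size=(3, dim))
items = rng.normal(size=(2, dim))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

for _ in range(3):
    # Simultaneous update: each user's new embedding is the normalized
    # average of its items' embeddings, and vice versa.
    new_users = l2_normalize(adj @ items / adj.sum(axis=1, keepdims=True))
    new_items = l2_normalize(adj.T @ users / adj.sum(axis=0)[:, None])
    users, items = new_users, new_items
```

After each step the user vectors lie in the span of the previous item vectors and the item vectors in the span of the previous user vectors, so on a bipartite graph the two sides keep trading spaces rather than converging into one.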
To get the behavior you expect, you can pick any of these 3 options:

1. Take item embeddings from iteration K and user embeddings from iteration K+1 (users are aggregates composed of items)
2. Take item embeddings from iteration K+1 and user embeddings from iteration K (items are aggregates composed of users)
3. Take iterations K and K+1 for both users and items (something in between)

This can definitely work.
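As an illustration of option 1 above, here is a hypothetical sketch (the averaging step is a stand-in for Cleora's propagation, and the iteration count `K` is an assumed parameter): keep a snapshot of the embeddings after every iteration, then pair item embeddings from iteration K with user embeddings from iteration K+1, so both sides land in the same item-derived space and cross-type similarities become meaningful.

```python
import numpy as np

def propagate(adj, users, items):
    """One neighbor-averaging step; `adj` is users x items."""
    u = adj @ items / adj.sum(axis=1, keepdims=True)
    i = adj.T @ users / adj.sum(axis=0)[:, None]
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    i /= np.linalg.norm(i, axis=1, keepdims=True)
    return u, i

rng = np.random.default_rng(1)
adj = np.array([[1, 0], [1, 1], [0, 1]], dtype=float)
users, items = rng.normal(size=(3, 4)), rng.normal(size=(2, 4))

snapshots = []
for _ in range(5):
    users, items = propagate(adj, users, items)
    snapshots.append((users, items))

K = 3  # assumed iteration count; tune per dataset
users_aligned = snapshots[K][0]      # user embeddings from iteration K+1
items_aligned = snapshots[K - 1][1]  # item embeddings from iteration K

# Cross-type cosine similarities are now comparable (rows already unit-norm).
sims = users_aligned @ items_aligned.T
```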
The approach of jointly modeling user-item interactions is an idea transplanted from MF algorithms. It sucks for a few reasons:
If you have 1K different items, you can reasonably expect a 32-dim vector to be able to capture item similarities and dissimilarities. But if a user has interacted with a K-item subset out of those 1K, you can't reasonably expect a 32-dim vector to be able to express all the possible subsets via simple addition/averaging. You'd need a strongly non-linear transform to compress this information.
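A quick back-of-envelope computation makes the capacity argument concrete: the number of K-item subsets of a 1K-item catalog grows combinatorially, far beyond what averaging into a 32-dim vector can keep apart. The numbers below are just illustration, not from the original thread.

```python
from math import comb, log2

n_items = 1000
for k in (5, 10, 20):
    subsets = comb(n_items, k)  # number of distinct K-item interaction sets
    print(f"K={k}: {subsets:.3e} subsets (~{log2(subsets):.0f} bits to distinguish)")
```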
For user-item recommendation purposes, we're feeding Cleora embeddings into EMDE to aggregate user profiles. You can check out some example code here.
General remark: for any reasonable purposes `--dimension 32` is usually too low. Cleora is fast, so `--dimension 1024` would be a reasonable starting point.
Does this clarify the issue?
Best regards,
Jacek
Thank you very much for the detailed answer @ponythewhite - your comment is very helpful. I will surely check your sample code using EMDE.
Hello, thank you very much for this work. The performance of your algorithm is stunning! We are testing Cleora for a user-item embedding task. I have run into a result and am wondering whether this is by design or my mistake. My TSV file is simple and follows the format of "user item":
```
u1 <\t> i1
u2 <\t> i1
u1 <\t> i2
u3 <\t> i2
```
As you can see, the relation between users and items is many-to-many. I'm running a simple embedding task:
```
./cleora --input ~/test.tsv --columns="users items" --dimension 32 --number-of-iterations 4
```
In the resulting embeddings it seems that users and items are "remote" from each other, as in the image below (cluster 0 is users and cluster 1 is items). That is very different from cases in which we used simple matrix factorization, where users were closer to the items they buy than to other items; here it seems that these relationships are somewhat lost. Does my question make sense? Is this result expected in this case?
Many thanks!