CUDA runs out of memory because currently I allocate a learnable embedding vector for every item. When the hidden dimension is large, the entire embedding matrix cannot fit into GPU memory.
One possible solution is to use sparse updates as in #1676, which keeps the embedding matrix on the CPU and updates it with a sparse optimizer.
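A minimal sketch of that idea in plain PyTorch; the sizes and the `lookup` helper are placeholders of mine, not the example's actual code:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
num_items, hidden_dim = 1_000_000, 1024

# Keep the large learnable embedding on the CPU and mark it sparse,
# so only the rows touched in a minibatch receive gradients.
item_emb = nn.Embedding(num_items, hidden_dim, sparse=True)  # lives in host memory

# SparseAdam is designed for sparse gradients and only updates the touched rows.
emb_opt = torch.optim.SparseAdam(item_emb.parameters(), lr=3e-4)

def lookup(item_ids):
    # item_ids may live on the GPU; do the lookup against the CPU table,
    # then ship only the minibatch rows to the GPU.
    rows = item_emb(item_ids.cpu())
    return rows.to(item_ids.device)
```

After `loss.backward()` you would call `emb_opt.step()` alongside the optimizer for the GPU-resident parameters; the embedding matrix and its optimizer state stay in host memory, and only the rows that appear in a minibatch are updated.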
Also, there are multiple reasons why the accuracy seems low. For one, this implementation is actually an unsupervised learning model that learns item embeddings and recommends to a user by finding items similar to the latest item they have interacted with. This may not be ideal for this particular dataset or task. One alternative is to find items similar to the average of all the item representations in the user's history (as in FISM), which may yield better results; there are of course many other options.
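For illustration, a rough sketch of the two scoring strategies, assuming `h_items` is the matrix of learned item representations and `history` holds the user's interacted item IDs in chronological order (both names are mine):

```python
import torch

def recommend_by_last_item(h_items, history, k=10):
    """Score items by similarity to the user's most recent item
    (roughly what the current example does)."""
    query = h_items[history[-1]]             # (d,)
    scores = h_items @ query                 # dot-product similarity to every item
    scores[history] = float("-inf")          # mask already-seen items
    return scores.topk(k).indices

def recommend_by_history_mean(h_items, history, k=10):
    """FISM-style alternative: score items by similarity to the average
    of all item representations the user has interacted with."""
    query = h_items[history].mean(dim=0)     # (d,)
    scores = h_items @ query
    scores[history] = float("-inf")
    return scores.topk(k).indices
```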
Thank you for the fix. Now the code runs with a higher hidden dimension. However, can you tell me why you allocate a learnable embedding for every item? In the PinSage paper (https://arxiv.org/abs/1806.01973), they say that their system does not directly learn node embeddings, since that would make the number of parameters linear in the size of the graph. Is there anything I am missing?
The Pinterest dataset on which PinSAGE was developed has richer and more representative features, e.g. images and text. However, entity features in most recommender system datasets such as Nowplaying-RS or MovieLens are often simply categorical and numeric, which are arguably less representative. In the latter case, attaching a learnable embedding can usually improve the performance of recommendation.
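As a concrete (hypothetical) illustration of what "attaching a learnable embedding" means here, one could combine projected categorical/numeric features with a per-item ID embedding; the feature names and sizes below are made up:

```python
import torch
import torch.nn as nn

class ItemInput(nn.Module):
    """Build an item's input representation from weak categorical/numeric
    features, plus an optional learnable per-item ID embedding."""
    def __init__(self, num_items, num_genres, hidden_dim, use_id_embedding=True):
        super().__init__()
        self.genre_proj = nn.Linear(num_genres, hidden_dim)  # multi-hot genre vector
        self.year_proj = nn.Linear(1, hidden_dim)            # a single numeric feature
        self.id_emb = (nn.Embedding(num_items, hidden_dim)
                       if use_id_embedding else None)

    def forward(self, item_ids, genres, year):
        h = self.genre_proj(genres) + self.year_proj(year.unsqueeze(-1))
        if self.id_emb is not None:
            # The learnable ID embedding compensates for uninformative features,
            # at the cost of making the model transductive.
            h = h + self.id_emb(item_ids)
        return h
```

Dropping the ID embedding leaves a purely feature-based, inductive model, which is closer to the original PinSage but tends to work worse when the features are weak.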
Are learnable item embeddings a requirement for unsupervised learning? I think unsupervised learning is training with edges in the graph (without knowing the node labels), which can be done with either learnable or fixed embeddings. Also, if item embeddings are learnable, then how can PinSage be inductive (i.e. extend to unseen nodes)? This paper (https://arxiv.org/pdf/1603.08861.pdf) says that "since the embeddings are learned based on the graph structure, the above method is transductive, which means we can only predict instances that are already observed in the graph at training time."
Both are very good questions.
> Are learnable item embeddings a requirement for unsupervised learning? I think unsupervised learning is training with edges in the graph (without knowing the node labels), which can be done with either learnable or fixed embeddings.
No, learnable item embeddings and unsupervised learning are not related. The reason to include learnable item embeddings here is that the node features we have in traditional recsys datasets like MovieLens or Nowplaying-RS are not enough to distinguish different items. If you have rich features, then you don't need learnable item embeddings.
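To make that first point concrete, here is a sketch of an edge-driven, unsupervised objective in the spirit of PinSage's max-margin loss; whether `h_query`/`h_pos`/`h_neg` come from fixed features or from learnable embeddings makes no difference to the loss (the names and margin value are mine):

```python
import torch
import torch.nn.functional as F

def max_margin_loss(h_query, h_pos, h_neg, margin=1.0):
    """Unsupervised objective driven purely by graph edges:
    items connected by an interaction edge should score higher than
    randomly sampled negatives by at least `margin`. No node labels
    are needed, and the item representations can come from any source."""
    pos_score = (h_query * h_pos).sum(dim=1)
    neg_score = (h_query * h_neg).sum(dim=1)
    return F.relu(neg_score - pos_score + margin).mean()
```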
> Also, if item embeddings are learnable, then how can PinSage be inductive (i.e. extend to unseen nodes)? This paper (https://arxiv.org/pdf/1603.08861.pdf) says that "since the embeddings are learned based on the graph structure, the above method is transductive, which means we can only predict instances that are already observed in the graph at training time."
PinSage in the original paper is an inductive model. However, to make PinSage work well on MovieLens or Nowplaying-RS I adapted it to become transductive. IMO the vanilla PinSage is not very well suited to those datasets.
Another personal opinion of mine is that inductive GNNs with less informative features are in general a hard problem, and there are some efforts, such as IGMC, that try to maintain inductiveness.
❓ Questions and Help
I am trying to run the PyTorch experiment on the PinSage model. For the nowplaying_rs dataset, a "CUDA out of memory" error occurs if I use hidden dimension 1024. For smaller dimensions the code runs without error; however, the accuracy seems very low. How can I fix the error, and do you have any ideas for improving the accuracy?