Closed · youngzw closed this issue 2 years ago
By the way, I wonder why the performance of the SASRec model is higher than reported in the original paper. Here are the results on the ml-1m dataset:
- the original code (re-run by me), sampling 99 negative items: *(screenshot)*
- the original code (re-run by me), sampling 100 negative items: *(screenshot)*
- results reported in the original paper, sampling 100 negative items: *(screenshot)*
- ReChorus (default: sampling 99 negative items): *(screenshot)*

When ranking over all items with `--test_all 1`:
- the original code (modified by me): *(screenshot)*
- ReChorus: *(screenshot)*
I checked the original code of SASRec and found that it adopts a quite different training paradigm. In our framework, a sequence of length 200 is fed into the forward function 199 times; each time, the input consists of one target item and its corresponding history sequence. This is easier to understand and more flexible for designing complex models (similar implementations can be found in RecBole). In the original SASRec code, however, the sequence is encoded only once: the model produces 200 logits, one per position, and uses 199 of them to calculate the loss. This is far more efficient, but it requires the model to be able to output all the logits simultaneously. The difference in training paradigm might also explain the inconsistent performance.
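The efficiency gap can be sketched with a rough token count. This is a back-of-the-envelope comparison, not code from either repository (the function names are mine, and it ignores constant factors such as the quadratic cost of attention within a single pass):

```python
def per_target_cost(seq_len):
    # ReChorus-style paradigm: one forward pass per target item;
    # the pass for target t re-encodes its history prefix of length t.
    # Total tokens encoded = 1 + 2 + ... + (seq_len - 1).
    return sum(t for t in range(1, seq_len))

def encode_once_cost(seq_len):
    # Original SASRec paradigm: the whole sequence is encoded once,
    # yielding a logit at every position in a single pass.
    return seq_len

print(per_target_cost(200))   # 19900 tokens encoded across 199 passes
print(encode_once_cost(200))  # 200 tokens encoded in one pass
```

So for length-200 sequences, the per-target paradigm encodes roughly 100x more tokens per epoch, which is consistent with the slowdown described below.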
Hi THUwangcy.
I used your library to run the SASRec model with the command below (CUDA environment):
The parameters are the same as in the original SASRec paper. However, I find that training is much slower per epoch than the original code, even after modifying the code to avoid running inference at every epoch. https://github.com/THUwangcy/ReChorus/blob/9a4a783de1fdd02292fb95bc3471b8d310d2110a/src/helpers/BaseRunner.py#L121-L126 Can you suggest a solution to this problem?