HestiaSky / E4SRec


The performance gap compared to P5 #1

Closed · woriazzc closed this issue 8 months ago

woriazzc commented 9 months ago

Hi! This is great work, and thanks for publishing the code.

However, according to the results in your paper, the performance appears to be much worse than that of P5 ("Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)"), especially on Yelp (H@5 0.0266 vs. 0.0574) and Sports (H@5 0.0281 vs. 0.0387). I have checked that the dataset statistics and the evaluation methods are the same in your paper and in P5, so I am really curious about the reason behind this large performance gap. Is there any difference in design choices or implementations?
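For reference, the H@5 numbers above denote Hit Ratio at cutoff 5. Below is a minimal sketch of how this metric is typically computed; the function and variable names are illustrative and are not taken from the E4SRec or P5 codebases.

```python
# Minimal sketch of the Hit@K metric (H@5 / HR@5) referenced above.
# `ranked_items` holds each user's recommendation list sorted by
# predicted score; `targets` holds the held-out ground-truth items.

def hit_at_k(ranked_items: list[list[int]], targets: list[int], k: int = 5) -> float:
    """Fraction of users whose held-out item appears in their top-k list."""
    hits = sum(1 for ranked, t in zip(ranked_items, targets) if t in ranked[:k])
    return hits / len(targets)

# Example: the first user's target (9) is in the top 5, the second's (1) is not,
# so HR@5 = 1/2 = 0.5.
print(hit_at_k([[3, 7, 1, 9, 4], [2, 8, 5, 6, 0]], [9, 1], k=5))
```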

HestiaSky commented 9 months ago

Thanks for your interest in our work. The original implementation of P5 is defective. Please refer to the later version of P5, called OpenP5 [1] (https://github.com/agiresearch/OpenP5/), and their follow-up work focusing on item IDs [2] (https://github.com/Wenyueh/LLM-RecSys-ID/). A fair comparison should be conducted under the same settings, including the data split, input features, and evaluation protocol.

[1] OpenP5: Benchmarking Foundation Models for Recommendation
[2] How to Index Item IDs for Recommendation Foundation Models
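For context on what "same settings" typically covers here: sequential recommendation papers in this line of work commonly use a leave-one-out data split (the most recent interaction for testing, the second most recent for validation). The sketch below only illustrates that convention; whether both codebases apply it identically is exactly what a fair comparison would need to verify.

```python
# Hedged sketch of the leave-one-out split commonly used in sequential
# recommendation evaluation. This is a general convention, not code from
# the E4SRec or P5 repositories.

def leave_one_out(user_seq: list[int]):
    """Split one user's chronologically ordered interactions."""
    train = user_seq[:-2]   # everything before the last two interactions
    valid = user_seq[-2]    # second-to-last item for validation
    test = user_seq[-1]     # most recent item for testing
    return train, valid, test

train, valid, test = leave_one_out([10, 4, 7, 22, 3])
print(train, valid, test)  # [10, 4, 7] 22 3
```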