HKUDS / RLMRec

[WWW'2024] "RLMRec: Representation Learning with Large Language Models for Recommendation"
https://arxiv.org/abs/2310.15950
Apache License 2.0

About data processing #9

Closed 97z closed 3 weeks ago

97z commented 1 month ago

Hi 👋, thanks for your great work! I am trying to run your code on my own dataset. For a fair comparison, could you upload the complete data-processing code? I find that the number of users in the raw data is more than 10999, even when I filter with 3-core. Also, how did you generate usr_prf.pkl, usr_emb_np.pkl, itm_prf.pkl, and itm_emb_np.pkl? Thank you for any reply!

Re-bin commented 1 month ago

Hi 👋!

Thank you for your interest! For the Amazon book dataset, we first filter the interactions with a rating-score threshold of 3. To manage the dataset size, we uniformly sample 20% of the users (25% for Yelp/Steam), which kept the profiling affordable for us, a few dozen dollars with the GPT-3.5-Turbo API at that time. The good news is that this is now much more feasible with the cheaper OpenAI API :)
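
A minimal sketch of this filtering and sampling step (assuming a pandas interaction table with `user`, `item`, and `rating` columns, and reading the threshold as keeping ratings of at least 3; the column names, seed, and threshold interpretation are assumptions, not the exact script we used):

```python
import pandas as pd

def filter_and_sample(df: pd.DataFrame, rating_threshold: float = 3.0,
                      user_frac: float = 0.20, seed: int = 2023) -> pd.DataFrame:
    """Keep sufficiently rated interactions, then uniformly sample a fraction of the users."""
    # Keep interactions whose rating meets the threshold (assumption: rating >= 3).
    df = df[df["rating"] >= rating_threshold]
    # Uniformly sample 20% of the users (25% for Yelp/Steam) and keep all of their interactions.
    sampled_users = df["user"].drop_duplicates().sample(frac=user_frac, random_state=seed)
    return df[df["user"].isin(sampled_users)]
```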

We then apply a 10-core filtering approach to obtain the initial datasets. After that, we use multi-threaded processing, as outlined in our code, to conduct the profiling and text-embedding encoding. Finally, we remove any users or items that were not successfully profiled or encoded (sometimes due to network issues) and format the data into usr_prf.pkl, usr_emb_np.pkl, itm_prf.pkl, and itm_emb_np.pkl.
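
A rough sketch of the 10-core filtering and the final clean-up/packaging step (the helper names, dict layout, and file handling here are illustrative assumptions, not the repository's actual code):

```python
import pickle
import pandas as pd

def k_core(df: pd.DataFrame, k: int = 10) -> pd.DataFrame:
    """Iteratively drop users and items with fewer than k interactions until the table is stable."""
    while True:
        user_counts = df["user"].value_counts()
        item_counts = df["item"].value_counts()
        keep = df["user"].map(user_counts).ge(k) & df["item"].map(item_counts).ge(k)
        if keep.all():
            return df
        df = df[keep]

def finalize(df: pd.DataFrame, usr_prf: dict, itm_prf: dict,
             usr_emb: dict, itm_emb: dict, out_dir: str = ".") -> pd.DataFrame:
    """Drop users/items that were not successfully profiled or encoded, then dump the pickle files."""
    ok_users = set(usr_prf) & set(usr_emb)
    ok_items = set(itm_prf) & set(itm_emb)
    df = df[df["user"].isin(ok_users) & df["item"].isin(ok_items)]
    for name, obj in [("usr_prf.pkl", usr_prf), ("itm_prf.pkl", itm_prf),
                      ("usr_emb_np.pkl", usr_emb), ("itm_emb_np.pkl", itm_emb)]:
        with open(f"{out_dir}/{name}", "wb") as f:
            pickle.dump(obj, f)
    return df
```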

I hope the above information is helpful for you :)

Best regards, Xubin

97z commented 1 month ago


I get it! Thank you for your reply!