Open skleee opened 1 month ago
Hi @skleee ,
Thanks for letting us know about this potential issue. From the attached code block, it appears that the text of the last item is leaked during user ID generation. If this is confirmed, we greatly appreciate your help in finding this issue, and we'll update the results in the paper accordingly.
In the meantime, I would like to say that, as noted in the paper, using user IDs is an optional choice and does not significantly improve the results (please refer to the attached Table 6 from the paper). Could you please use the method without user IDs for experiments at this time?
If your observation is correct, I suspect that using user IDs may even hurt the performance of the method, and that solely using generated item IDs leads to the best performance. However, I need some time to double-check and correct the results accordingly. Once again, thank you very much for pointing this out.
Hi @chrisjtan ,
Thank you for your quick response; I'll wait for the updated results.
By the way, if I want to train the model without using user IDs, should I use `prompt_no_user_id.txt` instead of `prompt.txt` as the `prompt_file` argument?
There seem to be some differences between `prompt_no_user_id.txt` and `prompt.txt`. The prompt from `prompt_no_user_id.txt` sets `{dataset} {target}` as the target, and `{dataset}` is included in the prompt as if it were a straightforward task rather than a sequential one:
```
sequential; seen; Considering {dataset} user has interacted with {dataset} items {history} . What is the next recommendation for the user ?; {dataset} {target}
```
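For illustration, here is a minimal sketch of how such a semicolon-separated template line might be filled during training, assuming standard `str.format`-style substitution. The dataset name, history string, and target ID are made-up placeholder values, not actual repository data:

```python
# Hypothetical illustration of filling the prompt template above.
# Placeholder names follow the template line; the values are invented.
input_template = (
    "Considering {dataset} user has interacted with {dataset} items {history} . "
    "What is the next recommendation for the user ?"
)
target_template = "{dataset} {target}"

filled_input = input_template.format(dataset="Beauty", history="id_1 id_2 id_3")
filled_target = target_template.format(dataset="Beauty", target="id_4")

print(filled_input)   # the user's textual ID never appears in the input
print(filled_target)  # the target is prefixed with the dataset name
```

Note that, unlike the prompts in `prompt.txt`, no user ID placeholder appears anywhere in the input side of the template.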
I'm also curious if any additional configurations need to be changed. Thanks!
Hi @skleee,
Thank you for your patience.
Unfortunately, after fixing the data leakage issue, I found that all the results showed a significant drop, even when not using the user ID at all. I haven't yet figured out which implementation error caused the unusual results previously.
In any case, I have re-run all the experiments without using the user ID, and the initial results are as follows. If you want to compare against the method as a baseline, please use these numbers rather than the ones in the paper.
I would still say that the method may be the best overall compared to other baselines (significantly better on Beauty and Toys than other methods, worse than P5_CID and P5_SemID on Sports, and worse than P5_SID on Yelp). However, the results are clearly much worse than originally reported in the paper, and we need to update not only the table but also some of the claims made in the original paper.
Regarding your question, I’ve updated the prompt file to remove user IDs. I haven’t tried tuning the training parameters yet, but it would be great if you could do so.
Thank you very much for helping us identify this error and preventing it from misleading other researchers.
Hi @chrisjtan,
Thanks for running the experiments and double-checking the performance without using the user ID. I also appreciate your kind response to clear up any errors and confusion :)
Thank you for your interesting work. I believe the idea of aligning LLM and RecSys knowledge via textual IDs is very promising and innovative.
While examining your code, I noticed a data leakage issue in Line 557 of indexing.py:
https://github.com/agiresearch/IDGenRec/blob/543dfecf248a75da10a19c393680bed0c2806db8/src/utils/indexing.py#L556-L570
This line is used to generate a user ID for both training and testing. For example, given a user sequence from `user_sequence.txt`, the textual ID for user `A1YJEY40YUW4SE` is generated by concatenating the textual IDs of all items in the sequence, including `B004756YJA`, `B004ZT0SSG`, ..., `B002WLWX82`.
The concern is that in the leave-one-out setting, the model predicts the textual ID of the last item (e.g., `B002WLWX82`), while the user ID, which includes information about this target item, is used as input during testing. This may leak information about the target item into the user ID.

To investigate this, I modified line 557 in `indexing.py` to exclude the target item when creating the user ID: `text = " ".join(text[:-1])`. Unfortunately, I observed a performance drop on the Beauty dataset.

If I have misunderstood any of the implementations, please let me know. I'm looking forward to your response for a clearer understanding of the methodology.