agiresearch / IDGenRec

Towards LLM-RecSys Alignment with Textual ID Learning

Concerns about data leakage #1

Open skleee opened 1 month ago

skleee commented 1 month ago

Thank you for your interesting work. I believe the idea of aligning LLM and RecSys knowledge via textual IDs is very promising and innovative.

While examining your code, I noticed a data leakage issue in Line 557 of indexing.py:

https://github.com/agiresearch/IDGenRec/blob/543dfecf248a75da10a19c393680bed0c2806db8/src/utils/indexing.py#L556-L570

This line is used to generate a user ID for both training and testing. For example, if the user sequence in user_sequence.txt is:

A1YJEY40YUW4SE B004756YJA B004ZT0SSG B0020YLEYK 7806397051 B002WLWX82

The textual ID for user A1YJEY40YUW4SE is generated by concatenating the textual IDs of all items, including B004756YJA, B004ZT0SSG, ..., B002WLWX82.

The concern is that, in the leave-one-out setting, the model is asked to predict the textual ID of the last item (e.g., B002WLWX82) at test time, while the user ID given as input already encodes that target item. Information about the target can therefore leak into the test input through the user ID.
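For concreteness, here is a minimal sketch of my reading of that step (the variable names and the textual-ID lookup are placeholders, not the actual code in indexing.py):

```python
# Minimal sketch of my reading of indexing.py around line 557;
# variable names and the textual-ID lookup are placeholders, not the actual code.
user_sequence = ["A1YJEY40YUW4SE", "B004756YJA", "B004ZT0SSG",
                 "B0020YLEYK", "7806397051", "B002WLWX82"]
user, items = user_sequence[0], user_sequence[1:]

# Assume each item already has a generated textual ID.
item_textual_id = {item: f"<textual id of {item}>" for item in items}
text = [item_textual_id[item] for item in items]

# Current behaviour: the user ID concatenates the textual IDs of ALL items,
# including the held-out target B002WLWX82 that is predicted at test time.
user_id_original = " ".join(text)

# My modification: drop the last (target) item before building the user ID.
user_id_revised = " ".join(text[:-1])
```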

To investigate this, I modified line 557 in indexing.py to exclude the target item when creating the user ID: `text = " ".join(text[:-1])`. Unfortunately, I observed a performance drop on the Beauty dataset:

# original
2024-07-23 10:41:34,935 - root - INFO - hit@5: 0.06340547308173851
2024-07-23 10:41:34,935 - root - INFO - hit@10: 0.08316937935968521
2024-07-23 10:41:34,935 - root - INFO - ndcg@5: 0.04870970548871265
2024-07-23 10:41:34,936 - root - INFO - ndcg@10: 0.05508118366648071

# revised
2024-07-24 09:10:26,463 - root - INFO - hit@5: 0.04167411912001431
2024-07-24 09:10:26,463 - root - INFO - hit@10: 0.059202289393668395
2024-07-24 09:10:26,463 - root - INFO - ndcg@5: 0.030075878170261906
2024-07-24 09:10:26,463 - root - INFO - ndcg@10: 0.03570428946340885

If I have misunderstood any part of the implementation, please let me know. I look forward to your response so I can get a clearer understanding of the methodology.

chrisjtan commented 1 month ago

Hi @skleee,

Thanks for letting us know about this potential issue. From the linked code, it does appear that the textual ID of the last item is leaked when the user IDs are generated. If this is confirmed, we greatly appreciate your help in catching it, and we will update the results in the paper accordingly.

In the meantime, as noted in the paper, using user IDs is optional and does not significantly improve the results (see the attached Table 6 from the paper). Could you please use the method without user IDs for your experiments for now?

[Screenshot: Table 6 from the paper]

If your experiments are correct, it seems that using user IDs may even hurt the performance of the method, and solely using generated item IDs would give the best results. However, I need some time to double-check and correct the results accordingly. Once again, thank you very much for pointing this out.

skleee commented 1 month ago

Hi @chrisjtan,

Thank you for your quick response; I'll wait for the updated results.

By the way, if I want to train the model without using user IDs, do I have to use `prompt_no_user_id.txt` instead of `prompt.txt` as the `prompt_file` argument?

There seem to be some differences between `prompt_no_user_id.txt` and `prompt.txt`. The sequential prompt in `prompt_no_user_id.txt` sets {dataset} {target} as the target, and {dataset} is included in the prompt text as if it were a straightforward task rather than a sequential one.

sequential; seen; Considering {dataset} user has interacted with {dataset} items {history} . What is the next recommendation for the user ?; {dataset} {target}
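For reference, here is a minimal sketch of how I interpret that line; the parsing and the fill values are my own illustration, not the repository's actual data loader:

```python
# Illustration only: how I read a line from prompt_no_user_id.txt;
# the parsing and the fill values are assumptions, not the actual code.
line = ("sequential; seen; Considering {dataset} user has interacted with "
        "{dataset} items {history} . What is the next recommendation for the user ?; "
        "{dataset} {target}")

task, split, input_template, target_template = [field.strip() for field in line.split(";")]

# Example fill with made-up generated textual item IDs.
model_input = input_template.format(dataset="beauty",
                                    history="hydrating lip gloss , vitamin c serum")
model_target = target_template.format(dataset="beauty", target="rose hand cream")

print(model_input)   # Considering beauty user has interacted with beauty items ...
print(model_target)  # beauty rose hand cream
```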

I'm also curious if any additional configurations need to be changed. Thanks!

chrisjtan commented 4 weeks ago

Hi @skleee,

Thank you for your patience.

Unfortunately, after fixing the data leakage issue, I found that all the results dropped significantly, even when not using the user ID at all. I haven't yet figured out which implementation error caused the previous, unusually high results.

In any case, I have re-run all the experiments without the user ID; the initial results are shown below. If you want to compare against the paper as a baseline, please use these numbers instead.

I would still say the method may be the best overall among the baselines (significantly better than the other methods on Beauty and Toys, worse than P5_CID and P5_SemID on Sports, and worse than P5_SID on Yelp). However, the results are clearly much worse than originally reported, so we need to update not only the table but also some of the claims made in the original paper.

[Screenshot: updated results table without user IDs]

Regarding your question, I’ve updated the prompt file to remove user IDs. I haven’t tried tuning the training parameters yet, but it would be great if you could do so.

Thank you very much for helping us identify this error and preventing it from misleading other researchers.

skleee commented 3 weeks ago

Hi @chrisjtan,

Thanks for running the experiments and double-checking the performance without the user ID. I also appreciate your kind response in clearing up the errors and confusion :)