Yueeeeeeee / LlamaRec

[PGAI@CIKM 2023] PyTorch Implementation of LlamaRec: Two-Stage Recommendation using Large Language Models for Ranking

Reproducibility Challenges #1

Closed GarciaLnk closed 8 months ago

GarciaLnk commented 8 months ago

I've been working on replicating the findings from the LlamaRec paper, but I've encountered several challenges that I'd like to share. If necessary, I can split these into separate issues:

  1. Dependency Installation: The requirements.txt provided can't be used to create a conda environment, as it includes pip-only dependencies. While manually installing the latest versions is a workaround, it will not match the original development environment and may lead to issues in the future. Providing an environment.yml would be ideal.

  2. Baseline Code Absence: There's no provided code or documentation for the baseline comparisons. Given that different implementations can yield varying results, having access to the baseline code used would really help with reproducibility.

  3. Metrics Discrepancy: The metrics for the retriever are consistently lower than those reported in the paper, which in turn degrades the overall performance of LlamaRec. The discrepancies in the LRURec metrics can be reproduced on Colab with this notebook. I've also included a comparative table below for reference:

    | LRURec | ML-100k (Paper) | ML-100k (Obtained) | Beauty (Paper) | Beauty (Obtained) | Games (Paper) | Games (Obtained) |
    |--------|-----------------|--------------------|----------------|-------------------|---------------|------------------|
    | M@5    | 0.0390          | 0.0340             | 0.0376         | 0.0364            | 0.0533        | 0.0507           |
    | N@5    | 0.0468          | 0.0394             | 0.0435         | 0.0419            | 0.0640        | 0.0604           |
    | R@5    | 0.0705          | 0.0557             | 0.0614         | 0.0589            | 0.0966        | 0.0900           |
    | M@10   | 0.0491          | 0.0389             | 0.0417         | 0.0398            | 0.0598        | 0.0569           |
    | N@10   | 0.0705          | 0.0514             | 0.0533         | 0.0504            | 0.0800        | 0.0756           |
    | R@10   | 0.1426          | 0.0934             | 0.0916         | 0.0853            | 0.1463        | 0.1373           |

I'd appreciate any help to address these issues.

Yueeeeeeee commented 8 months ago

Thanks for your interest in our work! To reproduce our LRURec performance, please perform a hyperparameter search with weight decay in [0, 1e-2] and dropout rate in [0.1, 0.2, 0.3, 0.4, 0.5]; an example can be found at the bottom of here. As for the environment and baseline issues, I will organize our implementation and update the repo within a few weeks :)

Best, Zhenrui
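
A minimal sketch of that hyperparameter search, assuming the retriever is trained through a command-line entry point with plain argparse flags (the script name and flag names below are assumptions, not taken from the repo), could look like this:

    import itertools
    import subprocess

    # Grid over the suggested weight decay and dropout values; each run launches
    # the retriever training script once. Adjust the script name and flags to
    # whatever config.py actually exposes.
    weight_decays = [0, 1e-2]
    dropout_rates = [0.1, 0.2, 0.3, 0.4, 0.5]

    for wd, dr in itertools.product(weight_decays, dropout_rates):
        print(f"Training LRURec with weight_decay={wd}, dropout={dr}")
        subprocess.run(
            [
                "python", "train_retriever.py",  # assumed entry point
                "--weight_decay", str(wd),
                "--dropout", str(dr),
            ],
            check=True,
        )
    # Afterwards, keep the checkpoint with the highest validation Recall@20,
    # the selection metric mentioned later in this thread.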

elloza commented 7 months ago

Hello @Yueeeeeeee!

Congratulations on your work!

I was wondering the same thing about the implementations of the baseline methods used (NARM, SASRec and BERT4Rec).

Could you point us to any open implementations you have used and the hyperparameters chosen?

Thank you very much for the help!

KpiHang commented 4 months ago

I followed this approach: https://github.com/Yueeeeeeee/LlamaRec/issues/1#issuecomment-1949472178, and the retrieval model LRURec achieved good results.

But when using this LRURec to train the Ranker, the Ranker's performance is not good. Here are my results on the Beauty dataset:

{
    "test_Recall@10": 0.08942324860472203,
    "test_MRR@10": 0.0392304892349303,
    "test_NDCG@10": 0.050945015045320945,
    "test_Recall@5": 0.060988716312725656,
    "test_MRR@5": 0.03545584686710355,
    "test_NDCG@5": 0.041770513622367465,
    "test_Recall@1": 0.021672936315532656,
    "test_MRR@1": 0.021672936315532656,
    "test_NDCG@1": 0.021672936315532656,
}

Results in the paper:

[image: results table from the paper]

I'd appreciate any help in addressing this gap.

@Yueeeeeeee

Yueeeeeeee commented 4 months ago


Could you share the config you used to train on the Beauty dataset? Could you also share the LRURec config and results you used so I can reproduce the issue? Thanks!

voyage-ing commented 3 months ago

I have the same question.

When using LlamaRec (LRURec + Ranker), the performance after training the Ranker is lower than that of LRURec alone. Why does this happen?

I just used the configuration provided in config.py. Could you please share the implementation details from your paper?

Thank you very much. @Yueeeeeeee
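
As a side note, one quick sanity check when the Ranker lands below the retriever: the Ranker only reorders the retriever's candidate list (a small list per user, e.g. the top 20 from LRURec), so its Recall@k cannot exceed the share of users whose ground-truth item is in that list. A minimal sketch of that check, with illustrative variable names only:

    # `retrieved`: user -> list of candidate item ids (e.g. LRURec top 20)
    # `targets`:   user -> ground-truth next item
    def ranker_recall_upper_bound(retrieved: dict, targets: dict) -> float:
        hits = sum(1 for user, candidates in retrieved.items()
                   if targets[user] in candidates)
        return hits / len(retrieved)

    # If the Ranker's Recall@10 sits far below this bound, the loss comes from
    # the LLM ranking step; if the bound itself is low, the retriever
    # checkpoint (and its hyperparameters) is the first thing to revisit.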

Yueeeeeeee commented 3 months ago

I have reproduced the numbers on ML-100k and am working on the Beauty dataset. I will share a new script once the experiments are done. Thanks again for your interest in our work!

Yueeeeeeee commented 3 months ago

Hi, just using the default config, I was able to achieve performance comparable to the paper:

{
    "test_Recall@10": 0.09708042129722096,
    "test_MRR@10": 0.04098382929298969,
    "test_NDCG@10": 0.05406365780212878,
    "test_Recall@5": 0.06528747864848244,
    "test_MRR@5": 0.0367693631704106,
    "test_NDCG@5": 0.04381075943432336,
}

If you still cannot get a comparable performance, I can also update an improved negative sampling strategy that further improves the ranking performance:)

KpiHang commented 3 months ago

@Yueeeeeeee May I ask what hyperparameters you are using for LRURec? weight decay = ? dropout = ?

Thank you.

Yueeeeeeee commented 3 months ago

Sure, I select the model with the highest Recall@20; in my case that's weight decay 0.01 and dropout 0.5 :)

KpiHang commented 3 months ago

This is different from the default configuration you provided:

https://github.com/Yueeeeeeee/LlamaRec/blob/48b288b23197b57d564955753c69ac2baa4c1dad/config.py#L48-L49

Is that right? And what about args.rerank_best_metric?

    args.best_metric = 'Recall@20'
    args.rerank_best_metric = 'NDCG@10'

It seems that you did not mention the negative sampling strategy in the paper. Is it a full ranking (i.e., no negative sampling)?

I really need this, thank you very much! My email: hangkees@aliyun.com https://github.com/Yueeeeeeee/LlamaRec/issues/1#issuecomment-2179404731

Yueeeeeeee commented 3 months ago

Thanks for the response,

In my case, the model with the best Recall@10 and the best Recall@20 is the same one, with weight decay 0.01 and dropout 0.5. I am testing the popularity-based sampling and will upload it to the repo in the next few days :)
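
The thread does not spell out what this popularity-based sampling looks like; a generic sketch of popularity-weighted negative sampling for building ranking candidates (function and variable names here are illustrative, not from the repo) might be:

    import numpy as np

    def sample_popularity_negatives(item_counts, positives, num_negatives, rng=None):
        """Sample negatives with probability proportional to interaction count,
        skipping the user's positive items (illustrative sketch only)."""
        rng = rng or np.random.default_rng()
        items = np.array(list(item_counts.keys()))
        probs = np.array([item_counts[i] for i in items], dtype=float)
        probs /= probs.sum()

        negatives = []
        while len(negatives) < num_negatives:
            candidate = int(rng.choice(items, p=probs))
            if candidate not in positives and candidate not in negatives:
                negatives.append(candidate)
        return negatives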