HestiaSky / E4SRec


Details of experiments and Reproducibility issue #7

Open PALIN2018 opened 6 months ago

PALIN2018 commented 6 months ago

Hi, your work is very interesting and has been a great source of inspiration to me. As I attempt to replicate the results presented in your paper, I have the following questions:

  1. In your paper, you mention performing one epoch of instruction tuning on LLaMA2-13B, but the instruction dataset used has not been released. However, in your experimental script, the base model is listed as garage-bAInd/Platypus2-70B-instruct, which is confusing. Which base model did you actually use for the experiments reported in your paper? If it was LLaMA2-13B, could you please release the instruction dataset? If it was garage-bAInd/Platypus2-70B-instruct, then my results on the Beauty dataset are significantly lower than those reported in your paper (see Table 6), as shown in the attached screenshot:
[Screenshot of reproduced results: Snipaste_2024-01-12_19-21-37]
  2. Regarding dataset processing, you state in your paper that “the maximum sequence length is set to 50 for all models on all datasets.” However, upon reviewing your code, I found that the length of user interaction records is not actually limited to 50; some sequences in the training data exceed 100 interactions. Could this inconsistency in maximum sequence length lead to unfair comparisons? (See the sketch after this list for the truncation step I expected.)
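
For reference, here is the kind of truncation step I expected to find before the training samples are built. This is only a minimal sketch: `truncate_history`, `max_seq_len`, and the example data are my own illustration, not code from this repository.

```python
# Sketch of SASRec-style history truncation (my own illustration,
# not taken from the E4SRec codebase).
max_seq_len = 50  # the limit stated in the paper

def truncate_history(item_ids: list[int], max_len: int = max_seq_len) -> list[int]:
    # Keep only the most recent `max_len` interactions so every
    # training sequence respects the stated maximum length.
    return item_ids[-max_len:]

# Example: a user with 120 interactions is cut down to the last 50.
history = list(range(1, 121))
assert truncate_history(history) == list(range(71, 121))
```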