chenchongthu / ENSFM

This is our implementation of ENSFM: Efficient Non-Sampling Factorization Machines (WWW 2020)

Experimental setup for comparison with your work #3

Open tommasocarraro opened 3 years ago

tommasocarraro commented 3 years ago

Dear researchers, I'm Tommaso Carraro and I'm working at context-aware recommender systems for one year. I read your paper and it is so interesting. However, I would try to reproduce your experiments and I have some questions:

  1. In 4.1.3 you say that for Last.fm and MovieLens the latest transaction of each user is held out for testing and the remaining data is treated as the training set. However, it seems that you applied this split correctly for MovieLens but not for Last.fm: test.csv contains 12265 rows, while Last.fm has only 1000 users, as reported in Table 2, so according to the described procedure test.csv should contain only 1000 interactions (a sketch of the split as I understand it follows this list). I know you took the same dataset from the CFM GitHub, and it seems you also describe the evaluation procedure as in the CFM paper. In the past I asked the CFM researchers the same questions, but I never received an answer. Moreover, since MovieLens is not available in the CFM GitHub, you applied the splitting procedure described in the CFM paper in order to reproduce their experiments, which I think is the right procedure. So my question is: why is there this inconsistency between the two datasets? Why did you not use the same procedure for Last.fm as well?

  2. In 4.1.3 you say you used the leave-one-out evaluation protocol. Let us define m as the number of ratings of a user. By leave-one-out, do you mean that for each test item you feed the user's m-1 training interactions to the network and compute the metrics based on the position of the test item in the recommended list?
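For reference, this is a minimal sketch of the per-user split described in 4.1.3 as I understand it; the file and column names (`ratings.csv`, `user`, `timestamp`) are my own assumptions, not taken from your repository.

```python
import pandas as pd

# Hypothetical file and column names; the released files use index-encoded features instead.
df = pd.read_csv("ratings.csv")  # assumed columns: user, item, timestamp

# Hold out the latest transaction of each user for testing (leave-one-out by time).
latest = df.sort_values("timestamp").groupby("user").tail(1)
train = df.drop(latest.index)
test = latest

# Under this procedure the test set should contain exactly one row per user.
print(len(test))
```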

Thank you very much, Tommaso Carraro

P.S. I also tried to reproduce the CFM experiments using their code, but the loss went to NaN after about 15 epochs of training.

chenchongthu commented 3 years ago

Hi, thanks for your interest in our work!

For the first question, as you mentioned, we used the same Last.fm dataset as the CFM paper for an objective comparison. The structure of the Last.fm dataset is: user context, item context, rating, timestamp. The user context is described by the user ID and the ID of the last music the user listened to within 90 minutes. The item context includes the music ID and the artist ID. It seems that the authors of CFM treated all the music a user listened to within 90 minutes as different user contexts, so there are 12265 user contexts for the 1000 users in test.csv. The reason we did not use the same procedure for Last.fm as for MovieLens is that we wanted to make a fair comparison with CFM.
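As a rough sanity check on the 12265 figure, something along these lines could count the distinct user contexts in test.csv; the column layout here is an assumption on my side, based on the structure described above, not the actual file format.

```python
import pandas as pd

# Assumed layout of test.csv: user ID, last-listened music ID (within 90 minutes),
# music ID, artist ID, rating, timestamp. Adjust if the real file differs.
test = pd.read_csv("test.csv", header=None,
                   names=["user", "last_music", "music", "artist", "rating", "timestamp"])

# Each (user, last_music) pair is a distinct user context.
n_contexts = test[["user", "last_music"]].drop_duplicates().shape[0]
n_users = test["user"].nunique()
print(n_contexts, n_users)  # expected: about 12265 contexts over 1000 users
```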

For the second question, we also used the same evaluation procedure as in the CFM paper. After training the model, for each user context (12265 user contexts for Last.fm), the metrics are computed based on the position of the test item context, and the final metrics are the averages over all user contexts. You can also refer to our code.
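In case it helps, here is a rough sketch of how this position-based evaluation works in general (not the actual evaluation code in this repository): for each test user context, the target item's rank in the scored candidate list gives HR@K and NDCG@K, and the results are averaged over all contexts.

```python
import numpy as np

def evaluate_context(scores, target_item, k=10):
    """scores: predicted score for every candidate item under one user context;
    target_item: index of the held-out test item."""
    # Rank of the target item (0 = top) among all candidates, highest score first.
    rank = int(np.sum(scores > scores[target_item]))
    hr = 1.0 if rank < k else 0.0
    ndcg = 1.0 / np.log2(rank + 2) if rank < k else 0.0
    return hr, ndcg

# Final metrics: average HR@K and NDCG@K over all test user contexts
# (12265 contexts for Last.fm under the split discussed above).
```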

tommasocarraro commented 3 years ago

Hi, thank you very much for your answer!

Could you explain the procedure you used for the MovieLens dataset? I ask because you said you used different procedures for Last.fm and MovieLens, but in the paper you say you used the same one. Did you take one test interaction for each user context, or one test interaction for each user independently of the context?

Thank you.

tommasocarraro commented 3 years ago

Hi, I have another important question. In the paper you say you used Hit Ratio (hr@k) and NDCG (ndcg@k) as evaluation metrics. However, in your code you compute recall@k and ndcg@k. Specifically, you use the code available in VAE_CF to compute these metrics, and that code computes recall@k, not hr@k. In fact, hit ratio is 1 if the target item is in the top-k recommended list and 0 otherwise. The code available in CFM computes hr@k correctly.

Did you report recall or hit ratio in Table 3?

Thank you!

chenchongthu commented 3 years ago

Hi, thanks for your question!

We use the leave-one-out evaluation protocol, and under this setting recall@k is equal to hr@k. Recall is the fraction of the target items that are successfully retrieved, so it is also 1 if the target item is in the top-k recommended list, since there is only one target item in the test set.
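A tiny illustration of that equivalence (my own sketch, not code from either repository): with a single relevant item per test case, recall@k and hr@k compute the same value.

```python
def recall_at_k(top_k, relevant_items):
    # Fraction of the relevant items that appear in the top-k list.
    hits = len(set(top_k) & set(relevant_items))
    return hits / len(relevant_items)

def hr_at_k(top_k, target_item):
    # 1 if the single target item is in the top-k list, 0 otherwise.
    return 1.0 if target_item in top_k else 0.0

top_k = [42, 7, 13]
target = 7
# With exactly one relevant item, the two metrics always coincide (1.0 here, 0.0 if the
# target were missing from the list).
assert recall_at_k(top_k, [target]) == hr_at_k(top_k, target) == 1.0
```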

tommasocarraro commented 3 years ago

Hi, thank you for your answer, you are correct.

Could I ask another question? Could you kindly provide the datasets without preprocessing? I mean that I need your exact split, but instead of the encoded features I would like the raw, non-preprocessed context fields. For example, for Frappe I would like the following structure:

0 0 1 morning sunday weekend unknown free sunny United States 0

In fact, I'm not able to derive these fields from the features listed in the CSV files. I ask because the model I'm comparing with yours is really different and requires very different pre-processing; for example, it does not require the features. Thank you!
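For what it's worth, the preprocessing I'm referring to is, I assume, the mapping of each raw field value to a global feature index, roughly along these lines; this is only my own sketch over the Frappe-style row above, not the actual script used to produce the released files.

```python
# Hypothetical sketch of turning raw context fields into feature indices,
# i.e. the kind of preprocessing the released CSV files appear to have gone through.
raw_row = ["0", "0", "1", "morning", "sunday", "weekend", "unknown",
           "free", "sunny", "United States", "0"]

feature_index = {}  # maps (field position, raw value) -> global feature id

def encode(row):
    encoded = []
    for field, value in enumerate(row):
        key = (field, value)
        if key not in feature_index:
            feature_index[key] = len(feature_index)
        encoded.append(feature_index[key])
    return encoded

print(encode(raw_row))  # e.g. [0, 1, 2, ..., 10] for the first row processed
```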

chenchongthu commented 3 years ago

In fact, we downloaded the processed Frappe and Last.fm datasets directly from the CFM GitHub, so we also don't have the datasets without preprocessing. Maybe you can ask the authors of CFM for help.

tommasocarraro commented 3 years ago

Thank you very much for the fast reply; I'm getting in touch with the CFM authors.

However, I think you could provide me with the MovieLens dataset. I'm sure you have it already split but not preprocessed.

Could you provide me with this dataset?

Thank you!

chenchongthu commented 3 years ago

Of course! What's your email address?

tommasocarraro commented 3 years ago

Thank you very much!

E-mail: tommasocarraro96@gmail.com

Could you provide me with some information regarding the pre-processing you used? You can write this information directly in the e-mail.

Thank you!

chenchongthu commented 3 years ago

Ok, we have sent the email. If you have any questions, please feel free to ask me.