Z-Y-Zhang / one_epoch_phenomenon

This is the implementation of `Towards Understanding the Overfitting Phenomenon of Deep Click-Through Rate Models`, which was accepted at CIKM 2022. The code will be released soon.

Some question regarding data preprocess #2

Open zzh1024 opened 1 year ago

zzh1024 commented 1 year ago

Hi Zhao-Yu, Congrats on the nice work! I am Zihan Zhao from Snapchat and want to reproduce your paper using your codebase. I generated the train/test data with prepare_data.sh, but the generated data does not match what the paper reports.

A few questions regarding the code:

  1. When we split by user, why do we consume local_test instead of local_train? https://github.com/Z-Y-Zhang/one_epoch_phenomenon/blob/main/preprocess/split_by_user.py#L3 As a result, each user would appear only twice, right?
  2. When we prepare the train and test sets, why do we split them by user? https://github.com/Z-Y-Zhang/one_epoch_phenomenon/blob/main/preprocess/split_by_user.py#L13 In our production setup, we usually train on days 1 to N and evaluate on days N+1 and N+2. This is also how the paper runs its experiment on Taobao's prod traffic, right?
  3. In book_gen_data.py https://github.com/Z-Y-Zhang/one_epoch_phenomenon/blob/main/preprocess/book_gen_data.py#L36-L37, why don't we also map neg_item and neg_cate, and why do we assign them "0"?
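To make the contrast in question 2 concrete, here is a minimal sketch of the two splitting strategies. It is not the repository's code: the record layout (dicts with hypothetical `user` and `day` keys), the function names, and the `test_ratio` parameter are all assumptions made for illustration.

```python
import random


def split_by_user(samples, test_ratio=0.1, seed=0):
    """Assign every sample of a given user wholly to train or to test.

    `samples` is a list of dicts with a hypothetical "user" key.
    """
    rng = random.Random(seed)
    users = sorted({s["user"] for s in samples})
    # Each user is sent to the test side with probability test_ratio.
    test_users = {u for u in users if rng.random() < test_ratio}
    train = [s for s in samples if s["user"] not in test_users]
    test = [s for s in samples if s["user"] in test_users]
    return train, test


def split_by_day(samples, last_train_day):
    """Train on days <= last_train_day and evaluate on all later days.

    `samples` is a list of dicts with a hypothetical integer "day" key.
    """
    train = [s for s in samples if s["day"] <= last_train_day]
    test = [s for s in samples if s["day"] > last_train_day]
    return train, test
```

With a by-user split the train and test sets never share a user, whereas with a by-day split the same user can appear on both sides but the evaluation data is always later in time.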

Thanks in advance!

My run output:

('n user:', 3776, ' n item:', 100305, 'n cate:', 946)
('all_id number:', 105027)
('all_id number:', 105029)
all_id_number: 105029

python script/cal_occurrence.py

('\n', 'uid', ' id numbers:', 3776, ' entropy:', 11.88264304936195, ' mean:', 2.0, ' std:', 0.0)
7552.0
('\n', 'item id', ' id numbers:', 7011, ' entropy:', 12.724103254159509, ' mean:', 1.0771644558550848, ' std:', 0.3484493083151777)
7552.0
('\n', 'item cate', ' id numbers:', 250, ' entropy:', 1.95246812495099, ' mean:', 30.208, ' std:', 364.70429218203617)

Paper claimed: (screenshot: Screen Shot 2022-11-02 at 11 41 37 PM)

Z-Y-Zhang commented 1 year ago

Hi Zihan, Sorry for the inconvenience. We followed the code provided by DIEN (https://github.com/mouna99/dien) and mainly focused on the model implementation. I will rerun the code carefully and reply within three days.

zzh1024 commented 1 year ago

Thank you so much!


Z-Y-Zhang commented 1 year ago

Hi Zihan,

I checked the details and re-ran the code, and I can reproduce the results in the paper. Our code was developed with Python 2.7; I don't know whether this caused your replication failure.

As for your questions, here is my understanding:

  1. I think the implementation of "split_by_user" is correct. Here is the data processing method from DIEN (https://github.com/mouna99/dien): "we regard reviews as behaviors, and sort the reviews from one user by time. Assuming there are T behaviors of user u, our purpose is to use the T-1 behaviors to predict whether user u will write reviews that shown in T-th review." For one sample, "local_test" is the T-th review (the label) and "local_train" is the first T-1 behaviors (the history sequence).
  2. Yes, the experiment on the Amazon book dataset differs somewhat from production practice. I think the reason is that this dataset has no negative samples and is small, so previous researchers organized the CTR prediction task this way. For the experiments on our production dataset, as you said, we train on days 1 to N and evaluate on day N+1, which is consistent with CTR prediction practice in industry.
  3. Because we do not use the negative history sequence. We are trying to reach a more general conclusion, so we simplified the model structure. Besides, I believe the experimental results would be similar with negative samples.
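The leave-last-out construction quoted in point 1 can be sketched as follows. This is only an illustration of the idea, not the DIEN or repository preprocessing: the function name `make_sample`, the `(timestamp, item)` tuple layout, and the output keys are hypothetical.

```python
def make_sample(user, reviews):
    """Build one sample from a user's reviews.

    `reviews` is a list of (timestamp, item) tuples (hypothetical layout).
    The first T-1 reviews in time order become the history sequence and
    the T-th review becomes the prediction target.
    """
    ordered = sorted(reviews, key=lambda r: r[0])  # sort by timestamp
    if len(ordered) < 2:
        return None  # no history available to predict from
    history = [item for _, item in ordered[:-1]]
    target = ordered[-1][1]
    return {"user": user, "history": history, "target": target}
```

For example, a user with reviews of items `a`, `b`, `c` at times 1, 2, 3 yields the history `["a", "b"]` and the target `"c"`, so each user contributes a single target behavior.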

My run results (because of the random seed, they may differ slightly from the results in the paper, but this does not affect the one-epoch phenomenon):

('\n', 'uid', ' id numbers:', 75008, ' entropy:', 16.194756854400666, ' mean:', 2.0, ' std:', 0.0)
150016.0
('\n', 'item id', ' id numbers:', 85356, ' entropy:', 15.873252717935602, ' mean:', 1.7575331552556352, ' std:', 2.359635418368589)
150016.0
('\n', 'item cate', ' id numbers:', 969, ' entropy:', 2.0539117856807487, ' mean:', 154.81527347781218, ' std:', 3663.1153233231494)
75008.0
('\n', 'label', ' id numbers:', 1, ' entropy:', 0.0, ' mean:', 75008.0, ' std:', 0.0)
6715494.0
('\n', 'hist_item', ' id numbers:', 347025, ' entropy:', 17.144991642269154, ' mean:', 19.351614437000215, ' std:', 45.46973538029079)
6715494.0
('\n', 'hist_cate', ' id numbers:', 1573, ' entropy:', 1.735195107055087, ' mean:', 4269.2269548633185, ' std:', 136465.78509802683)
13956044.0
('\n', 'all', ' id numbers:', 426033, ' entropy:', 10.586875649846766, ' mean:', 32.75812906511937, ' std:', 8471.796803175563)
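For readers comparing these numbers, the following is a rough sketch of how per-feature occurrence statistics of this kind could be computed. It is not the repository's `cal_occurrence.py`: the function name is hypothetical, it is written for Python 3 rather than the Python 2.7 used in the thread, and the base-2 entropy and population standard deviation are assumptions. The base-2 choice is at least consistent with the uid row, where 75008 ids that each occur twice give an entropy of about log2(75008) ≈ 16.19.

```python
import math
from collections import Counter


def occurrence_stats(ids):
    """Summarize how often each distinct id occurs in `ids`.

    Assumptions: entropy is in bits (log base 2) over the empirical id
    distribution, and std is the population standard deviation of the
    per-id occurrence counts.
    """
    counts = Counter(ids)
    if not counts:
        raise ValueError("ids must not be empty")
    total = sum(counts.values())
    n_ids = len(counts)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    mean = total / n_ids
    std = math.sqrt(sum((c - mean) ** 2 for c in counts.values()) / n_ids)
    return {"n_ids": n_ids, "total": total, "entropy": entropy,
            "mean": mean, "std": std}
```

Under these assumptions, a feature whose ids each appear exactly twice always has mean 2.0 and std 0.0, which matches the uid rows reported in both runs.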

Feel free to contact me if you have any questions.

Best, Zhao-Yu

zzh1024 commented 1 year ago

"we regard reviews as behaviors, and sort the reviews from one user by time. Assuming there are T behaviors of user u, our purpose is to use the T-1 behaviors to predict whether user u will write reviews that shown in T-th review." For one sample, "local_test" is the T-th review (label) and "local_train" is the T-1 behaviors (history sequence).

This is really the part that confuses me. If the code did what you intended, we should build the training set (local_train_splitByUser, T-1 actions) from local_train and the test/validation set (local_eval_splitByUser) from local_test (T actions), instead of building both local_train_splitByUser and local_eval_splitByUser from local_test.

I am re-running `sh prepare_data.sh`. Hopefully I get lucky this time.

Regarding the environment: I would suggest providing a recommended install command for Windows, Mac, or Linux. Here is my environment setup on GCP:

git clone https://github.com/Z-Y-Zhang/one_epoch_phenomenon.git
conda create -n python2 python=2.7
conda install tensorflow=1.4
conda install -c anaconda scikit-learn
conda install pandas

zzh1024 commented 1 year ago

Just reran:

sh prepare_data.sh

('all ids of train set:', 912965)
('n user:', 74944, '  n item:', 349364, 'n cate:', 1576)
('all_id number:', 425884)
('all_id number:', 425886)
all_id_number: 425886
train
test

The stats align with your results.