Open mOmUcf opened 5 years ago
Hi, thanks for your interest. I may have written the wrong number, but I cannot remember it precisely. It is also possible that we obtain different thresholds under different pre-processing strategies. Generally, both 10 and 20 are good enough to filter out noise, so I think 10 is OK in your setting.
If the exact number matters, you can check the minimum occurrence of features in my processed Avazu dataset (linked in the README), in which the low-frequency categories have already been dropped.
If the number is verified to be 10, I will update the paper on arXiv. Thanks!
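To illustrate how the pre-processing strategy can change the effective threshold, here is a minimal sketch on toy data. It assumes one hypothetical difference: counting occurrences on the full dataset versus on the training split only. The helper `kept_values` and the toy column are made up for illustration, and this is not necessarily what either of us actually did.

```python
import numpy as np
import pandas as pd

def kept_values(series: pd.Series, threshold: int) -> set:
    """Return the values that occur at least `threshold` times."""
    counts = series.value_counts()
    return set(counts[counts >= threshold].index)

# Toy column (hypothetical): 'a' x30, 'b' x22, 'c' x12.
full = pd.Series(["a"] * 30 + ["b"] * 22 + ["c"] * 12)

# Deterministic 4:1 split for the sketch: every fifth row goes to test.
is_test = np.arange(len(full)) % 5 == 4
train = full[~is_test]

# Same threshold, different counting scope, different surviving categories.
print(kept_values(full, 20))   # counts taken over the full data
print(kept_values(train, 20))  # counts taken over the training split only
```

With counts taken on the training split only, 'b' falls below 20 and is dropped, so the same nominal threshold keeps fewer categories.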
Sorry for the late reply. Here are the code and output from using your data interface: https://github.com/Atomu2014/Ads-RecSys-Datasets
import numpy as np
import pandas as pd
from datasets import Avazu
ava = Avazu()
ava.load_data('train')
ava.load_data('test')
df_avazu = pd.DataFrame(np.vstack([ava.X_train, ava.X_test]), columns=ava.feat_names)
for field in ava.feat_names:
    field_cnt = field + '_cnt'
    gbdf = df_avazu.groupby(field).size().reset_index().rename(columns={0: field_cnt})
    min_freq = gbdf.sort_values(field_cnt)[field_cnt].values[0]
    print(f"{field}'s minimum feature frequence is {min_freq}")
The output is as follows (omitting the data-loading information):
C1's minimum feature frequence is 5787
banner_pos's minimum feature frequence is 2035
site_id's minimum feature frequence is 10
site_domain's minimum feature frequence is 10
site_category's minimum feature frequence is 10
app_id's minimum feature frequence is 10
app_domain's minimum feature frequence is 10
app_category's minimum feature frequence is 16
device_id's minimum feature frequence is 10
device_ip's minimum feature frequence is 10
device_model's minimum feature frequence is 10
device_type's minimum feature frequence is 31
device_conn_type's minimum feature frequence is 42890
C14's minimum feature frequence is 10
C15's minimum feature frequence is 1621
C16's minimum feature frequence is 1621
C17's minimum feature frequence is 12
C18's minimum feature frequence is 2719623
C19's minimum feature frequence is 10
C20's minimum feature frequence is 23
C21's minimum feature frequence is 497
mday's minimum feature frequence is 3225010
hour's minimum feature frequence is 818771
wday's minimum feature frequence is 3225010
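In the output above, the smallest value across all fields is 10, and no field goes below it. The same check can be sketched more compactly with `value_counts`. The toy frame below only stands in for the processed data, and `min_category_frequency` is a hypothetical helper name, not part of the dataset interface.

```python
import pandas as pd

def min_category_frequency(df: pd.DataFrame) -> pd.Series:
    """For each column, return the count of its rarest value."""
    return df.apply(lambda col: col.value_counts().min())

# Toy stand-in for the processed dataset (values are made up).
toy = pd.DataFrame({
    "site_id": ["a"] * 10 + ["b"] * 15,
    "C17": ["p"] * 12 + ["q"] * 13,
})

per_field = min_category_frequency(toy)
overall_min = int(per_field.min())  # smallest frequency across all fields
print(per_field)
print("overall minimum:", overall_min)
```

If rare categories were dropped with a threshold t, the overall minimum should be at least t, so an overall minimum of 10 rules out a threshold of 20.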
Thanks a lot! I will fix it later!
'Product-based Neural Networks for User Response Prediction over Multi-field Categorical Data' (TOIS'17)
Section 5.1.1 (Datasets) of this paper says: "We randomly split the public dataset into training and test sets at 4:1, and remove categories appearing less than 20 times to reduce dimensionality." However, when I preprocessed the raw Avazu dataset myself, I found that to get #categories = 6*10^5 on Avazu, the threshold needs to be 10, not 20. With a threshold of 20, I get #categories < 4*10^5.
Is the threshold number in Section 5.1.1 incorrect?
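For reference, this is roughly how the category count can be compared under the two thresholds. It is only a sketch on toy data: the helper `count_kept_categories` is a hypothetical name, the values are made up, and it assumes rare categories are simply dropped rather than merged into a shared "other" category per field.

```python
import pandas as pd

def count_kept_categories(df: pd.DataFrame, threshold: int) -> int:
    """Count (field, value) pairs that occur at least `threshold` times."""
    kept = 0
    for field in df.columns:
        counts = df[field].value_counts()
        # "Appearing less than `threshold` times" is removed, so keep >= threshold.
        kept += int((counts >= threshold).sum())
    return kept

# Toy stand-in for the raw data; the real Avazu columns are not loaded here.
toy = pd.DataFrame({
    "site_id": ["a"] * 25 + ["b"] * 12 + ["c"] * 3,
    "app_id": ["x"] * 20 + ["y"] * 10 + ["z"] * 10,
})

for threshold in (10, 20):
    print(threshold, count_kept_categories(toy, threshold))
```

Running the same function on the raw data with thresholds 10 and 20 gives the two #categories values to compare against the number reported in the paper.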