Open Sandy4321 opened 4 years ago
HDF5 is a compressed file format, so you should check the number of examples instead of the file size. I have shared all the baselines compared in my papers; see https://github.com/Atomu2014/product-nets and https://github.com/Atomu2014/product-nets-distributed
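For anyone who wants to check the number of examples rather than the file size, here is a minimal sketch using `h5py`. It walks every dataset in the file and reports the length of its first axis. The function name `count_examples` is hypothetical, and the sketch assumes nothing about how the shared files name their datasets; it also assumes that the first axis of each dataset indexes examples.

```python
import h5py


def count_examples(path):
    """Return {dataset_name: rows along the first axis} for an HDF5 file."""
    counts = {}

    def visit(name, obj):
        # Groups are skipped; scalar or empty-dataspace datasets have no rows.
        if isinstance(obj, h5py.Dataset) and obj.shape:
            counts[name] = obj.shape[0]

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return counts
```

Calling `count_examples("some_file.h5")` (the path is a placeholder) returns a dictionary that can be compared against the example counts reported for the dataset.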
Great, thanks a lot. But I am looking for a really simple Python baseline without complicated packages such as TF. Do you have one, or do you know somebody who has? Performance is not important; I am just trying to learn from the very beginning.
Hi, I suggest you try these packages: xgboost > libfm > libffm. Search for them on the Internet and find the official guides. These packages are easy to try since you don't need to touch the model; the only thing you need to do is prepare the data and call the API / CLI.
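To illustrate what "preparing the data" can look like, here is a rough sketch that turns one example into a line of libffm's text format, `label field:feature:value`. The helper names (`hash_index`, `to_libffm_line`), the hashing scheme, and the bin count are all assumptions made for illustration; they are not taken from the shared baselines.

```python
import hashlib


def hash_index(field, value, n_bins=2 ** 20):
    """Map a (field, value) pair to a stable bucket in [0, n_bins)."""
    digest = hashlib.md5(f"{field}={value}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_bins


def to_libffm_line(label, num_values, cat_values, n_bins=2 ** 20):
    """Format one example as 'label field:feature:value ...'."""
    parts = [str(label)]
    # Numerical fields: one feature per field, keeping the raw value.
    for field_id, value in enumerate(num_values):
        if value is None or value == "":
            continue  # missing value: emit nothing for this field
        parts.append(f"{field_id}:{field_id}:{value}")
    # Categorical fields: hashed one-hot features, shifted past the
    # indices reserved for the numerical features.
    offset = len(num_values)
    for j, value in enumerate(cat_values):
        if value is None or value == "":
            continue
        field_id = offset + j
        feature_id = offset + hash_index(field_id, value, n_bins)
        parts.append(f"{field_id}:{feature_id}:1")
    return " ".join(parts)
```

Writing one such line per example to a text file gives an input that the libffm command-line tools can read; the official guide describes the training options.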
Great, so where can I get the preprocessed Criteo dataset? Per the description: "The original dataset is known as Criteo 1TB click log, in which CriteoLab has collected 30 days of masked data. We only know there are 13 numerical and 26 categorical features, and there is no feature description released. Thus we name these features as num_0 ... num_12, and cat_0 ... cat_25."
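As a small illustration of that naming scheme, the sketch below maps one raw log line to the names num_0 ... num_12 and cat_0 ... cat_25. It assumes the raw line is tab-separated with the label first, then the 13 numerical and 26 categorical values, and that missing values are empty strings; `parse_line` is a hypothetical helper, not part of the released code.

```python
NUM_COLS = [f"num_{i}" for i in range(13)]
CAT_COLS = [f"cat_{i}" for i in range(26)]


def parse_line(line):
    """Split one tab-separated line into a dict keyed by feature name."""
    # Strip only the newline: trailing tabs mark empty (missing) fields.
    fields = line.rstrip("\n").split("\t")
    expected = 1 + len(NUM_COLS) + len(CAT_COLS)
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    row = {"label": int(fields[0])}
    for name, raw in zip(NUM_COLS, fields[1:14]):
        row[name] = int(raw) if raw else None  # empty string means missing
    for name, raw in zip(CAT_COLS, fields[14:]):
        row[name] = raw or None
    return row
```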
Hi, there are 2 download links in the "Download" section of README. The processed dataset only contains 8 days' logs.
The zipped iPinYou data at https://drive.google.com/drive/folders/1thXezQbmuS6Q8-AXmrhB0tLM3mybJxVR?usp=sharing is 249 MB, and 1.5 GB unzipped,
but https://github.com/wnzhang/make-ipinyou-data states that
"After the program finished, the total size of the folder will be 14G."
So is the data at https://drive.google.com/drive/folders/1thXezQbmuS6Q8-AXmrhB0tLM3mybJxVR?usp=sharing this small because it is in HDF5, or is some clarification needed?
I understand that removing the user-tag feature, because of leakage problems, also reduced the file size somewhat.
Could you please share some baseline code to try this data? Then everything will be clear.