ipinyou data size and base line code

Atomu2014 / Ads-RecSys-Datasets

This repository collects some datasets for Ads & RecSys use, and provides easy-to-use HDF5 iterative access.


ipinyou data size and base line code #8

Open Sandy4321 opened 4 years ago

Sandy4321 commented 4 years ago

The zipped iPinYou data is 249 MB, and 1.5 GB unzipped, in https://drive.google.com/drive/folders/1thXezQbmuS6Q8-AXmrhB0tLM3mybJxVR?usp=sharing

But https://github.com/wnzhang/make-ipinyou-data states:

After the program finished, the total size of the folder will be 14G.

So is the HDF5 data in https://drive.google.com/drive/folders/1thXezQbmuS6Q8-AXmrhB0tLM3mybJxVR?usp=sharing simply that much smaller, or is some clarification needed?

I understand that removing the user-tag feature (because of leakage concerns) also reduces the file size somewhat.

Could you please share some baseline code for trying this data? Then everything will be clear.

Atomu2014 commented 4 years ago

HDF5 is a compressed file format, so you should check the number of examples instead of the file size. I have shared all the baselines compared in my papers; see https://github.com/Atomu2014/product-nets and https://github.com/Atomu2014/product-nets-distributed
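Counting examples rather than bytes could look roughly like the sketch below. It assumes the third-party `h5py` package is installed; the helper name `dataset_shapes` and the example file name are hypothetical, and the dataset names inside the shared files are not stated in this thread, so the function simply lists whatever it finds.

```python
def dataset_shapes(path):
    """Return {dataset_name: shape} for every dataset in an HDF5 file."""
    # Assumption: h5py is installed (pip install h5py). It is imported here
    # so the helper can be pasted into a script without other changes.
    import h5py

    shapes = {}

    def record(name, obj):
        # visititems calls this for every group and dataset in the file.
        # Returning None (implicitly) keeps the traversal going.
        if isinstance(obj, h5py.Dataset):
            shapes[name] = obj.shape

    with h5py.File(path, "r") as f:
        f.visititems(record)
    return shapes


# Hypothetical usage; replace the path with one of the downloaded files:
# for name, shape in dataset_shapes("train_input.h5").items():
#     print(name, shape)
```

The first entry of each shape is normally the number of rows, which is the figure to compare against the original data set, whatever the size on disk.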

Sandy4321 commented 4 years ago

Great, thanks a lot. But I am looking for a really simple Python baseline without complicated packages such as TF. Do you have one, or do you know somebody who has? Performance is not important; I am just trying to learn from the very beginning.

Atomu2014 commented 4 years ago

Hi, I suggest you try these packages: xgboost > libfm > libffm. Search for them on the Internet and find the official guides. These packages are easy to try since you don't need to touch the model; the only thing you have to do is prepare the data and call the API / CLI.
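As an illustration of the "prepare the data" step, here is a minimal sketch that writes categorical rows in the text format libffm reads, `label field:feature:value`. The function name `to_libffm_lines`, the in-memory row layout, and the toy values are assumptions made for the example; this is not code from this repository.

```python
def to_libffm_lines(rows, field_names):
    """Convert (label, {field: category}) rows to libffm text lines.

    Each line is "<label> <field>:<feature>:<value> ...". <field> is the
    position of the column in field_names, and <feature> is an integer id
    assigned to each distinct (field, category) pair when first seen.
    """
    feature_ids = {}
    lines = []
    for label, features in rows:
        parts = [str(label)]
        for field_idx, field in enumerate(field_names):
            if field not in features:
                continue  # missing value: emit nothing for this field
            key = (field, features[field])
            # len(feature_ids) is evaluated before insertion, so ids run 0, 1, 2, ...
            feat_id = feature_ids.setdefault(key, len(feature_ids))
            parts.append(f"{field_idx}:{feat_id}:1")  # one-hot, so value is 1
        lines.append(" ".join(parts))
    return lines


# Toy rows (made up for illustration): a click label plus two categorical fields.
rows = [
    (1, {"cat_0": "a", "cat_1": "x"}),
    (0, {"cat_0": "b", "cat_1": "x"}),
]
for line in to_libffm_lines(rows, ["cat_0", "cat_1"]):
    print(line)
# prints:
# 1 0:0:1 1:1:1
# 0 0:2:1 1:1:1
```

Writing these lines to a text file gives an input that the libffm command-line tools can train on. For a real data set the id mapping should be built on the training split and reused for the test split.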

Sandy4321 commented 4 years ago

Great, so where can I get the preprocessed Criteo data set? Per the description: "The original dataset is known as Criteo 1TB click log, in which the CriteoLab has collected 30 days of masked data. We only know there are 13 numerical and 26 categorical features, and there is no feature description released. Thus we name these features as num_0 ... num_12, and cat_0 ... cat_25."

Atomu2014 commented 4 years ago

Hi, there are 2 download links in the "Download" section of README. The processed dataset only contains 8 days' logs.