Include Music4All-Onion in the datasets

mmosc commented 2 years ago

Hi!

I modified the code to include a class for the dataset Music4All-Onion. So far, only the .inter file gets converted. One remark: for the file with the timestamp, I selected token as the type, since the timestamp is given as a date and time, for instance 2013-01-27 21:42:38. Maybe there is a better way?

There are a couple of ToDo's:

Include code for .item and .user files
Include a README with instructions on how to download the dataset and convert it
Upload the converted atomic files to your collection of files

I will work on the .item part in the next few days.

Thank you for this great library! Cheers, Marta

mmosc commented 2 years ago

The header of the .inter file was duplicated; I fixed that in the new commits and also added the code for the .item conversion. Since there are several item feature files, the code converts one of them at a time, selected by filename. I chose this because users might want to download and convert only one feature file rather than all of them.

Should I create a new pull request?

Meanwhile I will work on the README.

Cheers, Marta

hyp1231 commented 2 years ago

Hi!

Thanks for the great contribution! The conversion script looks fine. The only concern is about the type of several columns.

count:token. It seems that this column denotes how many times a user listens to the track. It might be better as a float type if the feature is numeric and can be compared. The token type is for discrete features that are more suitable for lookup embeddings.
timestamp:token. Could the string be converted into a UNIX timestamp in the provided scripts, for the convenience of comparing and sorting? For example, we can parse it with Python's time.strptime. Then this column would be better as a float type.
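
A minimal sketch of such a conversion, not the code from this PR. It uses time.strptime to parse the string and calendar.timegm to turn the result into seconds since the epoch. The helper name to_unix_timestamp is hypothetical, and treating the strings as UTC is an assumption, since the thread does not state the time zone of the dataset.

```python
import calendar
import time


def to_unix_timestamp(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a date-time string and return seconds since the epoch as a float.

    Assumption: the string is in UTC. If the dataset uses local time,
    time.mktime(parsed) would be the call to use instead.
    """
    parsed = time.strptime(text, fmt)
    return float(calendar.timegm(parsed))


# The example value quoted in this thread.
print(to_unix_timestamp("2013-01-27 21:42:38"))  # 1359322958.0
```

Stored this way, the values sort chronologically and can be written to a float column.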

Looking forward to including Onion :)

Cheers, Yupeng

hyp1231 commented 2 years ago

You can directly append the commits in this PR. :) Thanks!!

mmosc commented 2 years ago

Thanks for your feedback :)

I appended the new commits:

fix type of timestamp and counts
add item feature conversion
fix the duplicated header
add README for the conversion

Have a look and let me know!

hyp1231 commented 2 years ago

Looks good to me! Thanks so much.

By the way, may I download the processed files somewhere? I can upload them to our storage hubs, e.g., Google Drive. Then I'll merge this PR and update our websites etc.

If not, I can try to convert the original datasets into atomic files, and we can then compare the MD5 checksums.
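
One way the comparison could be done, sketched with Python's hashlib. The helper names md5_of_file and files_match are hypothetical, and the file paths would be whatever each side produced locally.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(path_a, path_b):
    """Return True if both files have the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

Each of us could run md5_of_file on our own converted atomic files and exchange the digests, so the files themselves do not need to be transferred first.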

mmosc commented 2 years ago

Thank you @hyp1231 !

The atomic files are not yet ready to download anywhere, since I did not process them all. You can try and convert the original dataset, as you were mentioning,

RUCAIBox / RecSysDatasets

Include Music4All-Onion in the datasets #109