bolcom / serenade-experiments-sigmod


How to start from raw datasets #1

Closed GresgentonG closed 1 year ago

GresgentonG commented 1 year ago

Hi, we are a research group interested in reproducing your results and running your algorithm on more public datasets. However, we are encountering some issues when handling the raw datasets. For example, for the retailrocket dataset we did manage to get create_serenade_indexes.py running on the transformed data you provided, which contains a column called SessionId, with the following modifications to the code:

modifications:

- config Spark to be able to handle `avro` files

```diff
@@ -32,43 +32,46 @@ def create_serenade_indexes(catalog_input_dir, qty_lookback_days, end_date, base
                          **app_settings)
     spark = (SparkSession.builder
+             .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.3.1")
              .getOrCreate())
     sc = spark.sparkContext
```

- ignoring the pipeline

```diff
@@ -51,24 +52,26 @@ def create_serenade_indexes(catalog_input_dir, qty_lookback_days, end_date, base
     context_path = base_input_dir + '/{' + ','.join(prev_days) + '}'
     print('context_path:' + str(context_path))
-    events10 = sql_context.read.format("avro").load(context_path)
-    events10 = events10.filter((fn.col('qty_detailpage') + fn.col('qty_purchased')) > fn.lit(0))
-    events10 = events10.withColumnRenamed('item_id', 'ItemId')
+    events10 = sql_context.read.option("header", True).csv(base_input_dir, sep="\t", inferSchema=True)
+    events10.printSchema()
+    # events10 = sql_context.read.format("avro").load(context_path)
+    # events10 = events10.filter((fn.col('qty_detailpage') + fn.col('qty_purchased')) > fn.lit(0))
+    # events10 = events10.withColumnRenamed('item_id', 'ItemId')
     events10 = events10.withColumn('ItemId', fn.col('ItemId').cast(LongType()))
-    events10 = events10.withColumn('Time', (fn.col('timestamp') / fn.lit(1000)).cast(IntegerType()))  # milliseconds to seconds conversion
-    events10 = events10.drop(*['timestamp'])
-
-    my_pipeline = Pipeline(stages=[SelectCommerciallyViableCustomers(),
-                                   Sessionizer(),
-                                   MostRecentSessionItem(),
-                                   # minimum support for session, item, session.
-                                   MinimumColumnSupport(inputCol='SessionId', min_support=params.min_session_length),
-                                   MinimumColumnSupport(inputCol='ItemId', min_support=params.min_item_support),
-                                   MinimumColumnSupport(inputCol='SessionId', min_support=params.min_session_length),
-                                   ])
-    full_df = my_pipeline.fit(events10).transform(events10)
+    events10 = events10.withColumn('Time', (fn.col('Time') / fn.lit(1000)).cast(IntegerType()))  # milliseconds to seconds conversion
+    # events10 = events10.drop(*['timestamp'])
+
+    # my_pipeline = Pipeline(stages=[SelectCommerciallyViableCustomers(),
+    #                                Sessionizer(),
+    #                                MostRecentSessionItem(),
+    #                                # minimum support for session, item, session.
+    #                                MinimumColumnSupport(inputCol='SessionId', min_support=params.min_session_length),
+    #                                MinimumColumnSupport(inputCol='ItemId', min_support=params.min_item_support),
+    #                                MinimumColumnSupport(inputCol='SessionId', min_support=params.min_session_length),
+    #                                ])
+    # full_df = my_pipeline.fit(events10).transform(events10)
     # Rename SessionId to VisitId to have more distinctive column names when exporting the sessionindex.
-    full_df = full_df.withColumnRenamed('SessionId', 'VisitId')
+    full_df = events10.withColumnRenamed('SessionId', 'VisitId')
```

- ignore everything about the catalog (which should be provided by `catalog_input_dir`), and the Harry Potter things

However, we are not sure everything is correct, since we are running the code on the transformed retailrocket data you provided, and we still do not know how to transform the original retailrocket data into the form accepted by the algorithm. To be more specific, the Time and ItemId columns seem to exist in the original dataset, but SessionId seems to be missing.

It would be very helpful if you could give us some hints on this, thanks!

bkersbergen commented 1 year ago

Hi @GresgentonG

1) preprocessing public data: Transforming raw public datasets into preprocessed ones is outside the scope of our work. The repo from Ludewig et al. contains code to preprocess the public datasets using pandas.

2) offline index creation: For datasets up to 200M rows it is perfectly fine to use the CSV reader in Serenade; you don't need to create an index offline. The offline index creation is only needed in our production setup, where we handle very large click datasets that do not fit on a single hard disk. We shared our Spark code to do this.
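For a small dataset the training input is just a plain text file. As a toy illustration (the values below are made up; see the quick guide in the Serenade repo for the authoritative format), the layout discussed in this thread is tab-separated with a header row and SessionId, ItemId and Time columns (Time in epoch seconds):

```python
# Toy example of a training file in the layout discussed in this thread:
# tab-separated, header row, SessionId / ItemId / Time (epoch seconds).
# Values are made up; check the Serenade quick guide for the exact format.
import csv

rows = [
    (1, 9654, 1462803602),
    (1, 33043, 1462803655),
    (2, 32118, 1462803835),
    (2, 12352, 1462803894),
]

with open("train_sessions.txt", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["SessionId", "ItemId", "Time"])
    writer.writerows(rows)
```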

Let me know if you need help with setting up the experiments.

GresgentonG commented 1 year ago

Hi @bkersbergen,

Thank you very much for your quick reply! May we just confirm our understanding below:

  1. For the smaller public datasets, we can feed a csv file with Time, ItemId and SessionId columns directly to the serenade server, without the need to compute the index offline using the pyspark code.

  2. The catalog file is only used for your private production data

Besides, we would also like to ask some more follow-up questions below:

  1. How to generate the SessionId column for public datasets?

    • Is this where we need to make use of the Sessionizer? If so, we also noticed that there is a bui column mentioned in the code of the Sessionizer, which, to our understanding, is used as a unique user identifier. But the retailrocket dataset, for example, does not contain such information, so how should we handle this?
    • Or do we need to look into the code by Ludewig et al? If so, we would appreciate it a lot if you could give us some information on how to quickly start using their code to generate the SessionId column, or more generally to preprocess more public datasets.
  2. Is it still possible to compute the offline index for smaller public datasets (even if it is not necessary)? If so,

    1. can we safely ignore everything related to the catalog file?
    2. and can we expect a performance improvement from computing the index offline instead of feeding the data directly into the serenade server? In other words, is there any limitation when feeding smaller datasets directly? If we directly feed the csv file into the serenade server, are we just running vs-knn instead of vmis-knn, since we do not have the index pre-computed? Or is there still an index computed on-the-fly when we provide the csv files?

Thank you!

bkersbergen commented 1 year ago

Hi,

1) Correct. The happy flow is to have Serenade read a csv file when training the model; it then internally computes its efficient index.

2) Correct. When using CSV files this code is not executed.

3) I can help you with that.

4) Yes, you can, but I would not advise it. There is no performance gain during inference whether you precompute the index or use the CSV reader. VS-kNN has an inference complexity of O(|s| · H), where H is the number of historical sessions in the training data, which becomes a problem when working with datasets of billions of clicks. VMIS-kNN has an inference complexity of O(|s| · m).
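As a back-of-the-envelope illustration (the numbers are made up, purely to show the scale; m here is, roughly, the small, fixed number of most recent historical sessions kept in the index):

```latex
% Illustrative numbers only:
% |s| = 10 clicks in the evolving session, H = 10^9 historical sessions,
% m = 500 most recent sessions kept in the index.
\underbrace{O(|s| \cdot H)}_{\text{VS-kNN}} \approx 10 \cdot 10^{9} = 10^{10}
\qquad \text{vs.} \qquad
\underbrace{O(|s| \cdot m)}_{\text{VMIS-kNN}} \approx 10 \cdot 500 = 5 \cdot 10^{3}
```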

Perhaps you can describe what it is that you would like to test, and then I can help you set it up.

twitu commented 1 year ago

Currently we're trying to run serenade on the public datasets retailrocket and rsc15 from scratch: first to reproduce the results, and then to run different datasets on it and compare.

Starting with retailrocket we're trying to compare the test results for indexes generated from the gdrive dataset and from the original retailrocket kaggle dataset.

We are trying to convert the kaggle dataset into a csv file so it can be fed into the server. However, we are stuck here and would be very grateful if you could share how you converted it for your tests. After significant modifications to the pyspark job in create_serenade_indexes.py, we are stuck at creating session ids.

  1. The raw dataset has a visitorid field instead of bui.
  2. The pipeline partitions the dataset by visitorid and orders by time.
  3. Using a threshold value, it finds unique sessions for each visitor, so something like this (using simplified notation):

    | visitorid | time (s)                   |
    |-----------|----------------------------|
    | A         | 10, 11, 12, 23, 24, 25, 30 |
    | B         | 89, 90, 91, 100            |

    | sessionid | time (s)   |
    |-----------|------------|
    | A_1       | 10, 11, 12 |
    | A_2       | 23, 24, 25 |
    | A_3       | 30         |
    | B_1       | 89, 90, 91 |
    | B_2       | 100        |

    The unique session id for each visitor is a string concatenating its visitor id with a counter. However, the input csv file requires session ids to be integers, so there must be a step that converts this into something like:

    | sessionid | time (s)   |
    |-----------|------------|
    | 1         | 10, 11, 12 |
    | 2         | 23, 24, 25 |
    | 3         | 30         |
    | 4         | 89, 90, 91 |
    | 5         | 100        |

Any pointers on how the dataset was transformed into the inputs for the benchmarks mentioned in the paper? A rough sketch of our current attempt is below.
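For context, this is roughly what our current attempt looks like in pandas (our own sketch, not code from this repo; the raw retailrocket column names and the 30-minute inactivity gap are our assumptions):

```python
# Our own sketch of the sessionization described above; not code from this
# repo. Assumes the raw retailrocket events.csv columns (timestamp in ms,
# visitorid, itemid) and a 30-minute inactivity gap between sessions.
import pandas as pd

GAP_SECONDS = 30 * 60  # assumed session threshold

events = pd.read_csv("events.csv", usecols=["timestamp", "visitorid", "itemid"])
events["Time"] = events["timestamp"] // 1000  # milliseconds -> seconds
events = events.sort_values(["visitorid", "Time"])

# Start a new session whenever the gap to the previous click of the same
# visitor exceeds the threshold.
gap = events.groupby("visitorid")["Time"].diff()
new_session = gap.isna() | (gap > GAP_SECONDS)

# A cumulative sum over the "new session" flags yields a global integer id,
# so the final SessionId is already an integer and not "visitorid_counter".
events["SessionId"] = new_session.cumsum().astype(int)

events = events.rename(columns={"itemid": "ItemId"})
events[["SessionId", "ItemId", "Time"]].to_csv(
    "retailrocket_sessions.txt", sep="\t", index=False
)
```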

twitu commented 1 year ago

There is one other remaining question, about the test dataset: if you could share how the test set was sampled from the original dataset, it would be a big help.

bkersbergen commented 1 year ago

The repo from Ludewig et al. was used to preprocess the public datasets with pandas. It splits the datasets, but also filters out sessions and items below a minimum support: https://github.com/rn5l/session-rec/blob/master/preprocessing/session_based/preprocess_retailrocket.py
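The core of that support filtering boils down to something like this rough pandas sketch (an approximation of the idea, not the exact code from session-rec; the thresholds are only indicative):

```python
# Rough pandas approximation of the minimum-support filtering done during
# preprocessing; not the exact session-rec code. Thresholds are indicative.
import pandas as pd

MIN_ITEM_SUPPORT = 5    # keep items that occur in at least 5 clicks
MIN_SESSION_LENGTH = 2  # keep sessions with at least 2 remaining clicks

def filter_min_support(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rare items first, then drop sessions that became too short.
    item_counts = df.groupby("ItemId").size()
    df = df[df["ItemId"].isin(item_counts[item_counts >= MIN_ITEM_SUPPORT].index)]

    session_lengths = df.groupby("SessionId").size()
    df = df[df["SessionId"].isin(session_lengths[session_lengths >= MIN_SESSION_LENGTH].index)]
    return df
```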

Once you have created the datasets on your own machine, you can have a look at the serenade server. It contains the VMIS-kNN algorithm behind an Actix webservice and comes with a quick guide on how to get started with Serenade.

GresgentonG commented 1 year ago

Hi, @bkersbergen

Really sorry for the late reply, and thank you very much for pointing us to the repo from Ludewig et al. We followed the instructions carefully, successfully reproduced the results, and are now able to proceed. Thank you very much for answering our questions!