Hi @GresgentonG
1) Preprocessing public data: Transforming raw public datasets into preprocessed ones is outside the scope of our work. The repo from Ludewig et al. contains code to preprocess public datasets using pandas.
2) Offline index creation: For datasets up to 200M rows it's perfectly fine to use the CSV reader in Serenade. You don't need to create an index offline. The offline index creation is only needed in our production setup, where we are handling very large click datasets that do not fit on a single hard disk. We shared our Spark code to do this.
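For reference, a minimal sketch of producing such a training CSV. The column names are the ones discussed in this thread; the tab separator, the seconds-resolution timestamps, and all data values are assumptions for illustration, so check the Serenade quick guide for the exact expected format:

```python
# Minimal sketch, not taken from the Serenade repo: write a toy training file
# with the three columns discussed in this thread. The tab separator and the
# seconds-resolution Time values are assumptions; the ids are made up.
import pandas as pd

train = pd.DataFrame(
    {
        "SessionId": [1, 1, 1, 2, 2],                 # integer session ids
        "ItemId": [101, 102, 103, 201, 202],          # toy item ids
        "Time": [1396439960, 1396440010, 1396440072,  # unix time in seconds
                 1396850671, 1396850702],
    }
)
train.to_csv("train_sessions.txt", sep="\t", index=False)
```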
Let me know if you need help with setting up the experiments.
Hi @bkersbergen,
Thank you very much for your quick reply! May we just confirm our understanding below:
1. For the smaller public datasets, we can feed a CSV file with `Time`, `ItemId`, and `SessionId` columns directly to the Serenade server, without the need to compute the index offline using the PySpark code.
2. The catalog file is only used for your private production data.
In addition, we would also like to ask some follow-up questions:
3. How should we generate the `SessionId` column for public datasets? There is a `bui` column mentioned in the code of the sessionizer, which, to our understanding, is used as a unique user identifier. But the retailrocket dataset, for example, does not contain such information, so how should we handle this? It would help a lot if you could share how you generate the `SessionId` column, or more generally how to preprocess more public datasets.
4. Is it possible to still compute the offline index for the smaller public datasets (even if it is not necessary)? If so, how?
Thank you!
Hi,
1) Correct. The happy-flow is to have Serenade read a CSV file when training the model. It then internally computes its efficient index.
2) Correct. When using CSV files this code is not executed.
3) I can help you with that.
4) Yes, you can, but I would not advise it. There is no performance gain during inference whether you precompute the index or use the CSV reader. VS-kNN has an inference complexity of O(|s| H), where H is the number of historical sessions in the training data, which becomes a problem when working with datasets with billions of clicks. VMIS-kNN has an inference complexity of O(|s| m) (see the illustrative numbers below).
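To put the difference in perspective with purely illustrative numbers (not figures from the paper): for an evolving session of length |s| = 10, scanning H = 10^9 historical sessions means on the order of 10 × 10^9 = 10^10 operations per request for VS-kNN, whereas VMIS-kNN with a small constant of, say, m = 500 stays on the order of 10 × 500 = 5,000.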
Perhaps you can describe what it is that you would like to test; then I can help you set it up.
Currently we're trying to run Serenade on the public datasets (retailrocket and rsc15) from scratch: first to reproduce the results, and then to run different datasets on it and compare.
Starting with retailrocket, we're trying to compare the test results for indexes generated from the gdrive dataset and from the original retailrocket Kaggle dataset.
We are trying to convert the Kaggle dataset into a CSV file so it can be fed into the server. However, we are stuck here and would be very grateful if you could share how you converted it for your tests. After significant modifications to the PySpark job in `create_serenade_indexes.py`, we are stuck at creating session ids.
The retailrocket dataset has a `visitorid` field name instead of `bui`. Our approach so far:
- Group by `visitorid` and order by time.
- Using a threshold value, find unique sessions for the visitor.

So, something like this (using simplified notation):
| visitorid | time(s) |
|---|---|
| A | 10, 11, 12, 23, 24, 25, 30 |
| B | 89, 90, 91, 100 |

becomes:

| sessionid | time(s) |
|---|---|
| A_1 | 10, 11, 12 |
| A_2 | 23, 24, 25 |
| A_3 | 30 |
| B_1 | 89, 90, 91 |
| B_2 | 100 |
The unique session id for each visitor is a string concatenated with its visitor id. However, the input CSV file requires session ids to be integers, so there must be a step that converts this to something like:

| sessionid | time(s) |
|---|---|
| 1 | 10, 11, 12 |
| 2 | 23, 24, 25 |
| 3 | 30 |
| 4 | 89, 90, 91 |
| 5 | 100 |
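For concreteness, here is a small pandas sketch of the splitting and re-numbering described above. The tiny gap threshold is chosen only so the toy numbers from the tables split as shown, and the column names are our own assumptions, so this is not the preprocessing used in the paper:

```python
# Sketch only: split each visitor's events into sessions by an inactivity gap,
# then turn (visitor, session number) pairs into integer SessionIds.
# The 5-second gap matches the toy numbers above; real preprocessing would
# typically use something like a 30-minute gap.
import pandas as pd

GAP = 5  # assumed inactivity threshold (seconds) for this toy example

events = pd.DataFrame(
    {
        "visitorid": ["A"] * 7 + ["B"] * 4,
        "time": [10, 11, 12, 23, 24, 25, 30, 89, 90, 91, 100],
    }
).sort_values(["visitorid", "time"])

# A new session starts at a visitor's first event or after a gap >= GAP.
gap = events.groupby("visitorid")["time"].diff()
new_session = gap.isna() | (gap >= GAP)

# Per-visitor session number (A_1, A_2, ... in the notation above) ...
events["session_number"] = new_session.groupby(events["visitorid"]).cumsum()
# ... and a globally unique integer SessionId, as required by the input CSV.
events["SessionId"] = events.groupby(["visitorid", "session_number"]).ngroup() + 1

print(events)  # SessionIds 1..5, matching the second table
```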
Any pointers on how the dataset was transformed into inputs for the benchmarks mentioned in the paper?
There is one other question remaining about the test dataset: if you could share how the test set was sampled from the original dataset, it would be a big help.
The repo from Ludewig et al. was used to preprocess the public datasets using pandas. This splits the datasets, but also filters for sessions and items with a minimum support: https://github.com/rn5l/session-rec/blob/master/preprocessing/session_based/preprocess_retailrocket.py
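Roughly, that support filtering amounts to something like the sketch below; the threshold values here are placeholders and not necessarily the ones used in the linked script (note that the Spark pipeline shown elsewhere in this thread applies session, item, session support filters in sequence):

```python
# Rough sketch of minimum-support filtering in the session-rec style.
# MIN_ITEM_SUPPORT and MIN_SESSION_LENGTH are placeholder values.
import pandas as pd

MIN_ITEM_SUPPORT = 5
MIN_SESSION_LENGTH = 2

def filter_min_support(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only items that occur often enough across all sessions.
    item_counts = df["ItemId"].value_counts()
    df = df[df["ItemId"].isin(item_counts[item_counts >= MIN_ITEM_SUPPORT].index)]
    # Keep only sessions that still have enough events after item filtering.
    session_sizes = df.groupby("SessionId")["ItemId"].size()
    keep = session_sizes[session_sizes >= MIN_SESSION_LENGTH].index
    return df[df["SessionId"].isin(keep)]
```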
Once you have created the datasets on your own machines, you can have a look at the Serenade server. It contains the VMIS-kNN algorithm with an Actix web service, and it has a quick guide on how to get started with Serenade.
Hi, @bkersbergen
Really sorry for the late reply, and thank you very much for pointing us in the right direction with the repo from Ludewig et al. We have been following the instructions carefully, successfully reproduced the results, and are able to proceed. Thank you very much for answering our questions!
Hi, we are a research group interested in reproducing your results and running your algorithm on more public datasets. However, we are encountering some issues when handling the raw datasets. For the retailrocket dataset, for example, we do manage to get `create_serenade_indexes.py` running on the transformed data you provided, which contains a column called `SessionId`, with the following modifications to the code:

- config spark to be able to handle `avro` files

```diff
@@ -32,43 +32,46 @@ def create_serenade_indexes(catalog_input_dir, qty_lookback_days, end_date, base
                        **app_settings)
     spark = (SparkSession.builder
+             .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.3.1")
              .getOrCreate())
     sc = spark.sparkContext
```

- ignoring the pipeline

```diff
@@ -51,24 +52,26 @@ def create_serenade_indexes(catalog_input_dir, qty_lookback_days, end_date, base
     context_path = base_input_dir + '/{' + ','.join(prev_days) + '}'
     print('context_path:' + str(context_path))

-    events10 = sql_context.read.format("avro").load(context_path)
-    events10 = events10.filter((fn.col('qty_detailpage') + fn.col('qty_purchased')) > fn.lit(0))
-    events10 = events10.withColumnRenamed('item_id', 'ItemId')
+    events10 = sql_context.read.option("header", True).csv(base_input_dir, sep="\t", inferSchema=True)
+    events10.printSchema()
+    # events10 = sql_context.read.format("avro").load(context_path)
+    # events10 = events10.filter((fn.col('qty_detailpage') + fn.col('qty_purchased')) > fn.lit(0))
+    # events10 = events10.withColumnRenamed('item_id', 'ItemId')
     events10 = events10.withColumn('ItemId', fn.col('ItemId').cast(LongType()))
-    events10 = events10.withColumn('Time', (fn.col('timestamp') / fn.lit(1000)).cast(IntegerType()))  # milliseconds to seconds conversion
-    events10 = events10.drop(*['timestamp'])
-
-    my_pipeline = Pipeline(stages=[SelectCommerciallyViableCustomers(),
-                                   Sessionizer(),
-                                   MostRecentSessionItem(),
-                                   # minimum support for session, item, session.
-                                   MinimumColumnSupport(inputCol='SessionId', min_support=params.min_session_length),
-                                   MinimumColumnSupport(inputCol='ItemId', min_support=params.min_item_support),
-                                   MinimumColumnSupport(inputCol='SessionId', min_support=params.min_session_length),
-                                   ])
-    full_df = my_pipeline.fit(events10).transform(events10)
+    events10 = events10.withColumn('Time', (fn.col('Time') / fn.lit(1000)).cast(IntegerType()))  # milliseconds to seconds conversion
+    # events10 = events10.drop(*['timestamp'])
+
+    # my_pipeline = Pipeline(stages=[SelectCommerciallyViableCustomers(),
+    #                                Sessionizer(),
+    #                                MostRecentSessionItem(),
+    #                                # minimum support for session, item, session.
+    #                                MinimumColumnSupport(inputCol='SessionId', min_support=params.min_session_length),
+    #                                MinimumColumnSupport(inputCol='ItemId', min_support=params.min_item_support),
+    #                                MinimumColumnSupport(inputCol='SessionId', min_support=params.min_session_length),
+    #                                ])
+    # full_df = my_pipeline.fit(events10).transform(events10)

     # Rename SessionId to VisitId to have more distinctive column names when exporting the sessionindex.
-    full_df = full_df.withColumnRenamed('SessionId', 'VisitId')
+    full_df = events10.withColumnRenamed('SessionId', 'VisitId')
```

- ignore everything about the catalog (which should be provided by `catalog_input_dir`), and the Harry Potter things

But we are not sure if everything is correct, since we are running the code on the transformed retailrocket data provided by you, and we still do not know how to transform the original retailrocket data into the form that can be accepted by the algorithm. To be more specific, the `Time` and `ItemId` columns seem to exist in the original dataset, but `SessionId` seems to be missing. It would be very helpful if you could give us some hints on this, thanks!
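As a rough sketch of the missing step for the raw Kaggle data (assuming its events.csv has `timestamp`, `visitorid`, and `itemid` columns; the millisecond-to-second conversion mirrors the one in the diff above, and `SessionId` would still have to be derived, e.g. with the gap-based splitting discussed earlier in this thread):

```python
# Sketch only: load the raw retailrocket events.csv from Kaggle and rename
# columns into the Time / ItemId shape used by Serenade. The source column
# names (timestamp, visitorid, itemid) are assumptions about the Kaggle dump;
# SessionId is still missing and must be derived from visitorid + time gaps.
import pandas as pd

events = pd.read_csv("events.csv")  # raw Kaggle retailrocket export
events = events.rename(columns={"itemid": "ItemId"})
events["Time"] = (events["timestamp"] // 1000).astype("int64")  # ms -> s
events = events[["visitorid", "ItemId", "Time"]].sort_values(["visitorid", "Time"])
```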