franrruiz / shopper-src

Code for Shopper, a probabilistic model of shopping baskets
MIT License

Question about assigning session ids #6

Closed tblazina closed 3 years ago

tblazina commented 3 years ago

I had a question regarding data preparation:

In the project readme, it's stated:

"..each session corresponds to a specific date and items price configuration" and then from the valid input:

0    10    20    1
0    11    20    1
0    10    30    1
0    20    30    1
0    21    30    1
1    10    20    1
1    11    20    1
1    20    20    1

it is stated that users 0 and 1 share session_id 20, but I am still unclear on how to assign session_ids in my own data. Are session_ids shared if only one item in the baskets is shared? In the example above, all of user 0's items in the session 20 basket are also in user 1's session 20 basket, but user 1 has an additional item 20 in their basket. I assume the session_ids are shared in this case because both baskets contain items 10 and 11 at some price configuration. But in the following fake example:

0    10    30    1
0    20    30    1
0    21    30    1
1    10    30    1
1    11    30    1
1    12    30    1

Would assigning the same session_id, as is done here, be correct simply because users 0 and 1 both have item 10 in their baskets, despite not sharing any other items?

Any hints would be much appreciated, thank you!

franrruiz commented 3 years ago

Hi Tim,

The decision to assign the same or a different session_id is completely independent of whether the two users share (or do not share) items in their shopping baskets.

As indicated in the description: "Sessions allow to indicate multiple shopping trips for each user. In addition, each session corresponds to a specific date and items price configuration."

So the question you need to ask is: "do the shopping trips of users 0 and 1 share the same date and item prices?" If not, then you should assign different session_ids. If so, then I'd recommend assigning the same session_id.
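If it helps, here is a rough pandas sketch of that rule (the column names and the grouping are only illustrative, not something this repo provides, and the (item, price) pairs observed in each trip are used as a proxy for the full price configuration):

import pandas as pd

# Toy transactions: one row per (user, item), with the date of the trip and
# the price that user saw for that item.
df = pd.DataFrame({
    "user":  [0, 0, 1, 1, 1],
    "item":  [10, 11, 10, 11, 20],
    "date":  ["2020-01-05"] * 5,
    "price": [2.0, 3.0, 2.0, 3.0, 1.5],
})

# Two trips get the same session_id only if they share the date AND the
# observed item prices (an approximation of the full price configuration).
trips = (
    df.groupby(["user", "date"])
      .apply(lambda g: tuple(sorted(zip(g["item"], g["price"]))))
      .rename("price_config")
      .reset_index()
)
trips["session_id"] = pd.factorize(
    list(zip(trips["date"], trips["price_config"]))
)[0]

# Attach the session_id back to the transaction rows used to build train.tsv.
df = df.merge(trips[["user", "date", "session_id"]], on=["user", "date"])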

I hope that helps!

tblazina commented 3 years ago

Hi Francisco,

Thanks for the quick answer, and yes, I think this more or less answers my question about how to assign these. Maybe one last quick question for clarification:

Say, for example, one has the following case where one item, item 10, has two different prices because it was on promotion for one user and not the other (for example, in a personalized promotion setting), and the users have identical baskets:

0    10    20    1
0    11    20    1
0    12    20    1
1    10    30    1
1    11    30    1
1    12    30    1

then, in this case, because the price that the two users paid for item 10 (and hence the price configuration) is different, they would get different session_ids, as shown above. Is this correct?

Thanks again!

franrruiz commented 3 years ago

Yes, that is correct - that is the way to achieve that.

In general, when in doubt, it is not harmful to use different session_ids. The "only" disadvantage of using different ids is that you need to add more rows to item_sess_price.tsv and sess_days.tsv.
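(For reference, and from memory of the README formats, so double-check there: each extra session adds one row per item to item_sess_price.tsv, giving that item's price in that session, and one row to sess_days.tsv, mapping that session to its day.)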

Please feel free to ask if you have further questions.

tblazina commented 3 years ago

Thanks so much for the quick and thoughtful responses Francisco! 👍

tblazina commented 3 years ago

I do actually have another question @franrruiz. I've still not entirely understood how to resolve some issues with assigning the session_ids, and have essentially ended up assigning a session_id to every basket (in our case a basket is defined by a customer-date-time index at the transaction level). We have taken a sample of 1000 customers and looked at their transaction history over a 2.5 year period. This leads to ~11 million data points (at the transaction-article level) and ~880,000 baskets.

The number of users x articles is less than the 10^9 - 10^10 "limit" you recommend in the project README. However, I'm getting an error during initialization of the parameters:

Initializing latent parameters...
terminate called after throwing an instance of 'std::bad_alloc'

I have very little experience with C++, but I'm assuming this is an error related to not having enough memory. This was running on a VM on Google Cloud Platform with 4 CPUs, 15 GB RAM, and 1 NVIDIA Tesla T4 GPU.

I am wondering if this amount of data is already too much, if the huge number of session_ids is the problem, or if maybe something else comes to mind? I am also wondering roughly what sort of hardware is needed to train the model (perhaps you remember what you used when you trained it on the bigger data set reported in the paper)?

Thanks again!

franrruiz commented 3 years ago

Hi - It looks like the program read the data well, but then it hits an OOM error when initializing the latents.

To diagnose what's going on, can you copy & paste the output in the log file? The file should start with a few lines describing the data ("Nusers=xxx, Nitems=yyy", etc.), followed by the rest of the parameters.

tblazina commented 3 years ago

Ya, so if I set keepAbove to 100, it doesn't run into the OOM error, but nonetheless here is the logfile output from the run that had the error. Note: I was more or less using the parameters from your example in the README and have left out the price part of the model for now, until I figure out how to assign these session_ids correctly...

Data:
 +datadir=dat/mgb_data
 +Nusers=5698
 +Nitems=42579
 +Nsessions=880791
 +Ntrans=616492
 +Ntrans (test)=264237
 +Ndays=53
 +Nweekdays=1
 +NuserGroups=0
 +NitemGroups=64
 +Lines of train.tsv=7534590
 +Lines of test.tsv=3227050
 +Lines of validation.tsv=186
Parameters:
 +outdir=out/t616492-n5698-m42579-k10-intercept-users3-days10-lik1-avgCtxt1-shuffle1-eta0.01-zF0.1-nS3-batch100-chkout
 +K=10
 +Kgroup=0
 +fixKgroup=0
 +seed=0
 +rfreq=200000
 +saveCycle=100000
 +max-iterations=20000
 +negsamples=3
 +nsFreq=-1
 +likelihood=1
 +lookahead=0
 +avgContext=1
 +symmetricRho=0
 +checkout=1
 +shuffle=1
 +zeroFactor=0.100000
 +batchsize=100
 +userVec=3
 +itemIntercept=1
 +price=0
 +day=10
 +normPrice=1
 +normPriceMin=0
 +step_schedule=1
 +eta=0.010000
 +gamma=0.900000
 +valTolerance=0.000001
 +valConsecutive=5
 +keepOnly=-1
 +keepAbove=-1
 +thr_llh=-100000.000000
 +threads=1
Initialization:
 +stdIni=0.316228
 +iniPath=
 +iniFromGroup=
Hyperparameters:
 +s2rho=1.000000
 +s2alpha=1.000000
 +s2theta=1.000000
 +s2lambda=1.000000
 +rtegamma=1000.000000
 +shpgamma=100.000000
 +rtebeta=1000.000000
 +shpbeta=100.000000
 +s2delta=0.010000
 +s2mu=0.010000
franrruiz commented 3 years ago

Ok, thanks for the update. Yes, it looks like the error was simply due to the high number of items (Nusers*Nitems was about 240M, which is O(10^8) - perhaps still too large for this GPU).
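For a rough sense of scale (back-of-the-envelope only, and assuming the initialization allocates at least one dense Nusers x Nitems array in double precision, which I haven't verified for this run): 5698 x 42579 ≈ 2.4 x 10^8 entries, i.e. about 1.9 GB per such array, so just a handful of these allocations would already strain 15 GB of RAM.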

tblazina commented 3 years ago

Ok, makes sense! If we were to simply increase the number of GPUs on the machine we are using, I guess that would then help resolve this issue?

franrruiz commented 3 years ago

In theory yes, but in practice I believe the current implementation allows for 1 GPU only.

tblazina commented 3 years ago

Ah ok, yes, I wasn't sure about the implementation as it is. As I said, we have basically no experience with C++ on our team, so I don't think it's feasible to refactor the code to work on multiple GPUs. Thanks for the clarification.

Maybe a bit of a more theoretical question:

For our use case we are actually mostly interested in looking at the substitution metric, in the context of trying to understand what products potentially "cannibalize" each other.

Do you think it would make sense to take multiple smaller subsamples of the transaction data, train the model on these subsamples separately on different machines, and then somehow blend (take some sort of average, weighted average, or whatever) the resulting parameters where the samples overlap in terms of customers and products?

franrruiz commented 3 years ago

That depends on your application and how you create the subsets, but in general I'd advise against that, as it's likely to create severe biases.

tblazina commented 3 years ago

Ok, thanks so much again for all the help and clarifications; it's greatly appreciated, and we are looking forward to seeing what results we get from the model!

tblazina commented 3 years ago

If you'll entertain one more question:

We have run the model successfully, but I am having a bit of trouble understanding the calculation of the exchangeability metric you've defined in the paper. Specifically, I'm not fully grasping how $p_{k|c}$ and $p_{k|c'}$ are calculated from the estimated parameters. From formula (8) in the paper, psi_tc ends up being specific to each trip and therefore specific to each user, so I'm not entirely seeing how the matrix multiplications work out when you are simply trying to calculate the more general

$p_{k|c}$ and $p_{k|c'}$ in equation (13), which are the conditional probabilities based on some theoretical trip where you have 2 items in your basket.

If we for example have K=50, would psi_tc then have dimensions of (2 x 50)?

If so, I'm not following how you calculate the Psi in equation (4) and ultimately the probabilities in equation (3), as you are adding psi_tc to rho_c (which has dimensions of (number of articles x K)) multiplied by the alpha's (length 50).

Sorry if the question is a bit poorly formulated; overall, I'm just a bit confused about how to calculate these conditional probabilities directly from the parameters estimated by the model. Otherwise, when I look at seasonal effects and at cosine similarities with alpha_c, things look pretty reasonable! Thanks again for any hints you can give; they are greatly appreciated.

franrruiz commented 3 years ago

$p_{k|c}$ and $p_{k|c'}$ in equation (13), which are the conditional probabilities based on some theoretical trip where you have 2 items in your basket.

I think you meant "where you have 1 item in your basket".

If we for example have K=50, would psi_tc then have dimensions of (2 x 50)?

No - Note that \psi_tc involves inner products such as \theta * \alpha (Eq. 8), and so the result is a scalar. For the computation of the exchangeability score, we set \theta_u to the average of all the learned \theta_u's, \delta_w to the average of all the learned \delta_w's, and we drop the price term.
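Spelled out (writing the Eq. 8 terms from memory here, so double-check the notation against the paper), the base utility that remains after averaging and dropping the price term is something like $\psi_c = \lambda_c + \bar{\theta}^\top \alpha_c + \bar{\delta}^\top \mu_c$, one scalar per item c; stacking these over all items gives the vector of base utilities.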

tblazina commented 3 years ago

Thanks Francisco, yes you are correct, I meant calculating the conditional probability with one other item in the basket.

Ok, I had missed that these are inner products.

So for the exchangeability score, if you drop the price term and take the average of the learned \theta_u's and \delta_w's, you'd end up with \psi_tc as a vector whose length is the number of articles, representing the average utility of each item. And if you were calculating the conditional probability p(k|c), because there are no other items in the basket, you ignore the second part of equation 4, and p(c) becomes:

p(c) = exp(psi_tc[c]) / sum(exp(psi_tc))

and then the utility of item k, given c is in the basket

Psi = psi_tc[k] + rho_c[k] * (alpha_c[c] + alpha_c[k])   # inner product
p(k) = exp(Psi) / sum(exp(psi_tc[!c])) # exclude item c because it is no longer a possible choice

and

p_(k|c) = (p(c) * p(k)) / p(c)

and similarly for c', and then you can use these conditional probabilities in equation (13), correct?

Sorry for the exhaustive questions. I'm still not sure I'm understanding how the denominator in equation (3) is calculated, because when I calculate it as I've just explained, the numerator of p(k) is a very large number and the denominator not so much, and I'm a bit confused by the statement about zeroing out the probabilities in the paper:

"Let pk|c denote the probability of item k given that item c is the only item in the basket (in this definition, we zero-out the probabilities of items c, c′, and checkout)"

and I want to make sure I'm calculating this correctly 😀 Again, thanks for taking the time to answer all my questions.

franrruiz commented 3 years ago

And if you were calculating the conditional probability p_(k|c), because there are no other items in the basket you ignore the second part of equation 4

This is not correct: p_{k|c} precisely denotes the probability of item k, given that item c is in the basket. So there is one item in the basket: it is item c. So you have Psi_k = psi_k + rho_k * alpha_c, where the base psi_k is computed as I explained earlier.

With that, you can obtain Psi_k for all items k=1,...,N_items (each Psi_k is a scalar). After that, you need to set Psi_c=-inf and Psi_{c'}=-inf (that is what it means to zero-out the probabilities), and then you simply take the softmax, p_{k|c} \propto exp(Psi_k).

With that, you have p_{k|c}, i.e., the probability of purchasing each item k (for k=1,...,N_items) given that c is in the basket, and you have ensured that p_{k=c|c} = p_{k=c'|c} = 0. Now, you need to repeat the same procedure to obtain p_{k|c'}. After that, you simply need to compute the KL between both probability distributions.

tblazina commented 3 years ago

Thanks!

With that, you can obtain Psi_k for all items k=1,...,N_items (each Psi_k is a scalar). After that, you need to set Psi_c=-inf and Psi_{c'}=-inf (that is what it means to zero-out the probabilities), and then you simply take the softmax, p_{k|c} \propto exp(Psi_k).

just to restate:

So Psi_k is a vector of length N_items with the entries for items c and c' zeroed out, and if you take the softmax (i.e., exp(Psi_k) / sum(exp(Psi_k))), you end up with the conditional probabilities p_{k|c} (i.e., the probability of buying each item k conditional on c being in the basket). Repeat this for every other item c, and you'd end up with probability vectors for every item conditional on every other item being in the basket (so you could make a matrix with dimensions N_items x N_items), correct?

With that, you have p_{k|c}, i.e., the probability of purchasing each item k (for k=1,...,N_items) given that c is in the basket, and you have ensured that p_{k=c|c} = p_{k=c'|c} = 0. Now, you need to repeat the same procedure to obtain p_{k|c'}. After that, you simply need to compute the KL between both probability distributions.

Here is one more point of confusion for me: you say to calculate the KL between the two probability distributions. In this case, you're referring to these vectors of conditional probabilities as the distributions (because each is a vector that sums to 1), so you'd just calculate the KL between these vectors of conditional probabilities, p_{k|c} and p_{k|c'} and vice versa, as shown in equation (13), and then you can calculate this exchangeability metric between all items?

franrruiz commented 3 years ago

What you say in the first paragraph is correct - except each Psi_k is not a vector; it is a scalar (you can form a vector by considering all k=1,...,N_items).

What you say in the second paragraph also seems correct to me - what is the point of confusion?

Just to clarify a bit, in case it's useful, the procedure would be as follows:
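(In rough Python/NumPy terms; the variable names below are mine, not from this repo, and I'm assuming rho and alpha are loaded as N_items x K arrays and psi_base is the length-N_items vector of base utilities computed with the averaged theta/delta and without the price term.)

import numpy as np

def conditional_probs(psi_base, rho, alpha, c, c_prime):
    # Psi_k = psi_k + rho_k . alpha_c, one scalar per item k
    Psi = psi_base + rho @ alpha[c]
    # "Zero out" items c and c' (the checkout item, if it is included in
    # psi_base, should be zeroed out in the same way)
    Psi[[c, c_prime]] = -np.inf
    # Softmax: p_{k|c} proportional to exp(Psi_k)
    Psi = Psi - Psi[np.isfinite(Psi)].max()   # numerical stability
    p = np.exp(Psi)
    return p / p.sum()

def exchangeability_kls(psi_base, rho, alpha, c, c_prime, eps=1e-12):
    # The two KL terms between p_{k|c} and p_{k|c'}; equation (13) then
    # combines them (I'm leaving the exact combination to the paper).
    p = conditional_probs(psi_base, rho, alpha, c, c_prime)
    q = conditional_probs(psi_base, rho, alpha, c_prime, c)
    mask = (p > 0) & (q > 0)   # both distributions are exactly zero at c, c'
    kl_pq = np.sum(p[mask] * np.log((p[mask] + eps) / (q[mask] + eps)))
    kl_qp = np.sum(q[mask] * np.log((q[mask] + eps) / (p[mask] + eps)))
    return kl_pq, kl_qp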

tblazina commented 3 years ago

Yes, that is what I meant: Psi_k stacked into a vector over all items k=1,2,...,N_items. And ya, sorry, no real point of confusion; for some reason I was getting tripped up by the notation and not simply seeing that it is a softmax and therefore turns the utilities into probabilities. And thanks for the procedure description!

I'm doing the analysis of the parameters in Python. If you're interested, I could fork the repo and you could link it, or I could make a PR in this repo with a Python script or a Jupyter notebook containing some of the qualitative measures (item similarity, exchangeability, etc.) calculated from the output parameters of the model?

Maybe one last question: if we were interested in doing counterfactual analysis of item prices, I guess it would look similar in terms of averaging the per-user and per-item latent vectors, gamma_u and beta_c, then adjusting the normalized price r_tc up or down, and then seeing how the average purchase probability changes for the item by itself (or potentially in a basket with other items), correct?