Closed BrendanSchell closed 4 years ago
Hi @BrendanSchell,
Thanks for the feature request! The window can be adjusted to an infinite size. Do we want to generate labels for sessions?
For example, with a transactions table of customers, we can generate labels for each session by changing the target entity to the session ID. The cutoff time will be the first timestamp of the session.
import composeml as cp
df = cp.demos.load_transactions()
df.filter(regex='session|customer|amount|product').sample(n=5)
    session_id  product_id  amount  customer_id        session_start
24          20           5   83.33            5  2014-01-01 04:46:00
40           5           5   97.18            4  2014-01-01 01:11:30
72          26           5   42.81            1  2014-01-01 06:17:00
96          34           5  145.19            3  2014-01-01 08:24:50
69          18           5  133.49            1  2014-01-01 04:14:35
def did_purchase(session):
    return session['amount'].sum() > 0

lm = cp.LabelMaker(
    target_entity="session_id",
    time_index="session_start",
    labeling_function=did_purchase,
)
lt = lm.search(df, -1)
session_id cutoff_time did_purchase
id
0 1 2014-01-01 00:00:00 True
1 2 2014-01-01 00:17:20 True
2 3 2014-01-01 00:28:10 True
3 4 2014-01-01 00:44:25 True
4 5 2014-01-01 01:11:30 True
Is this the expected output?
Similar to the expected output, but I would want the customer_id to be there as well, since that's the ID I want to predict on in this case.
Oh okay, thanks for clarifying! I think there are two approaches that we can take. In both approaches, we iterate over each session for each customer.
The first approach is to allow a column to define the window size. This could be exposed through a separate parameter; it does not have to be the window_size parameter.
lm = cp.LabelMaker(
    target_entity="customer_id",
    time_index="session_start",
    window_size="session_id",
    labeling_function=did_purchase,
)
customer_id cutoff_time did_purchase
id
0 1 2014-01-01 00:00:00 True
1 1 2014-01-01 00:17:20 True
2 1 2014-01-01 00:28:10 True
3 2 2014-01-01 00:44:25 True
4 2 2014-01-01 01:11:30 True
The second approach is allowing customers and sessions to be the target entity. When you provide more than one target entity, only the first column is used as the instance id.
lm = cp.LabelMaker(
    target_entity=["customer_id", "session_id"],
    time_index="session_start",
    labeling_function=did_purchase,
)
customer_id cutoff_time did_purchase
id
0 1 2014-01-01 00:00:00 True
1 1 2014-01-01 00:17:20 True
2 1 2014-01-01 00:28:10 True
3 2 2014-01-01 00:44:25 True
4 2 2014-01-01 01:11:30 True
Does one of these approaches seem like a better API to you?
Sorry @jeff-hernandez I missed this before. I think I like the first approach better since it's more explicit. Thanks though, that looks exactly like what I was thinking!
Hi @BrendanSchell,
This feature is complete and should be available in the next release! Let me know if you have any questions or feedback. Thanks again for the feature request!
It would be really useful to be able to use windows that aren't time-based for target variable creation. For example, if I have a user and I want to predict whether they purchased within a session or not, I would want to basically take all of that user's sessions (or a subset of them) and make a slice per session ID (I think there will usually be some column like this that is an identifier). I would still want the output format to be the same (target_entity, cutoff_time, target variables). The default cutoff_time in this case would just be the first timestamp of that session ID.
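For reference, the requested behaviour can be sketched without any library at all. The following is only an illustration in plain Python, not how composeml implements it: the helper name `label_by_session`, the row keys (`time`, `amount`), and the sample rows are all made up for the example. It takes each customer's rows, makes one slice per session ID, uses the first timestamp of the session as the cutoff time, and keeps the customer as the instance ID.

```python
from datetime import datetime
from itertools import groupby


def label_by_session(rows, labeling_function):
    """Hypothetical helper: one label per (customer, session) slice."""
    # Sort so that rows of the same customer and session are adjacent
    # and ordered by time within each session.
    ordered = sorted(
        rows, key=lambda r: (r["customer_id"], r["session_id"], r["time"])
    )
    labels = []
    for (customer_id, _session_id), group in groupby(
        ordered, key=lambda r: (r["customer_id"], r["session_id"])
    ):
        window = list(group)
        labels.append({
            "customer_id": customer_id,         # instance ID stays the customer
            "cutoff_time": window[0]["time"],   # first timestamp of the session
            "did_purchase": labeling_function(window),
        })
    # Present the labels per customer in chronological order.
    labels.sort(key=lambda r: (r["customer_id"], r["cutoff_time"]))
    return labels


def did_purchase(window):
    # Same idea as the labeling function in the thread, on plain dicts.
    return sum(r["amount"] for r in window) > 0


# Made-up sample data, not taken from the composeml demo dataset.
rows = [
    {"customer_id": 1, "session_id": 10, "time": datetime(2014, 1, 1, 0, 0, 0), "amount": 20.0},
    {"customer_id": 1, "session_id": 10, "time": datetime(2014, 1, 1, 0, 5, 0), "amount": 5.0},
    {"customer_id": 1, "session_id": 11, "time": datetime(2014, 1, 1, 0, 17, 20), "amount": 0.0},
    {"customer_id": 2, "session_id": 12, "time": datetime(2014, 1, 1, 0, 44, 25), "amount": 12.5},
]

labels = label_by_session(rows, did_purchase)
for label in labels:
    print(label["customer_id"], label["cutoff_time"], label["did_purchase"])
```

The output has the same shape as the tables earlier in the thread (target entity, cutoff time, label), with one row per session rather than one row per time window.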