alteryx / compose

A machine learning tool for automated prediction engineering. It allows you to easily structure prediction problems and generate labels for supervised learning.
https://compose.alteryx.com
BSD 3-Clause "New" or "Revised" License

Allow non-time-based windows #99

Closed BrendanSchell closed 4 years ago

BrendanSchell commented 4 years ago

It would be really useful to be able to use windows that aren't time-based for target variable creation. For example, if I have a user and I want to predict whether they purchased within a session or not, I would want to basically take all of that user's sessions (or a subset of them) and make a slice per session ID (I think there will usually be some column like this that is an identifier). I would still want the output format to be the same (target_entity, cutoff_time, target variables). The default cutoff_time in this case would just be the first timestamp of that session ID.

jeff-hernandez commented 4 years ago

Hi @BrendanSchell,

Thanks for the feature request! The window can be adjusted to an infinite size. Do we want to generate labels for sessions?

For example, with a transactions table of customers, we can generate labels for each session by changing the target entity to the session ID. The cutoff time will be the first timestamp of the session.

import composeml as cp

# Load the demo transactions table and preview a few relevant columns.
df = cp.demos.load_transactions()
df.filter(regex='session|customer|amount|product').sample(n=5)

    session_id  product_id  amount  customer_id        session_start
24          20           5   83.33            5  2014-01-01 04:46:00
40           5           5   97.18            4  2014-01-01 01:11:30
72          26           5   42.81            1  2014-01-01 06:17:00
96          34           5  145.19            3  2014-01-01 08:24:50
69          18           5  133.49            1  2014-01-01 04:14:35

# Label a session as a purchase if its total amount is positive.
def did_purchase(session):
    return session['amount'].sum() > 0

# Use the session ID as the target entity so each session gets one label.
lm = cp.LabelMaker(
    target_entity="session_id",
    time_index="session_start",
    labeling_function=did_purchase,
)

# Passing -1 as the number of examples per instance removes the limit.
lt = lm.search(df, -1)

    session_id         cutoff_time did_purchase
id                                             
0            1 2014-01-01 00:00:00         True
1            2 2014-01-01 00:17:20         True
2            3 2014-01-01 00:28:10         True
3            4 2014-01-01 00:44:25         True
4            5 2014-01-01 01:11:30         True

Is this the expected output?

BrendanSchell commented 4 years ago

Similar to the expected output, but I would want the customer_id to be there as well since that's the id I want to predict on in this case

jeff-hernandez commented 4 years ago

Oh okay, thanks for clarifying! I think there are two approaches that we can take. In both approaches, we iterate over each session for each customer.

The first approach is to let a column define the windows, so that each session ID becomes one window. The parameter name is open: it could be a separate parameter rather than window_size.

lm = cp.LabelMaker(
    target_entity="customer_id",
    time_index="session_start",
    window_size="session_id",
    labeling_function=did_purchase,
)

    customer_id         cutoff_time did_purchase
id                                             
0             1 2014-01-01 00:00:00         True
1             1 2014-01-01 00:17:20         True
2             1 2014-01-01 00:28:10         True
3             2 2014-01-01 00:44:25         True
4             2 2014-01-01 01:11:30         True

The second approach is to allow both customers and sessions as the target entity. When more than one target entity is provided, only the first column is used as the instance ID.

lm = cp.LabelMaker(
    target_entity=["customer_id", "session_id"],
    time_index="session_start",
    labeling_function=did_purchase,
)

    customer_id         cutoff_time did_purchase
id                                             
0             1 2014-01-01 00:00:00         True
1             1 2014-01-01 00:17:20         True
2             1 2014-01-01 00:28:10         True
3             2 2014-01-01 00:44:25         True
4             2 2014-01-01 01:11:30         True
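
Both approaches aim at the same result: iterate over each session of each customer, use the first timestamp of the session as the cutoff time, and report the label against the customer. A minimal plain-Python sketch of that behavior follows. It is not how Compose implements it: the sample rows, the label_sessions helper, and the list-based did_purchase are all hypothetical and only illustrate the intended output shape.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (customer_id, session_id, timestamp, amount).
rows = [
    (1, 1, datetime(2014, 1, 1, 0, 0, 0), 20.0),
    (1, 1, datetime(2014, 1, 1, 0, 5, 0), 0.0),
    (1, 2, datetime(2014, 1, 1, 0, 17, 20), 0.0),
    (2, 3, datetime(2014, 1, 1, 0, 44, 25), 35.5),
]


def did_purchase(amounts):
    """Label a window as a purchase if its total amount is positive."""
    return sum(amounts) > 0


def label_sessions(rows, labeling_function):
    """Hypothetical helper: one label per (customer, session) window."""
    # Each distinct session of a customer forms one window of events.
    windows = defaultdict(list)
    for customer_id, session_id, timestamp, amount in rows:
        windows[(customer_id, session_id)].append((timestamp, amount))

    labels = []
    for (customer_id, _session_id), events in windows.items():
        # The cutoff time is the first timestamp seen in the session.
        cutoff_time = min(timestamp for timestamp, _ in events)
        label = labeling_function([amount for _, amount in events])
        # Only the customer is kept as the instance ID in the output.
        labels.append((customer_id, cutoff_time, label))
    return sorted(labels)


labels = label_sessions(rows, did_purchase)
for customer_id, cutoff_time, label in labels:
    print(customer_id, cutoff_time, label)
```

With these sample rows, customer 1 gets one label per session (the second is False because the session total is zero), and customer 2 gets a single label.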

Does one of these approaches seem like a better API to you?

BrendanSchell commented 4 years ago

Sorry @jeff-hernandez I missed this before. I think I like the first approach better since it's more explicit. Thanks though, that looks exactly like what I was thinking!

jeff-hernandez commented 4 years ago

Hi @BrendanSchell,

This feature is complete and should be available in the next release! Let me know if you have any questions or feedback. Thanks again for the feature request!