How to create a Label that indicates the Future Type (or state) of something

alteryx / compose

A machine learning tool for automated prediction engineering. It allows you to easily structure prediction problems and generate labels for supervised learning.

https://compose.alteryx.com

BSD 3-Clause "New" or "Revised" License


How to create a Label that indicates the Future Type (or state) of something #214

Open S-UP opened 3 years ago

S-UP commented 3 years ago

I wonder about the best approach to create a Label that generates forward-looking classes.

Example: A customer might purchase 12 times (on different dates).

I want to assign a label that says whether he/she will make another purchase (within X months) after a given event was observed. For example, after observing the first transaction, will this customer come back and register another transaction? If so, he/she should receive the label 'will purchase again'; otherwise, 'will NOT purchase again'.

From what I've seen, Compose always constructs labels using all events up until (but excluding) another event (for which the label is then set). So I wonder how to generate a label for the last transaction observed in the above example. The 12th transaction is the last one recorded, so we would assign 'will NOT purchase again' here, as we know the customer will not transact again.

The overall goal is to identify customers who are most likely to re-engage. Maybe there is also a more suitable modeling approach to this.

jeff-hernandez commented 3 years ago

Thanks for the question! Would a row-based window size be a good modeling approach? The row-based window size can get you the current purchase and the next purchase. Then, you can compare the times for labeling. I'll go through an example using this data.

import composeml as cp
import pandas as pd

df = pd.read_csv(
    'data.csv',
    parse_dates=['transaction_time'],
    index_col='transaction_id',
)

df

	transaction_time	amount	department	customer_id
transaction_id
351	2021-01-02 14:24:55	18.64	computers	1
101	2021-01-04 11:44:13	12.15	automotive	1
1	2021-01-12 03:44:33	78.91	grocery	1
501	2021-01-15 11:54:25	50.91	garden	1
651	2021-01-21 06:55:16	11.62	books	1
51	2021-01-21 22:06:39	94.62	electronics	1
801	2021-01-25 19:20:22	53.26	shoes	1
901	2021-02-07 16:57:13	58.74	movies	1
401	2021-02-08 14:50:14	42.83	kids	1
851	2021-02-10 08:38:04	69.11	baby	1
151	2021-02-21 01:53:37	55.02	computers	1
251	2021-02-21 13:01:35	55.99	jewelery	1

This labeling function will get a data slice with two rows: the current purchase and the next purchase. It also has a within parameter to determine whether the next purchase happened within a given time.

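def next_purchase(df, within):
    # A slice with fewer than two rows has no next purchase.
    if len(df) < 2:
        return False
    within = pd.Timedelta(within)
    # The slice is indexed by transaction time, so this is the elapsed time
    # between the current purchase and the next one.
    next_time = df.index[1] - df.index[0]
    return within >= next_time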

lm = cp.LabelMaker(
    target_entity='customer_id',
    time_index='transaction_time',
    labeling_function=next_purchase,
    window_size=2, # two rows to get current and next purchase
)

When running the search, the gap is set to one so that each data slice starts on the next purchase.

lt = lm.search(
    df=df.sort_values('transaction_time'),
    num_examples_per_instance=-1,
    gap=1, # one row to start on next purchase
    within='3 days',
    verbose=False,
)

lt

	customer_id	time	next_purchase
0	1	2021-01-02 14:24:55	True
1	1	2021-01-04 11:44:13	False
2	1	2021-01-12 03:44:33	False
3	1	2021-01-15 11:54:25	False
4	1	2021-01-21 06:55:16	True
5	1	2021-01-21 22:06:39	False
6	1	2021-01-25 19:20:22	False
7	1	2021-02-07 16:57:13	True
8	1	2021-02-08 14:50:14	True
9	1	2021-02-10 08:38:04	False
10	1	2021-02-21 01:53:37	True
11	1	2021-02-21 13:01:35	False
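To make the row-based window concrete, here is a minimal pandas-only sketch that walks over the same transaction times and reproduces the labels above without Compose. It assumes that `window_size=2` with `gap=1` means slice `i` covers rows `i` and `i + 1`; the loop is only an emulation for illustration, not how the label maker is implemented.

```python
import pandas as pd

def next_purchase(df, within):
    # A slice with fewer than two rows has no next purchase.
    if len(df) < 2:
        return False
    within = pd.Timedelta(within)
    next_time = df.index[1] - df.index[0]
    return within >= next_time

# Transaction times for customer 1, copied from the data above.
times = pd.to_datetime([
    '2021-01-02 14:24:55', '2021-01-04 11:44:13', '2021-01-12 03:44:33',
    '2021-01-15 11:54:25', '2021-01-21 06:55:16', '2021-01-21 22:06:39',
    '2021-01-25 19:20:22', '2021-02-07 16:57:13', '2021-02-08 14:50:14',
    '2021-02-10 08:38:04', '2021-02-21 01:53:37', '2021-02-21 13:01:35',
])
df = pd.DataFrame({'customer_id': 1}, index=times).sort_index()

# Assumption: a two-row window that advances one row at a time.
labels = [
    bool(next_purchase(df.iloc[i:i + 2], within='3 days'))
    for i in range(len(df))
]
```

The last slice holds only one row, so the final transaction is labeled False, which is the behavior asked about in the original question.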

Let me know if this approach can work.

S-UP commented 3 years ago

Interesting approach. Thanks for sharing!

Question: Why can you use next_time = df.index[1] - df.index[0] given the index of the data frame is not a time index?

jeff-hernandez commented 3 years ago

Why can you use next_time = df.index[1] - df.index[0] given the index of the data frame is not a time index?

The data frame slices that are given to the labeling function do have the time index set as the index. During the search, the label maker sets the time index as the data frame index.
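As a small illustration of why the subtraction works, the sketch below moves the time column into the index by hand. The `set_index` call only mimics what the slice looks like when it reaches the labeling function; it is an assumption for demonstration, not the label maker's internal code.

```python
import pandas as pd

# Two rows from the data above, indexed by transaction_id as in the CSV.
df = pd.DataFrame(
    {
        'transaction_time': pd.to_datetime(
            ['2021-01-21 06:55:16', '2021-01-21 22:06:39']
        ),
        'customer_id': [1, 1],
    },
    index=pd.Index([651, 51], name='transaction_id'),
)

# Assumption: mimic the slice handed to the labeling function by
# making the time column the index.
data_slice = df.set_index('transaction_time')

# Subtracting two timestamps from the index yields a Timedelta.
elapsed = data_slice.index[1] - data_slice.index[0]
```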

Does a row-based window size work for your use case?

S-UP commented 3 years ago

May I ask how you would extend your approach for situations where there is only one product type to consider?

Or, to stick with the above example: assume we are only interested in transactions where department == computers. The row-based approach takes two neighboring rows, whereas what is needed is a check of whether a computers transaction happens at any time within the specified time window.

Would be interested to hear your thoughts on this.

jeff-hernandez commented 3 years ago

@S-UP thanks for the question! In that case, I think it'd make sense to isolate the computer department before generating labels. We can group by the department and select computers.

computers = df.groupby('department').get_group('computers')

lt = lm.search(
    df=computers.sort_values('transaction_time'),
    num_examples_per_instance=-1,
    gap=1, # one row to start on next purchase
    within='3 days',
    verbose=False,
)
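For reference, the same subset can also be selected with a boolean mask. The sketch below uses a made-up three-row frame (not the full data above) and checks that both approaches pick the same rows.

```python
import pandas as pd

# Hypothetical sample with the same columns as the data above.
df = pd.DataFrame({
    'transaction_time': pd.to_datetime([
        '2021-01-02 14:24:55',
        '2021-01-04 11:44:13',
        '2021-02-21 01:53:37',
    ]),
    'department': ['computers', 'automotive', 'computers'],
    'customer_id': [1, 1, 1],
})

# Selection via groupby, as in the snippet above.
by_group = df.groupby('department').get_group('computers')

# Equivalent selection via a boolean mask.
by_mask = df[df['department'] == 'computers']
```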

S-UP commented 3 years ago

Thanks. I realize I should have been more explicit.

I still want to create labels per Transaction ID or, potentially, Transaction Date (i.e. aggregating all transactions into a single transaction date). So, if a customer purchases from Garden and does not purchase from Computer within the specified window, then there shall be a Next Purchase == False flag for the Garden transaction.

jeff-hernandez commented 3 years ago

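Ah okay, in that case, you can use the window_size to specify the time window and check whether the department of the first transaction occurs more than once in the window.

def next_purchase(df):
    # Department of the current (first) transaction in the window.
    department = df.iloc[0].department
    # True if that department shows up again within the window.
    return df.department.eq(department).sum() > 1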

lm = cp.LabelMaker(
    target_entity='customer_id',
    time_index='transaction_time',
    labeling_function=next_purchase,
    window_size='3d',  # time window
)

lt = lm.search(
    df=df.sort_values('transaction_time'),
    num_examples_per_instance=-1,
    gap=1,  # one to iterate over each transaction
    verbose=False,
)
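To show what the labeling function sees with a time-based window, here is a rough pandas-only emulation. The sample rows are partly made up, and the half-open window `[start, start + 3 days)` is an assumption about the boundary handling rather than a statement of how Compose cuts its slices.

```python
import pandas as pd

def next_purchase(df):
    department = df.iloc[0].department
    return df.department.eq(department).sum() > 1

# Hypothetical sample, indexed and sorted by transaction time.
times = pd.to_datetime([
    '2021-01-02 14:24:55',  # computers
    '2021-01-04 11:44:13',  # automotive
    '2021-01-05 09:00:00',  # computers (made-up row)
    '2021-01-12 03:44:33',  # grocery
])
df = pd.DataFrame(
    {'department': ['computers', 'automotive', 'computers', 'grocery']},
    index=times,
)

window = pd.Timedelta(days=3)
labels = {}
for start in df.index:
    # Assumption: half-open window [start, start + window).
    in_window = (df.index >= start) & (df.index < start + window)
    labels[start] = bool(next_purchase(df[in_window]))
```

Only the first transaction is labeled True here, because it is the only one followed by another purchase from the same department within three days.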

If you only want labels for a single department, you can also make it a parameter to the labeling function.

def next_purchase(df, department):
    return df.department.eq(department).sum() > 1

lt = lm.search(
    ...,
    department='computers',
)
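Calling the parameterized function directly on two hand-built slices shows how it behaves. Both frames are hypothetical and indexed by time, as the slices would be during a search.

```python
import pandas as pd

def next_purchase(df, department):
    return df.department.eq(department).sum() > 1

idx = pd.to_datetime([
    '2021-02-20 10:00:00',
    '2021-02-21 01:53:37',
    '2021-02-21 13:01:35',
])

# Hypothetical window containing two computers purchases.
two_computers = pd.DataFrame(
    {'department': ['computers', 'computers', 'jewelery']}, index=idx
)
# Hypothetical window that starts with garden and has one computers purchase.
one_computer = pd.DataFrame(
    {'department': ['garden', 'computers', 'jewelery']}, index=idx
)

label_two = bool(next_purchase(two_computers, department='computers'))
label_one = bool(next_purchase(one_computer, department='computers'))
```

Here `label_two` is True and `label_one` is False: the function counts how often the department appears in the window, so a window that starts with a different department needs two purchases from the chosen department to be labeled True. If a single later purchase should count in that case, the comparison would need to be adjusted.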