feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License

train_test_split that s #661

Closed Morgan-Sell closed 1 year ago

Morgan-Sell commented 1 year ago

Is your feature request related to a problem? Please describe.

Sometimes the observations/rows are not at the level at which we want to split a dataset into train and test.

For example, I have a dataset comprised of pharmacy claims. Each observation/row has a unique claim ID. I would like to predict whether an individual, who has multiple pharmacy claims, will start a certain type of medication.

I want to split the dataset into train and test, but I want to split the dataset by individual ID, not claim ID.

Describe the solution you'd like

I would like to see a train_test_split function that lets the user select the variable on which to split the dataset into train and test.
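
The requested behaviour could look roughly like the sketch below. The function name `train_test_split_by_group`, its parameters, and the toy claims data are all hypothetical; nothing here is part of Feature-engine or scikit-learn. It assumes pandas and NumPy are installed.

```python
import numpy as np
import pandas as pd


def train_test_split_by_group(df, group_col, test_size=0.2, random_state=None):
    """Hypothetical helper: split rows so each group lands wholly in train or test."""
    rng = np.random.default_rng(random_state)
    groups = df[group_col].unique()
    # Shuffle the unique group labels, then hold out a fraction of them.
    shuffled = rng.permutation(groups)
    n_test = max(1, int(round(len(shuffled) * test_size)))
    test_groups = shuffled[:n_test]
    # Assign every row to the side its group was assigned to.
    in_test = df[group_col].isin(test_groups)
    return df[~in_test].copy(), df[in_test].copy()


# Toy claims data: 5 individuals with 2 claims each (made-up values).
claims = pd.DataFrame({
    "claim_id": range(10),
    "individual_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
})
train, test = train_test_split_by_group(
    claims, "individual_id", test_size=0.2, random_state=0
)
```

Note that `test_size` here is a fraction of individuals, not of claims, so the row-level split can drift from 80/20 when individuals have very different numbers of claims.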

Describe alternatives you've considered

My current approach, using the pharmacy claims example above:

import random

test_size = 0.2
unique_ids = list(df["individual_id"].unique())
num_training_samples = int(len(unique_ids) * (1 - test_size))

# randomly sample the individuals that go into the training set
training_ids = random.sample(unique_ids, k=num_training_samples)

# create training and test sets; all claims of an individual stay together
is_train = df["individual_id"].isin(training_ids)
train_data = df[is_train].copy()
test_data = df[~is_train].copy()

solegalli commented 1 year ago

Hi @Morgan-Sell

I don't think this one is suitable for Feature-engine. We are not strictly creating features with this function. So I will close for now.

Not sure if this thread is relevant: https://stackoverflow.com/questions/61337373/split-on-train-and-test-separating-by-group
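
For reference, scikit-learn already ships a group-aware splitter, `GroupShuffleSplit`, which may cover this use case outside of Feature-engine. The following is a minimal sketch, assuming scikit-learn and pandas are installed; the toy DataFrame is made up to mirror the pharmacy claims example and is not taken from the linked thread.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy claims data: 5 individuals with 2 claims each (made-up values).
df = pd.DataFrame({
    "claim_id": range(10),
    "individual_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
})

# test_size is the fraction of groups (individuals), not rows, held out.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["individual_id"]))

# split() yields positional indices, so select rows with iloc.
train_data = df.iloc[train_idx]
test_data = df.iloc[test_idx]
```

Because the split is made over groups, no `individual_id` appears in both `train_data` and `test_data`.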