YafeiHan-MIT / TasteNet-MNL

PyTorch code for TasteNet-MNL

Generation of synthetic data #1

Open ryonsd opened 3 years ago

ryonsd commented 3 years ago

Dear Dr. Han,

I'm trying to implement the choice model proposed in your paper "A Neural-embedded Choice Model: TasteNet-MNL Modeling Taste Heterogeneity with Flexibility and Interpretability", and I'm stuck on the part that generates the synthetic data.

I cannot reproduce the same results. Specifically, the accuracy is the same for MNL and MNL_true. Should I scale any variables? I'd appreciate it if you could share advice or code.

Regards,

zhongguodeshizhe007 commented 2 years ago

Hi, can you share your processed data? The code shared by the author does not include the data-processing step.

ryonsd commented 2 years ago

Thank you for your reply.

This is my code to generate data.


import numpy as np
import pandas as pd
from scipy import stats

'''
generate data according to Section 4.1 (Synthetic data) and Appendix A
alternatives are 0 and 1
'''

N = 10000

# pre-allocate N rows so the column assignments below align with the DataFrame index
df = pd.DataFrame(index=range(N),
                  columns=["CHOICE", "COST_0", "COST_1", "TIME_0", "TIME_1", "INC", "FULL", "FLEX"])

df.FULL = stats.bernoulli.rvs(p=0.5, size=N)
df.FLEX = stats.bernoulli.rvs(p=0.5, size=N)
df.INC = df.FULL.apply(lambda x: np.random.lognormal(np.log(0.5), 0.25, 1)[0] if x == 1 else np.random.lognormal(np.log(0.25), 0.2, 1)[0])

df.COST_0 = (40 - 0.2) * np.random.rand(N) + 0.2
df.COST_1 = (40 - 0.2) * np.random.rand(N) + 0.2

df.TIME_0 = (90 - 1) * np.random.rand(N) + 1
df.TIME_1 = (90 - 1) * np.random.rand(N) + 1

# scaling *** necessary? ***
# df.COST_0 /= 10
# df.COST_1 /= 10
# df.TIME_0 /= 10
# df.TIME_1 /= 10

# choice probability
P_0_list = []
for i in range(len(df)):
    x = df.iloc[i]
    beta_time = -0.1 - 0.5*x.INC - 0.1*x.FULL + 0.05*x.FLEX \
                - 0.2*x.INC*x.FULL + 0.05*x.INC*x.FLEX + 0.1*x.FULL*x.FLEX
    V_0 = - x.COST_0 + beta_time * x.TIME_0
    V_1 = -0.1 - x.COST_1 + beta_time * x.TIME_1

    P_0 = np.exp(V_0) / (np.exp(V_0) + np.exp(V_1))
    P_0_list.append(P_0)

df["P_0"] = P_0_list

# sample the chosen alternative from P_0
df.CHOICE = df.P_0.apply(lambda x: np.random.choice([0, 1], p=[x, 1 - x]))
# MNL-true accuracy: predict the alternative with the higher probability
df["CHOICE_PRE"] = df.P_0.apply(lambda x: 0 if x >= 0.5 else 1)
ACC = (df.CHOICE == df.CHOICE_PRE).mean()

# drop the helper columns before saving
df = df.drop(columns=["P_0", "CHOICE_PRE"])

# print(df.CHOICE.value_counts())

df.to_csv("./synthetic_all.csv", index=False)

## split data (60% train, 20% validation, 20% test)
train, val, test = np.split(df.sample(frac=1, random_state=42), [6000, 6000 + 2000])

train.to_csv("./synthetic_train.csv", index=False)
val.to_csv("./synthetic_validation.csv", index=False)
test.to_csv("./synthetic_test.csv", index=False)
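Incidentally, the row-wise loop over `df.iloc[i]` can be replaced by a vectorized computation over whole columns, which is much faster for N = 10000. A minimal sketch with the same formula (the function name `choice_prob_alt0` is mine; it assumes the column names used above):

```python
import numpy as np
import pandas as pd

def choice_prob_alt0(df):
    """Probability of choosing alternative 0, computed for all rows at once."""
    beta_time = (-0.1 - 0.5 * df.INC - 0.1 * df.FULL + 0.05 * df.FLEX
                 - 0.2 * df.INC * df.FULL + 0.05 * df.INC * df.FLEX
                 + 0.1 * df.FULL * df.FLEX)
    V_0 = -df.COST_0 + beta_time * df.TIME_0
    V_1 = -0.1 - df.COST_1 + beta_time * df.TIME_1
    # subtract the row-wise max before exponentiating for numerical stability
    m = np.maximum(V_0, V_1)
    e_0, e_1 = np.exp(V_0 - m), np.exp(V_1 - m)
    return e_0 / (e_0 + e_1)
```

With this, the loop reduces to `df["P_0"] = choice_prob_alt0(df)`.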
zhongguodeshizhe007 commented 2 years ago

Thank you so much for your reply! Your kindness helped me a lot.

zhongguodeshizhe007 commented 2 years ago

I ran the code you provided, but the generated data still doesn't match the structure of the original code. Can you share your code and the data used?
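To pin down where the mismatch is, it may help to diff the column layout of the generated CSV against a reference file from the repository. A small hypothetical helper (`compare_columns` and the file paths are placeholders):

```python
import pandas as pd

def compare_columns(generated_path, reference_path):
    """Report column-name differences between two CSV files."""
    # nrows=0 reads only the header row of each file
    gen_cols = list(pd.read_csv(generated_path, nrows=0).columns)
    ref_cols = list(pd.read_csv(reference_path, nrows=0).columns)
    missing = [c for c in ref_cols if c not in gen_cols]
    extra = [c for c in gen_cols if c not in ref_cols]
    return missing, extra
```

For example, `compare_columns("./synthetic_train.csv", "<path to reference csv>")` returns the columns the generated file lacks and the ones it has in excess.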