datalev001 / tm_lifetime


trans_date <> transaction_date #2

Open MilesAheadToo opened 1 month ago

MilesAheadToo commented 1 month ago

This is your code:

```python
# Converted into a standardized datetime format
tran_df['InvoiceDate'] = pd.to_datetime(tran_df['InvoiceDate'])
tran_df['transaction_date'] = tran_df['InvoiceDate'].dt.date

cats_top = tran_df.Description.value_counts().reset_index()
cats_top_df = cats_top[cats_top['count']>1000]

# Filtered to keep only the high-frequency items:
pro_lst = list(set(cats_top_df['Description']))
tran_df_sel = tran_df[tran_df['Description'].isin(pro_lst)]
cols = ['Customer ID', 'Description', 'trans_date', 'Quantity']

# data to be used
tran_df_bs = tran_df_sel[cols]
```

I do not see any reference to `trans_date` until the last few rows. My code stops with an error at `tran_df_bs = tran_df_sel[cols]` because there is no column called `trans_date`.

Is this an error?
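For anyone hitting the same error, here is a minimal sketch of how the missing column can be confirmed before the selection runs. It uses a small made-up frame instead of the real dataset; the helper name `missing_columns` and the sample rows are hypothetical and not part of the repository.

```python
import datetime

import pandas as pd


def missing_columns(df, cols):
    """Return the requested column names that are absent from df, in order."""
    return [c for c in cols if c not in df.columns]


# Made-up frame that mimics the columns present at this point in the script:
# it has 'transaction_date' but not 'trans_date'.
tran_df_sel = pd.DataFrame({
    "Customer ID": [1, 2],
    "Description": ["A", "B"],
    "transaction_date": [datetime.date(2010, 1, 1), datetime.date(2010, 1, 2)],
    "Quantity": [3, 4],
})
cols = ["Customer ID", "Description", "trans_date", "Quantity"]

missing = missing_columns(tran_df_sel, cols)
print(missing)  # ['trans_date']
```

Selecting `tran_df_sel[cols]` on this frame raises a `KeyError`, which matches the behaviour described above.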

datalev001 commented 1 month ago

Sorry, there are two rows missing: https://github.com/datalev001/tm_lifetime/edit/main/code/repurchase_prophet_tm.py. See the corrected code below:

```python
# Load the dataset with proper encoding
tran_df = pd.read_csv('online_retail_II.csv', encoding="latin1")

# This step filters out rows that contain missing or invalid values in the
# key columns.
c1 = (tran_df['Invoice'].isnull() == False)
c2 = (tran_df['Quantity']>0)
c3 = (tran_df['Customer ID'].isnull() == False)
c4 = (tran_df['StockCode'].isnull() == False)
c5 = (tran_df['Description'].isnull() == False)
tran_df = tran_df[c1 & c2 & c3 & c4 & c5]

# This step involves further cleaning and filtering
grp = ['Invoice', 'StockCode','Description', 'Quantity', 'InvoiceDate']

# Duplicate transactions are removed
tran_df = tran_df.drop_duplicates(grp)

# Converted into a standardized datetime format
tran_df['InvoiceDate'] = pd.to_datetime(tran_df['InvoiceDate'])
tran_df['transaction_date'] = tran_df['InvoiceDate'].dt.date

# choose products with higher transactions
cats_top = tran_df.Description.value_counts().reset_index()
cats_top.columns = ['Description', 'count']
cats_top_df = cats_top[cats_top['count']>1000]

# Filtered to keep only the high-frequency items:
pro_lst = list(set(cats_top_df['Description']))
tran_df_sel = tran_df[tran_df['Description'].isin(pro_lst)]
tran_df_sel['trans_date'] = pd.to_datetime(tran_df_sel['transaction_date'], format='%Y-%m-%d')
cols = ['Customer ID', 'Description', 'trans_date', 'Quantity']

# data to be used
tran_df_bs = tran_df_sel[cols]
```
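A hedged variant of the last few steps, run on a small made-up frame rather than `online_retail_II.csv`, may help when checking the fix locally. It rests on two assumptions not stated in the thread: that naming the count column through `reset_index(name='count')` is an acceptable substitute for assigning `cats_top.columns`, and that taking a `.copy()` of the filtered frame is wanted so that adding `trans_date` does not trigger pandas' `SettingWithCopyWarning`. The function name `select_frequent`, the `min_count` parameter, and the sample rows are hypothetical.

```python
import pandas as pd


def select_frequent(tran_df, min_count):
    """Keep rows whose Description occurs more than min_count times and add trans_date."""
    # Name both columns explicitly so the result does not depend on how the
    # installed pandas version labels the output of value_counts().
    cats_top = (
        tran_df["Description"]
        .value_counts()
        .rename_axis("Description")
        .reset_index(name="count")
    )
    pro_lst = cats_top.loc[cats_top["count"] > min_count, "Description"]

    # .copy() makes the subset an independent frame before a column is added.
    tran_df_sel = tran_df[tran_df["Description"].isin(pro_lst)].copy()
    tran_df_sel["trans_date"] = pd.to_datetime(tran_df_sel["transaction_date"])

    return tran_df_sel[["Customer ID", "Description", "trans_date", "Quantity"]]


# Made-up sample; the script in this thread uses a threshold of 1000 on the real data.
sample = pd.DataFrame({
    "Customer ID": [1, 2, 3, 4],
    "Description": ["MUG", "MUG", "MUG", "LAMP"],
    "InvoiceDate": pd.to_datetime([
        "2010-01-01 10:00", "2010-01-02 11:00",
        "2010-01-03 12:00", "2010-01-04 13:00",
    ]),
    "Quantity": [1, 2, 3, 4],
})
sample["transaction_date"] = sample["InvoiceDate"].dt.date

tran_df_bs = select_frequent(sample, min_count=2)
print(tran_df_bs)
```

With `min_count=2`, only the three `MUG` rows survive the filter, and `trans_date` comes back as a datetime column.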

