ColtAllen / btyd

Buy Till You Die and Customer Lifetime Value statistical models in Python.
https://btyd.readthedocs.io/
Apache License 2.0

Documentation issue #74

Closed SSMK-wq closed 1 year ago

SSMK-wq commented 1 year ago

I see that the documentation refers to the Gamma-Gamma model as GammaGammaModel, as shown below, but what is available (what auto-populates on pressing Tab) is GammaGammaFitter, not GammaGammaModel. I think the documentation should be updated; otherwise the import statement doesn't work.

image

ColtAllen commented 1 year ago

Hey @SSMK-wq, if your IDE is anything like mine, GammaGammaFitter is only showing up in auto-populate because it is already used elsewhere in your script or notebook. GammaGammaModel is the newer version of this model. All models suffixed with Fitter will be removed after BTYD is out of beta.

I do see a typo to fix in the import statement though, so thanks for bringing this to my attention.

SSMK-wq commented 1 year ago

So, would we be able to use 'GammaGammaModel' now? It doesn't work in the import statement. Is it not available in the beta version yet?


ColtAllen commented 1 year ago

What library version are you using? It was added in 0.1b2, and I just tried importing it in 0.1b3 and it worked fine.

SSMK-wq commented 1 year ago

I upgraded to 0.1b3 and tried again, but it threw the error below:

AttributeError: 'Series' object has no attribute 'columns'

I pass the same columns as input (as we did for GammaGammaFitter()). That is, my code looks like this:

ggf = GammaGammaModel() # model object updated to GGmodel() instead of GGfitter()
ggf.fit(monetary_cal_df['frequency_cal'],monetary_cal_df['avg_monetary_value_cal']) # model fitting

The full error message is shown below:

> ```
> ggf = GammaGammaModel() # model object updated to GGmodel() instead of GGfitter()
> ggf.fit(monetary_cal_df['frequency_cal'],monetary_cal_df['avg_monetary_value_cal']) # model fitting
> # Prediction of expected amount of average profit
> monetary_cal_df["expct_avg_spend"] = ggf.conditional_expected_average_profit(monetary_cal_df['frequency_cal'], monetary_cal_df['avg_monetary_value_cal'])
> ```
> 
> AttributeError                            Traceback (most recent call last)
> Input In [64], in <cell line: 2>()
>       1 ggf = GammaGammaModel() # model object updated to GGmodel() instead of GGfitter()
> ----> 2 ggf.fit(monetary_cal_df['frequency_cal'],monetary_cal_df['avg_monetary_value_cal']) # model fitting
>       3 # Prediction of expected amount of average profit
>       4 monetary_cal_df["expct_avg_spend"] = ggf.conditional_expected_average_profit(monetary_cal_df['frequency_cal'], monetary_cal_df['avg_monetary_value_cal'])
> 
> File ~\Anaconda3\lib\site-packages\btyd\models\__init__.py:89, in BaseModel.fit(self, rfm_df, tune, draws)
>      63 def fit(self, rfm_df: pd.DataFrame, tune: int = 1200, draws: int = 1200) -> SELF:
>      64     """
>      65     Fit a custom pymc model with parameter prior definitions to observed RFM data.
>      66 
>    (...)
>      80 
>      81     """
>      83     (
>      84         self._frequency,
>      85         self._recency,
>      86         self._T,
>      87         self._monetary_value,
>      88         _,
> ---> 89     ) = self._dataframe_parser(rfm_df)
>      91     self._check_inputs(
>      92         self._frequency, self._recency, self._T, self._monetary_value
>      93     )
>      95     with self._model():
> 
> File ~\Anaconda3\lib\site-packages\btyd\models\__init__.py:214, in BaseModel._dataframe_parser(self, rfm_df)
>     200 def _dataframe_parser(self, rfm_df: pd.DataFrame) -> Tuple[np.ndarray]:
>     201     """
>     202     Parse input dataframe into separate RFM components. This is an internal method and not intended to be called directly.
>     203 
>    (...)
>     211         Tuple containing numpy arrays for Recency, Frequency, Monetary Value, T, and Customer ID (if provided).
>     212     """
> --> 214     rfm_df.columns = rfm_df.columns.str.upper()
>     216     # The load_cdnow_summary_with_monetary_value() function needs an ID column for testing.
>     217     if "ID" not in rfm_df.columns:
> 
> File ~\Anaconda3\lib\site-packages\pandas\core\generic.py:5575, in NDFrame.__getattr__(self, name)
>    5568 if (
>    5569     name not in self._internal_names_set
>    5570     and name not in self._metadata
>    5571     and name not in self._accessors
>    5572     and self._info_axis._can_hold_identifiers_and_holds_name(name)
>    5573 ):
>    5574     return self[name]
> -> 5575 return object.__getattribute__(self, name)
> 
> AttributeError: 'Series' object has no attribute 'columns'

ColtAllen commented 1 year ago

That's because GammaGammaModel has a streamlined API; the entire summary DF (of repeat customers) is passed in as a single argument rather than individual arrays. Try this instead:

ggm = GammaGammaModel()

# Rename columns to `frequency` and `monetary_value` before fitting the model
ggm.fit(monetary_cal_df)

exp_avg_spend = ggm.predict('avg_value')

clv = ggm.predict(
    'clv',
    transaction_prediction_model=bgm,  # a trained BetaGeoModel(); Fitter models cannot be used
    time=12,
    discount_rate=0.01,
    freq="D",
)
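
The renaming step mentioned in the snippet above can be sketched with plain pandas. This is only an illustration and not btyd code: the toy values are invented, the `_cal` column names are taken from the earlier snippet in this thread, and the `frequency > 0` filter is an assumption about what "repeat customers" means here.

```python
import pandas as pd

# Toy stand-in for monetary_cal_df; the values are invented for illustration.
monetary_cal_df = pd.DataFrame(
    {
        "frequency_cal": [0, 2, 5],
        "avg_monetary_value_cal": [0.0, 31.5, 12.25],
    }
)

# Selecting a single column yields a Series, which has no `.columns`
# attribute. That is what the traceback earlier in the thread complains about.
single_column = monetary_cal_df["frequency_cal"]
print(hasattr(single_column, "columns"))  # False

# Rename to the column names requested above and keep only repeat
# customers (assumed here to mean frequency > 0).
gg_input_df = (
    monetary_cal_df.rename(
        columns={
            "frequency_cal": "frequency",
            "avg_monetary_value_cal": "monetary_value",
        }
    )
    .query("frequency > 0")
    .reset_index(drop=True)
)
print(list(gg_input_df.columns))  # ['frequency', 'monetary_value']
```

A frame shaped like `gg_input_df` is what the comment above describes passing as the single argument to `ggm.fit(...)`. Whether additional columns are also required is not stated in this thread.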

Please note that the new Model objects take considerably longer to train, but they are less likely to overfit and also provide entire probability distributions for model interpretation and predictions:

The code cells below may require editing to run properly.

import arviz as az

# Fit the proposed new BetaGeo Bayesian model
bgm = BetaGeoModel().fit(rfm_df)

# Use ArviZ to plot posterior parameter distributions against the MLE estimates
axes = az.plot_trace(
    data=bgm._idata,
    var_names=["BetaGeoModel::a", "BetaGeoModel::b", "BetaGeoModel::alpha", "BetaGeoModel::r"],
    compact=True,
    backend_kwargs={
        "figsize": (12, 9),
        "layout": "constrained"
    },
)
fig = axes[0][0].get_figure()
# Figure objects expose suptitle(), not subtitle()
fig.suptitle("BG/NBD Model Trace")

(image: bg_nbd_arviz_plots)

import matplotlib.pyplot as plt
import seaborn as sns

# Infer p_alive distributions for each customer:
p_alive_full = bgm.predict('cond_prob_alive', sample_posterior=True)
p_alive = bgm.predict('cond_prob_alive')

# Plotting function to compare results:
# the shaded KDE is the sampled distribution, the dashed line is the point estimate
def plot_conditional_probability_alive(p_alive_full, p_alive, idx, ax):
    sns.kdeplot(x=p_alive_full[idx], color="C0", fill=True, ax=ax)
    ax.axvline(x=p_alive[idx], color="C1", linestyle="--")
    ax.set(title=f"idx={idx}")
    return ax

fig, axes = plt.subplots(
    nrows=3,
    ncols=3,
    figsize=(9, 9),
    layout="constrained"
)
for idx, ax in enumerate(axes.flatten()):
    plot_conditional_probability_alive(p_alive_full, p_alive, idx, ax)

# Figure objects expose suptitle(), not subtitle()
fig.suptitle("Conditional Probability Alive", fontsize=16)

(image: bg_nbd_pymc_prob_alive_plots)
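
To illustrate why a sampled posterior gives a whole distribution of P(alive) per customer while a point estimate gives a single number, here is a small sketch using the standard closed-form BG/NBD expression for P(alive). The parameter draws are randomly generated stand-ins rather than output from `bgm._idata`, the customer values are invented, and nothing here claims to mirror how `bgm.predict` computes its results internally.

```python
import random
import statistics


def bg_nbd_prob_alive(frequency, recency, T, r, alpha, a, b):
    """Closed-form BG/NBD P(alive) for one customer and one parameter set."""
    if frequency == 0:
        # With no repeat purchases the model gives P(alive) = 1.
        return 1.0
    ratio = (alpha + T) / (alpha + recency)
    return 1.0 / (1.0 + (a / (b + frequency - 1)) * ratio ** (r + frequency))


rng = random.Random(0)

# Hypothetical "posterior draws" of the four BG/NBD parameters.
# These ranges are made up purely for illustration.
draws = [
    {
        "r": rng.uniform(0.22, 0.26),
        "alpha": rng.uniform(4.0, 4.8),
        "a": rng.uniform(0.7, 0.9),
        "b": rng.uniform(2.2, 2.7),
    }
    for _ in range(1000)
]

# One invented customer: 3 repeat purchases, last one at day 20, observed for 38 days.
customer = {"frequency": 3, "recency": 20.0, "T": 38.0}

# One P(alive) value per parameter draw: a distribution, not a single number.
p_alive_draws = [bg_nbd_prob_alive(**customer, **params) for params in draws]

# Single-number summary: evaluate the formula once at the mean parameters.
mean_params = {
    name: statistics.mean(d[name] for d in draws) for name in ("r", "alpha", "a", "b")
}
p_alive_point = bg_nbd_prob_alive(**customer, **mean_params)

# Approximate 94% interval from the 3rd and 97th percentiles of the draws.
cuts = statistics.quantiles(p_alive_draws, n=100)  # 99 cut points
lower, upper = cuts[2], cuts[96]

print(f"point estimate: {p_alive_point:.3f}")
print(f"interval from draws: [{lower:.3f}, {upper:.3f}]")
```

The interval is the extra information a sampled posterior carries: two customers with the same point estimate can have very different spreads.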

More information is provided in these PR writeups: https://github.com/ColtAllen/btyd/pull/24, https://github.com/ColtAllen/btyd/pull/33

SSMK-wq commented 1 year ago

@ColtAllen - What is avg_value in ggm.predict()? Where do we get that column from? Do you mean monetary_value?

And what is rfm_df in BetaGeoModel().fit(rfm_df)? Where do you get this rfm_df from?

Sorry, I couldn't find them in the documentation. Apologies if I missed it.

ColtAllen commented 1 year ago

# 'avg_value' is not a column. It's a string identifier for the predictive method.
conditional_expected_average_profit = ggm.predict(method='avg_value')

rfm_df = summary_data_from_transaction_data(*args)
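
For readers wondering what an RFM summary frame contains, the sketch below builds one by hand from a toy transaction log. It is a plain-Python illustration, not the library's `summary_data_from_transaction_data`. The transactions, the variable names, the daily granularity, and the exclusion of the first purchase from `monetary_value` are all assumptions, chosen to follow the usual Buy Till You Die conventions.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction log: (customer_id, purchase_date, amount).
transactions = [
    ("A", date(2022, 1, 1), 20.0),
    ("A", date(2022, 1, 15), 30.0),
    ("A", date(2022, 2, 10), 50.0),
    ("B", date(2022, 1, 20), 15.0),
]
observation_period_end = date(2022, 3, 1)

# Sum spend per customer per day, so several purchases on one day count once.
daily_spend = defaultdict(lambda: defaultdict(float))
for customer_id, day, amount in transactions:
    daily_spend[customer_id][day] += amount

rfm_rows = []
for customer_id, per_day in sorted(daily_spend.items()):
    days = sorted(per_day)
    first, last = days[0], days[-1]
    repeat_days = days[1:]  # every purchase day after the first

    frequency = len(repeat_days)                # number of repeat purchase days
    recency = (last - first).days               # age at last purchase, in days
    T = (observation_period_end - first).days   # age at end of observation, in days
    # Average spend over repeat purchase days only (assumed convention).
    monetary_value = (
        sum(per_day[d] for d in repeat_days) / frequency if frequency else 0.0
    )

    rfm_rows.append(
        {
            "id": customer_id,
            "frequency": frequency,
            "recency": recency,
            "T": T,
            "monetary_value": monetary_value,
        }
    )

for row in rfm_rows:
    print(row)
```

Each row corresponds to one customer, and the `frequency`, `recency`, `T`, and `monetary_value` fields match the components that the `fit` method in the traceback above unpacks from its input frame.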

This conversation has also inspired me to refactor calibration_and_holdout_data so that it outputs separate calibration and holdout dataframes, because having to rename the columns in order to use it with any of the new models is cumbersome. While I'm at it I'll also rename it from calibration/holdout to train/test, because most people are more familiar with the latter convention.

Also, my bad: GammaGammaModel and ModBetaGeoModel aren't showing up in the API Reference. I'll get that updated ASAP.

ColtAllen commented 1 year ago

Changes have been made to the documentation. If there's nothing else, I'm gonna close this issue.

SSMK-wq commented 1 year ago

Apologies for the delay. I have been traveling and couldn't attend to this earlier.