Unable to predict and fit when using categorical family

maxtheman commented 3 weeks ago

Hello there,

I have a simple model and several issues. I am unclear if they are related or not.

The first involves group effects:

model = bmb.Model(
    "gold_label ~ (predictor_a | predictor_b)",
    data=train_df,
    family="cumulative",
)
trace = model.fit()

With this data:

      gold_label  predictor_a  predictor_b
1606         4.0            2            0
717          1.0            1            1
1412         1.0            1            1
970          4.0            2            0
231          2.0            1            1
...          ...          ...          ...
465          3.0            1            0
823          2.0            1            0
84           4.0            1            1
1339         4.0            2            1
1273         4.0            1            1

[172 rows x 3 columns]

When I try to predict on new data with the exact same shape and categories (it is just a subset of the original data frame), I get an error:

prediction = model.predict(
    trace,
    data=test_df,
    kind="response",
    inplace=False,
)

ValueError: need at least one array to concatenate

The error occurred in the following call stack:

bambi/models.py:867 in Model.predict()
-> bambi/models.py:972 in Model._compute_likelihood_params()
  -> bambi/model_components.py:155 in DistributionalComponent.predict()
    -> bambi/model_components.py:199 in DistributionalComponent.predict_common()
      -> formulae/matrices.py:258 in CommonEffectsMatrix.evaluate_new_data()
        -> numpy/lib/shape_base.py:652 in column_stack()
          -> numpy.concatenate()

This does not happen if I eliminate the group effects from the model.
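For context on the message itself: the last frames of the call stack are numpy's column_stack() and concatenate(). NumPy raises this exact ValueError when it is asked to stack an empty list of arrays. A minimal sketch, independent of Bambi, is below. Reading this as "no common-effect columns were available for the new data" is an assumption drawn from the traceback, not something confirmed in this thread.

```python
import numpy as np

# Stacking an empty list of columns triggers the same message as in the traceback.
# This only illustrates the NumPy behavior; it does not involve Bambi or formulae.
try:
    np.column_stack([])
    message = None
except ValueError as err:
    message = str(err)

print(message)
```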

Is this a bug? Or am I doing something wrong here?

Thank you for your help.

maxtheman commented 3 weeks ago

My second issue is related to the cardinality of the value I'm trying to infer.

I don't know if this is related to issue 1, but to avoid creating dupes I will post it here for now.

example_data = pd.DataFrame({
    "to_predict": [2, 3, 4],
    "predictor": [1, 0, 1],
})
test_model = bmb.Model(
    "to_predict ~ predictor",
    data=example_data,
    family="cumulative",
)

Running this returns the error:

IndexError: tuple index out of range

The error occurred in the following call stack:

bambi/models.py:227 in Model.__init__()
-> bambi/models.py:423 in Model._build_priors()
  -> bambi/priors/scaler.py:142 in PriorScaler.scale()
    -> bambi/priors/scaler.py:106 in PriorScaler.scale_threshold()
      -> bambi/terms/response.py:29 in ResponseTerm.data()
        -> bambi/families/univariate.py:179 in Cumulative.get_data()

Changing the predicted variable to an ordered categorical lets the model build, but fitting then raises a new error.

example_data = pd.DataFrame({
    "to_predict": pd.Categorical([2, 3, 4], ordered=True),
    "predictor": [1, 0, 1],
})
test_model = bmb.Model(
    "to_predict ~ predictor",
    data=example_data,
    family="cumulative",
)
example_fitted = test_model.fit()

AssertionError: 

The error occurred in the following call stack:

bambi/models.py:348 in Model.fit()
-> bambi/backend/pymc.py:131 in PyMCModel.run()
  -> bambi/backend/pymc.py:209 in PyMCModel._run_mcmc()
    -> pymc/sampling/mcmc.py:718 in sample()
      -> pymc/sampling/mcmc.py:223 in assign_step_methods()
        -> pytensor/gradient.py:633 in grad()
          -> pytensor/gradient.py:1425 in _populate_grad_dict()
            -> pytensor/gradient.py:1380 in access_grad_cache()
              -> pytensor/gradient.py:1057 in access_term_cache()
                -> pytensor/gradient.py:1210 in access_term_cache()
                  -> pytensor/graph/op.py:398 in Op.L_op()
                    -> pytensor/tensor/subtensor.py:1995 in IncSubtensor.grad()
                      -> pytensor/tensor/subtensor.py:2031 in _sum_grad_over_bcasted_dims()

This error does go away if I reduce it to two categories (only predicting 'a's and 'b's or 1's and 0's for example) or increase it to 4 categories.

I tried varying the type of the response, the types of the predictors, and the predictor values, but wasn't able to find any other patterns.

Cumulative just really doesn't like having 3 categories for some reason.

If you want me to split this out into a separate ticket, just let me know. I thought it might be related, so putting it here. The code above essentially is the same example as I provided in my initial comment.

If this is not an error on my end and is in fact a bug, if you can point me in the right direction, I'm happy to try to submit a fix.

tomicapretto commented 2 weeks ago

Hi @maxtheman, thanks for reporting these issues. I'm still investigating, but I can add one thing and ask for another.

The "cumulative" family expects a categorical response. The numeric one is not working because Bambi does not interpret numeric responses as categories. In the future we could handle this internally, but for now it's the user's responsibility.
Could you provide a reproducible example for the first problem? Thanks!
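A minimal sketch of the conversion described above, using only pandas and the toy data from the earlier comment. Listing the levels explicitly is a choice made here for illustration, so that a subset of the data used for prediction could reuse the same categories; it is not something Bambi is known to require.

```python
import pandas as pd

example_data = pd.DataFrame({
    "to_predict": [2, 3, 4],
    "predictor": [1, 0, 1],
})

# Declare the response as an ordered categorical with explicit levels,
# so the ordering does not depend on which values happen to be present.
levels = sorted(example_data["to_predict"].unique())
example_data["to_predict"] = pd.Categorical(
    example_data["to_predict"], categories=levels, ordered=True
)

print(example_data["to_predict"].cat.categories.tolist())
print(example_data["to_predict"].cat.ordered)
```

The resulting data frame can then be passed to bmb.Model as in the snippets above.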

tomicapretto commented 2 weeks ago

Ok, I found the root of the problem. It's connected to the usage of dims in a distribution with a transformation.

This is the implementation in PyMC

import numpy as np
import pymc as pm
import pytensor.tensor as pt

coords = {
    "threshold_dim": [0, 1],
    "to_predict_dim": [0, 1, 2],
    "__obs__": [0, 1, 2],
}

predictor = np.array([1, 0, 1])
observed = np.array([0, 1, 2])

with pm.Model(coords=coords) as model:
    b_predictor = pm.Normal("b_predictor")
    threshold = pm.Normal(
        "threshold",
        mu=[-2, 2],
        sigma=1,
        transform=pm.distributions.transforms.ordered,
        # dims="threshold_dim",  # If this is uncommented, we get the assertion error
    )

    eta = b_predictor * predictor
    eta_shifted = threshold - pt.shape_padright(eta)
    p = pm.math.sigmoid(eta_shifted)
    p = pt.concatenate(
        [
            pt.shape_padright(p[..., 0]),
            p[..., 1:] - p[..., :-1],
            pt.shape_padright(1 - p[..., -1]),
        ],
        axis=-1,
    )

    p = pm.Deterministic("p", p, dims=("__obs__", "to_predict_dim"))

    pm.Categorical("to_predict", p=p, observed=observed, dims="__obs__")

with model:
    idata = pm.sample()

maxtheman commented 2 weeks ago

Thank you for the reply @tomicapretto. I can drop down to PyMC as a workaround to the second issue for now.

I am a little unclear still, is this a bug in Bambi? Or a user error with an ambiguous message?

Based on your note, I am assuming the dims should be passed somewhere in here, perhaps conditionally, but currently aren't: https://github.com/bambinos/bambi/blob/46d5572b52940b8e07c0c6cfd0f0bb24eb83c233/bambi/backend/pymc.py#L207. If I am understanding that correctly, I'm happy to try to submit a PR resolving it.

Regarding the first error:

The following code tries every combination of column types and reproduces both errors; "Error 1" in its output corresponds to the first error.

import itertools

import bambi as bmb
import pandas as pd


def generate_type_combinations(example_data):
    """Return one copy of the data per categorical/numeric assignment of its columns."""
    columns = example_data.columns
    type_combinations = itertools.product([True, False], repeat=len(columns))
    all_variants = []
    for combo in type_combinations:
        df_variant = example_data.copy()
        combo_dict = {}
        for col, is_categorical in zip(columns, combo):
            if is_categorical:
                df_variant[col] = pd.Categorical(df_variant[col], ordered=True)
                combo_dict[col] = "categorical"
            else:
                df_variant[col] = df_variant[col].astype(float)
                combo_dict[col] = "numeric"
        all_variants.append({"data": df_variant, "types": combo_dict})
    return all_variants


example_data = pd.DataFrame({
    "to_predict": [2, 3, 4, 5],
    "predictor_a": [1, 0, 1, 0],
    "predictor_b": [1, 1, 0, 0],
})

variants = generate_type_combinations(example_data)

for variant in variants:
    try:
        print("\nTrying combination:", variant["types"])
        test_model = bmb.Model(
            "to_predict ~ (predictor_a | predictor_b)",
            data=variant["data"],
            family="cumulative",
        )
        test_idata = test_model.fit()
        test_model.predict(test_idata, data=variant["data"], inplace=True)
        print("✓ Success!")
    except Exception as e:
        # Report on `variant` directly: the failure may happen before fit or predict.
        variant["data"].info()
        if "need at least one array to concatenate" in str(e):
            print("✗ Error 1", str(e))
        elif "tuple index out of range" in str(e):
            print("✗ Error 2", str(e))
        else:
            raise

tomicapretto commented 2 weeks ago

This is the issue btw https://github.com/pymc-devs/pymc/issues/7554

maxtheman commented 2 weeks ago

Thanks @tomicapretto . I see what you mean now. I'll subscribe there to follow along.

Let me know if you need anything else on the original error, hopefully that example helps to clarify.

tomicapretto commented 1 week ago

@maxtheman if you upgrade to PyTensor 2.26, it should get fixed
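A small sketch for checking whether the installed PyTensor is at least 2.26, using only the standard library. The helper names version_tuple and has_fix are hypothetical, and the threshold (2, 26) is taken from the comment above rather than verified independently.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn a version string such as '2.26.0' into (2, 26) for comparison."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:2])


def has_fix(version, minimum=(2, 26)):
    # True when the major/minor version is at or above the given minimum
    return version_tuple(version) >= minimum


try:
    installed = metadata.version("pytensor")
    print(installed, "ok" if has_fix(installed) else "upgrade needed")
except metadata.PackageNotFoundError:
    print("pytensor is not installed in this environment")
```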

maxtheman commented 1 week ago

Amazing, thank you so much! I will close this issue and reopen it if that doesn't work. I'm working on a different part of the project right now, so I haven't had time to circle back to the modeling aspect.

bambinos / bambi

Unable to predict and fit when using categorical family #852