.predict() data, needs same number of samples as .fit() with kind="response"

sepro commented 2 months ago

I stumbled upon an issue that I believe was not present in earlier versions (~0.10)

I have data in reference_df where I fit a model on, using this model I would like to do predictions on additional data stored in target_df.

  formula = "c(a, b, c, d) ~ age + BMI"
  model = bmb.Model(
      formula,
      reference_df,
      family="multinomial",
  )
  trace = model.fit(
      random_seed=0, chains=4, draws=1000, cores=4
  )

  model.predict(
                trace,
                data=target_df.sample(reference_df.shape[0], replace=True),
                kind="response",
            )

Then I pick the trace's posterior_predictive apart to get the stats I need.

This used to work (back when kind="pps" was still the norm) without the .sample() step to get the same number of observations in both groups.

Is this intentional or did I stumble upon a bug or is this on me? For my use-case I don't think it matters much, though it probably adds a bit more randomness and uncertainty to the results.

GStechschulte commented 2 months ago

Hey @sepro thanks for raising the issue. What version of Bambi are you using?

I have the current main branch installed, and I cannot reproduce the error with regression models where family is gaussian or categorical (I don't have a multinomial example at my disposal). Would it be possible for you to provide a sample of the data to reproduce the error?

Thanks!

sepro commented 2 months ago

Thanks for the swift reply @GStechschulte , I have some code to generate synthetic data (see below). I'm using bambi 0.14.0 with pymc 5.16.2.

If I don't include the .sample step to ensure the number of observations are identical I get a ValueError

ValueError: Could not broadcast dimensions. Incompatible shapes were [(ScalarConstant(ScalarType(int64), data=1), ScalarConstant(ScalarType(int64), data=1), ScalarConstant(ScalarType(int64), data=200)), (ScalarConstant(ScalarType(int64), data=4), ScalarConstant(ScalarType(int64), data=1000), ScalarConstant(ScalarType(int64), data=50))].

import pandas as pd
import numpy as np

def build_data(labels=["a", "b", "c", "d"]):
    ref_size = 100
    target_size = 50
    ref_prevalence = [0.3, 0.3, 0.2, 0.2]
    target_prevalence = [0.3, 0.2, 0.3, 0.2]

    ref_df = pd.DataFrame(
        {
            "age": np.random.randint(18, high=80, size=ref_size),
            "BMI": np.random.normal(25, size=ref_size),
            "label": np.random.choice(
                labels, size=ref_size, replace=True, p=ref_prevalence
            ),
        }
    )
    target_df = pd.DataFrame(
        {
            "age": np.random.randint(18, high=80, size=target_size),
            "BMI": np.random.normal(25, size=target_size),
            "label": np.random.choice(
                labels, size=target_size, replace=True, p=target_prevalence
            ),
        }
    )

    # one hot encoding needed for model
    for label in labels:
        ref_df[label] = ref_df["label"].apply(lambda x: 1 if x == label else 0)
        target_df[label] = target_df["label"].apply(lambda x: 1 if x == label else 0)

    return ref_df, target_df

tomicapretto commented 2 months ago

@sepro I'm trying to solve this. Just want to double-check this is the error you're experiencing.

import bambi as bmb
import pandas as pd
import numpy as np

def build_data(labels=["a", "b", "c", "d"]):
    ref_size = 100
    target_size = 50
    ref_prevalence = [0.3, 0.3, 0.2, 0.2]
    target_prevalence = [0.3, 0.2, 0.3, 0.2]

    ref_df = pd.DataFrame(
        {
            "age": np.random.randint(18, high=80, size=ref_size),
            "BMI": np.random.normal(25, size=ref_size),
            "label": np.random.choice(
                labels, size=ref_size, replace=True, p=ref_prevalence
            ),
        }
    )
    target_df = pd.DataFrame(
        {
            "age": np.random.randint(18, high=80, size=target_size),
            "BMI": np.random.normal(25, size=target_size),
            "label": np.random.choice(
                labels, size=target_size, replace=True, p=target_prevalence
            ),
        }
    )

    # one hot encoding needed for model
    for label in labels:
        ref_df[label] = ref_df["label"].apply(lambda x: 1 if x == label else 0)
        target_df[label] = target_df["label"].apply(lambda x: 1 if x == label else 0)

    return ref_df, target_df

reference_df, target_df = build_data()
formula = "c(a, b, c, d) ~ age + BMI"
model = bmb.Model(
    formula,
    reference_df,
    family="multinomial",
)
trace = model.fit(random_seed=0)
model.predict(
    trace, 
    data=target_df.sample(20, replace=True), 
    kind="response", 
    inplace=False
)

ValueError: Could not broadcast dimensions. Incompatible shapes were [(ScalarConstant(ScalarType(int64), data=1), ScalarConstant(ScalarType(int64), data=1), ScalarConstant(ScalarType(int64), data=100)), (ScalarConstant(ScalarType(int64), data=2), ScalarConstant(ScalarType(int64), data=1000), ScalarConstant(ScalarType(int64), data=20))].

tomicapretto commented 2 months ago

It's definitely a bug. Have a look at this

https://github.com/bambinos/bambi/blob/516d7bd4fcfd2aea1b355f6873f45271ad48cf71/bambi/families/multivariate.py#L38-L41

There we're always getting 'n' from the original dataset, that's why it needs a data frame with the same number of rows. This problem is also present for DirichletMultinomial

We need to do something similar to what we do in binomial-like families, where we check whether we receive new data before determining the values for 'n'.

https://github.com/bambinos/bambi/blob/516d7bd4fcfd2aea1b355f6873f45271ad48cf71/bambi/families/univariate.py#L15-L23

edit I may be able to work on it soon-ish

sepro commented 2 months ago

@sepro I'm trying to solve this. Just want to double-check this is the error you're experiencing.

ValueError: Could not broadcast dimensions. Incompatible shapes were [(ScalarConstant(ScalarType(int64), data=1), ScalarConstant(ScalarType(int64), data=1), ScalarConstant(ScalarType(int64), data=100)), (ScalarConstant(ScalarType(int64), data=2), ScalarConstant(ScalarType(int64), data=1000), ScalarConstant(ScalarType(int64), data=20))].

Hi @tomicapretto , yes that is indeed the same error I got.

Let me know if I can help you out by testing a branch with a solution. After playing around with more synthetic data, I noticed this does impact my analysis more than I'm comfortable with. So I'm happy to help sorting this out any way I can.

tomicapretto commented 1 month ago

@sepro could you install from #847 and let me know if it's solving your issue?

sepro commented 1 month ago

Hi @tomicapretto, I've successfully installed #847 and can confirm it solves the issue!

Thanks for the swift intervention!

bambinos / bambi

.predict() data, needs same number of samples as .fit() with kind="response" #845