Hey @sepro, thanks for raising the issue. What version of Bambi are you using?
I have the current main branch installed, and I cannot reproduce the error with regression models where the family is gaussian or categorical (I don't have a multinomial example at my disposal). Would it be possible for you to provide a sample of the data to reproduce the error? Thanks!
Thanks for the swift reply @GStechschulte, I have some code to generate synthetic data (see below). I'm using bambi 0.14.0 with pymc 5.16.2.
If I don't include the .sample step to make the number of observations identical, I get a ValueError (a minimal sketch of the failing call versus the workaround follows the code below):
ValueError: Could not broadcast dimensions. Incompatible shapes were [(ScalarConstant(ScalarType(int64), data=1), ScalarConstant(ScalarType(int64), data=1), ScalarConstant(ScalarType(int64), data=200)), (ScalarConstant(ScalarType(int64), data=4), ScalarConstant(ScalarType(int64), data=1000), ScalarConstant(ScalarType(int64), data=50))].
import pandas as pd
import numpy as np


def build_data(labels=["a", "b", "c", "d"]):
    ref_size = 100
    target_size = 50
    ref_prevalence = [0.3, 0.3, 0.2, 0.2]
    target_prevalence = [0.3, 0.2, 0.3, 0.2]
    ref_df = pd.DataFrame(
        {
            "age": np.random.randint(18, high=80, size=ref_size),
            "BMI": np.random.normal(25, size=ref_size),
            "label": np.random.choice(
                labels, size=ref_size, replace=True, p=ref_prevalence
            ),
        }
    )
    target_df = pd.DataFrame(
        {
            "age": np.random.randint(18, high=80, size=target_size),
            "BMI": np.random.normal(25, size=target_size),
            "label": np.random.choice(
                labels, size=target_size, replace=True, p=target_prevalence
            ),
        }
    )
    # one hot encoding needed for model
    for label in labels:
        ref_df[label] = ref_df["label"].apply(lambda x: 1 if x == label else 0)
        target_df[label] = target_df["label"].apply(lambda x: 1 if x == label else 0)
    return ref_df, target_df
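Concretely, with the helper above, this is roughly the difference I mean (a sketch only; the model and row counts in my real analysis differ):

reference_df, target_df = build_data()

model = bmb.Model("c(a, b, c, d) ~ age + BMI", reference_df, family="multinomial")
trace = model.fit(random_seed=0)

# Raises the ValueError: target_df has 50 rows while the model was fit on 100
# model.predict(trace, data=target_df, kind="response", inplace=False)

# Workaround: resample the new data to the number of training observations
idata = model.predict(
    trace,
    data=target_df.sample(len(reference_df), replace=True),
    kind="response",
    inplace=False,
)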
@sepro I'm trying to solve this. Just want to double-check this is the error you're experiencing.
import bambi as bmb
import pandas as pd
import numpy as np


def build_data(labels=["a", "b", "c", "d"]):
    ref_size = 100
    target_size = 50
    ref_prevalence = [0.3, 0.3, 0.2, 0.2]
    target_prevalence = [0.3, 0.2, 0.3, 0.2]
    ref_df = pd.DataFrame(
        {
            "age": np.random.randint(18, high=80, size=ref_size),
            "BMI": np.random.normal(25, size=ref_size),
            "label": np.random.choice(
                labels, size=ref_size, replace=True, p=ref_prevalence
            ),
        }
    )
    target_df = pd.DataFrame(
        {
            "age": np.random.randint(18, high=80, size=target_size),
            "BMI": np.random.normal(25, size=target_size),
            "label": np.random.choice(
                labels, size=target_size, replace=True, p=target_prevalence
            ),
        }
    )
    # one hot encoding needed for model
    for label in labels:
        ref_df[label] = ref_df["label"].apply(lambda x: 1 if x == label else 0)
        target_df[label] = target_df["label"].apply(lambda x: 1 if x == label else 0)
    return ref_df, target_df


reference_df, target_df = build_data()

formula = "c(a, b, c, d) ~ age + BMI"
model = bmb.Model(
    formula,
    reference_df,
    family="multinomial",
)
trace = model.fit(random_seed=0)
model.predict(
    trace,
    data=target_df.sample(20, replace=True),
    kind="response",
    inplace=False,
)
ValueError: Could not broadcast dimensions. Incompatible shapes were [(ScalarConstant(ScalarType(int64), data=1), ScalarConstant(ScalarType(int64), data=1), ScalarConstant(ScalarType(int64), data=100)), (ScalarConstant(ScalarType(int64), data=2), ScalarConstant(ScalarType(int64), data=1000), ScalarConstant(ScalarType(int64), data=20))].
It's definitely a bug. Have a look at this:
There we're always getting 'n' from the original dataset; that's why it needs a data frame with the same number of rows. This problem is also present for DirichletMultinomial.
We need to do something similar to what we do in binomial-like families, where we check whether we receive new data before determining the values for 'n'.
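Roughly, the idea would be something along these lines (a hypothetical sketch only, with made-up helper and argument names, not the actual Bambi internals):

import numpy as np


def get_trial_counts(original_response, new_data=None, response_cols=None):
    # Hypothetical helper: determine 'n' for multinomial-like families.
    # If new data is passed at prediction time, compute the row-wise totals
    # from the new data instead of reusing the totals from the fit data.
    if new_data is not None and response_cols is not None:
        return new_data[list(response_cols)].to_numpy().sum(axis=1)
    # Otherwise fall back to the totals observed at fit time
    return np.asarray(original_response).sum(axis=1)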
Edit: I may be able to work on it soon-ish.
Hi @tomicapretto, yes, that is indeed the same error I got.
Let me know if I can help you out by testing a branch with a solution. After playing around with more synthetic data, I noticed this impacts my analysis more than I'm comfortable with, so I'm happy to help sort this out any way I can.
@sepro could you install from #847 and let me know if it solves your issue?
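(In case it helps, installing straight from the PR should be possible with something like pip install "git+https://github.com/bambinos/bambi.git@refs/pull/847/head", assuming the PR lives in the usual bambinos/bambi repository.)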
Hi @tomicapretto, I've successfully installed #847 and can confirm it solves the issue!
Thanks for the swift intervention!
I stumbled upon an issue that I believe was not present in earlier versions (~0.10). I have data in reference_df that I fit a model on; using this model I would like to make predictions on additional data stored in target_df. Then I pick the trace's posterior_predictive apart to get the stats I need (a rough sketch of the workflow is below).
This used to work (back when kind="pps" was still the norm) without the .sample() step to get the same number of observations in both groups. Is this intentional, did I stumble upon a bug, or is this on me? For my use case I don't think it matters much, though it probably adds a bit more randomness and uncertainty to the results.
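For completeness, the workflow looks roughly like this (a sketch using the synthetic data helper above; the variable name inside posterior_predictive is assumed to match the response term in the formula):

reference_df, target_df = build_data()

model = bmb.Model("c(a, b, c, d) ~ age + BMI", reference_df, family="multinomial")
trace = model.fit(random_seed=0)

# This is the prediction call from the report; without resampling target_df to
# the training size, it raised the ValueError shown above
idata = model.predict(trace, data=target_df, kind="response", inplace=False)

# Pick the posterior predictive apart, e.g. average predicted counts per
# observation across chains and draws
pps = idata.posterior_predictive["c(a, b, c, d)"]
mean_counts = pps.mean(dim=("chain", "draw"))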