bambinos / bambi

BAyesian Model-Building Interface (Bambi) in Python.
https://bambinos.github.io/bambi/
MIT License

ValueError: Dataset does not contain the dimensions: {'y_mean_obs'} #485

Closed Sum02dean closed 2 years ago

Sum02dean commented 2 years ago

OS: Linux Bambi: 0.7.1 Python: 3.8

Issue: I get the error above when the Bambi model and InferenceData (idata) objects are stored inside a dictionary or list. When they are not stored inside a collection, I see the expected behavior. This started after installing bambi 0.7.1 (my code worked previously).


function code ...
....

# Define model parameters
params = {
    'family': 'bernoulli',
    'chains': 3,
    'draws': 10,
    'tune': 10}

models = []
classifiers = []
for i in range(len(train_splits)):
        print("\nComputing predictions for sampling run {}".format(i + 1))
        x_train, y_train = train_splits[i]
        x_test, y_test = test_splits[i]

        # Run bambi model
        x_train['y'] = y_train.values

        # Get the function formula
        f = get_formula(x_train.columns[:-1])

        model = bmb.Model(f, x_train,  family=params['family'])
        clf = model.fit(draws=params['draws'], tune=params['tune'],
                        chains=params['chains'], init='auto')

        models.append(model)
        classifiers.append(clf)

        # Run predictions
        idata = model.predict(clf, data=x_test, inplace=False)
        mean_preds = idata.posterior["y_mean"].values
        predictions.append(mean_preds)

        # Collect outputs
        output_dict = {
            'predictions': predictions,
            'models': models,
            'classifiers': classifiers,
            'train_splits': train_splits,
            'test_splits': test_splits
        }
    return output_dict

With a single iteration I get a model of:

print(output['models'][0])

output:

Formula: y ~ neighborhood_transferred + fusion + cooccurence + coexpression + coexpression_transferred + experiments + experiments_transferred + database + database_transferred + textmining + textmining_transferred
Family name: Bernoulli
Link: logit
Observations: 127151
Priors:
  Common-level effects
    Intercept ~ Normal(mu: 0, sigma: 6.6279)
    neighborhood_transferred ~ Normal(mu: 0.0, sigma: 3.5808)
    fusion ~ Normal(mu: 0.0, sigma: 2.596)
    cooccurence ~ Normal(mu: 0.0, sigma: 3.6322)
    coexpression ~ Normal(mu: 0.0, sigma: 2.9462)
    coexpression_transferred ~ Normal(mu: 0.0, sigma: 2.9241)
    experiments ~ Normal(mu: 0.0, sigma: 2.6692)
    experiments_transferred ~ Normal(mu: 0.0, sigma: 2.8198)
    database ~ Normal(mu: 0.0, sigma: 2.9285)
    database_transferred ~ Normal(mu: 0.0, sigma: 2.5707)
    textmining ~ Normal(mu: 0.0, sigma: 3.5179)
    textmining_transferred ~ Normal(mu: 0.0, sigma: 3.7341)

And an Idata of:

print(output['classifiers'][0])
[image: screenshot of the printed InferenceData object]

When I try to make a prediction on new data with exactly the same column names as the training data:

output['models'][0].predict(idata=output['classifiers'][0], data=x, inplace=False)

output:

/mnt/mnemo5/sum02dean/sl_projects/handover/STRINGSCORE/src/scripts/nb.ipynb Cell 4' in <cell line: 1>()
----> 1 output['models'][0].predict(idata=output['classifiers'][0], data=x, inplace=False)

File ~/miniconda3/envs/string-score-2.0/lib/python3.8/site-packages/bambi/models.py:897, in Model.predict(self, idata, kind, data, draws, inplace)
    892 # 'linear_predictor' is of shape
    893 # * (chain_n, draw_n, obs_n) for univariate models
    894 # * (chain_n, draw_n, response_n, obs_n) for multivariate models
    896 if kind == "mean":
--> 897     idata.posterior = self.family.predict(self, posterior, linear_predictor)
    898 else:
    899     pps_kwargs = {
    900         "model": self,
    901         "posterior": posterior,
   (...)
    904         "draw_n": draw_n,
    905     }

File ~/miniconda3/envs/string-score-2.0/lib/python3.8/site-packages/bambi/families/univariate.py:18, in UnivariateFamily.predict(self, model, posterior, linear_predictor)
     16 # Drop var/dim if already present
     17 if name in posterior.data_vars:
---> 18     posterior = posterior.drop_vars(name).drop_dims(coord_name)
     20 coords = ("chain", "draw", coord_name)
     21 posterior[name] = (coords, mean)

File ~/miniconda3/envs/string-score-2.0/lib/python3.8/site-packages/xarray/core/dataset.py:4602, in Dataset.drop_dims(self, drop_dims, errors)
   4600     missing_dims = drop_dims - set(self.dims)
   4601     if missing_dims:
-> 4602         raise ValueError(
   4603             f"Dataset does not contain the dimensions: {missing_dims}"
   4604         )
   4606 drop_vars = {k for k, v in self._variables.items() if set(v.dims) & drop_dims}
   4607 return self.drop_vars(drop_vars)

ValueError: Dataset does not contain the dimensions: {'y_mean_obs'}

Accessing the bambi specific objects via list indexing or dictionary lookup causes the issue. When not using an iterable or collection, the code works fine:

# Get the function formula
xt = output['train_splits'][0][0] # <--- train features
y = output['train_splits'][0][1] #  <---- train labels
xt['y'] = y.values

f = get_formula(xt.columns[:-1])
print(f)

model = bmb.Model(f, xt, family='bernoulli')
clf = model.fit(draws=10, tune=10,
                chains=3, init='auto')

model.predict(idata=clf, data=x, inplace=False)
print("ran without errors")

output:

y ~ neighborhood_transferred + fusion + cooccurence + coexpression + coexpression_transferred + experiments + experiments_transferred + database + database_transferred + textmining
Modeling the probability that y==1
Only 10 samples in chain.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 4 jobs)
NUTS: [textmining, database_transferred, database, experiments_transferred, experiments, coexpression_transferred, coexpression, cooccurence, fusion, neighborhood_transferred, Intercept]

 100.00% [60/60 00:01<00:00 Sampling 3 chains, 0 divergences]
Sampling 3 chains for 10 tune and 10 draw iterations (30 + 30 draws total) took 2 seconds.
/mnt/mnemo5/sum02dean/miniconda3/envs/string-score-2.0/lib/python3.8/site-packages/pymc3/sampling.py:643: UserWarning: The number of samples is too small to check convergence reliably.
  warnings.warn("The number of samples is too small to check convergence reliably.")
Ran without errors

tomicapretto commented 2 years ago

Hi @Sum02dean, thanks for opening this issue!

You have the following within the for loop

        # Collect outputs
        output_dict = {
            'predictions': predictions,
            'models': models,
            'classifiers': classifiers,
            'train_splits': train_splits,
            'test_splits': test_splits
        }

shouldn't it go outside the for loop?

Also, can you guarantee that the name of the response is always y? (because it's hard-coded in the loop).

And above all, could you share a minimum reproducible example? It does not necessarily need to be your data. Some random data that looks like your data and the code to reproduce the problem would be ideal.

Sum02dean commented 2 years ago

Hi tomicapretto, params actually goes outside the function. I lazily placed it inside the function in the comment for demonstration purposes (sorry).

Yes, the label is always 'y'.

This seems to fix the issue for me: using inplace=True for the predict() call inside the training loop, then inplace=False for the predict() call outside the loop.

E.g:

Inside the loop:

# Get the function formula
f = get_formula(x_train.columns[:-1])
print(f)
model = bmb.Model(f, x_train,  family=params['family'])
clf = model.fit(draws=params['draws'], tune=params['tune'],
                      chains=params['chains'], init='auto')

# Run predictions
model.predict(clf, data=x_test, inplace=True)
mean_preds = clf.posterior["y_mean"].values

Outside the loop:

new_idata = output['models'][0].predict(output['classifiers'][0], data=x, inplace=False)
mean_preds = new_idata.posterior["y_mean"].values

It seems there was something strange with how I called predict with the inplace flag. I just don't recall this being an issue before.

As for minimal code, I can provide it, but I am currently working to a tight deadline. If the above fix makes sense to you, can we close the issue?

Sum02dean commented 2 years ago

ahh! I see the output_dict is in the for loop. Give me a moment. I will see if this fixes the issue. This should NOT be there :)

Sum02dean commented 2 years ago

I fixed this indentation issue, but the issue remained.

tomicapretto commented 2 years ago

@Sum02dean I'm sorry you're on a tight deadline, but the best way for me to help you is to have a reproducible example. Otherwise I need to generate some data, making assumptions and guessing what you're trying to do. Also, there may be other issues in the code that we're not seeing in this chunk.

Sum02dean commented 2 years ago

Yes, you are right, let's see if this helps.

Below is everything you need to simulate my data and the error (minus the bells and whistles)

import os
import pandas as pd
import numpy as np
import arviz as az
import bambi as bmb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import copy

# Functs
def get_formula(feature_names):
    """Generates the formula required for the bambi generalized linear model (GLM)

    :param feature_names: extracted columns names as list of string
    :type feature_names: list

    :return: a string formula containing the GLM functional formulae
    :rtype: string
    """
    template = ['{}'] * (len(feature_names))
    template = " + ".join(template)
    template = template.format(*list(feature_names))
    f = 'y ~ ' + template
    return f

def spoof_data(n_samples=100, n_features=12, scale=1.0):
    """Simulates a labelled classification dataset shaped like the real data"""

    # Generate fake classification data (scale is now passed through)
    x_sim, y_sim = make_classification(
        n_samples=n_samples, n_features=n_features,
        scale=scale, shuffle=True, random_state=None)

    col_names = ['neighborhood_transferred', 'fusion', 'cooccurence', 'coexpression',
                 'coexpression_transferred', 'experiments', 'experiments_transferred',
                 'database', 'database_transferred', 'textmining',
                 'textmining_transferred', 'cogs']

    features = pd.DataFrame(x_sim)
    features.columns = col_names
    features['labels'] = y_sim
    return features

def model_splits(x, y, test_ratio):
    """Splits each x and y set into train and test data respectively (NOT on COGS)

    :param x: x-data with protein names as index
    :type x: pandas.core.DataFrame

    :param y: y-labels
    :type y: iterable e.g list, or pandas.core.Series

    :param test_ratio: proportion of observations for testing
    :type test_ratio: float

    :return: train-test splits for both x-data and y-data
    :rtype: collection of pandas DataFrame objects
    """

    # Make copy
    data = copy.deepcopy(x)
    labels = copy.deepcopy(y)

    # Split the dataset using the scikit-learn implementation
    x_train, x_test, y_train, y_test = train_test_split(
        data, labels, test_size=test_ratio, shuffle=True)
    return x_train, x_test, y_train, y_test

def run_pipeline(x, params, train_ratio=0.8, n_runs=3):
    """Runs the entire modeling process

    :param x: x-data containing 'labels' and 'cogs' columns (which will be replaced later)
    :type x: pandas DataFrame object

    :param params: model hyper-parameter dictionary
    :type params: dict

    :param train_ratio: the proportion of data used for training, defaults to 0.8
    :type train_ratio: float, optional

    :param n_runs: the number of random train/test splits to fit, defaults to 3
    :type n_runs: int, optional

    :return: Returns an output dict containing key information.
    :rtype: dict
    """

    print("Beginning pipeline...")
    test_ratio = 1 - train_ratio
    train_splits = []
    test_splits = []
    models = []
    classifiers = []
    predictions = []

    # Pre-allocate the datasets
    for i in range(1, n_runs + 1):

        # Random stratification
        x_train, x_test, y_train, y_test = model_splits(
            x, x.labels, test_ratio=test_ratio)

        # Drop the labels from x-train and x-test
        x_train.drop(columns=['labels', 'cogs'], inplace=True)
        x_test.drop(columns=['labels', 'cogs'], inplace=True)

        # Store all of the unique splits
        train_splits.append([x_train, y_train])
        test_splits.append([x_test, y_test])

    # CML message
    print("Complete with no errors")
    print('Done\n')

    # Train across n-unique subsets of the data
    for i in range(len(train_splits)):
        print("\nComputing predictions for sampling run {}".format(i + 1))

        # Pull out data splits
        x_train, y_train = train_splits[i]
        x_test, y_test = test_splits[i]

        # Run bambi model
        x_train['y'] = y_train.values

        # Get the function formula
        f = get_formula(x_train.columns[:-1])

        model = bmb.Model(f, x_train,  family=params['family'])
        clf = model.fit(draws=params['draws'], tune=params['tune'],
                        chains=params['chains'], init='auto')

        # Run predictions
        idata = model.predict(clf, data=x_test, inplace=False)
        mean_preds = idata.posterior["y_mean"].values
        predictions.append(mean_preds)

        # Append models
        models.append(model)
        classifiers.append(clf)

    # Collect outputs
    output_dict = {
        'predictions': predictions,
        'models': models,
        'classifiers': classifiers,
        'train_splits': train_splits,
        'test_splits': test_splits
    }
    return output_dict

if __name__ == '__main__':
    # Define model parameters
    params = {
        'family': 'bernoulli',
        'chains': 3,
        'draws': 10,
        'tune': 10}

    # Generate fake data
    x = spoof_data()
    output = run_pipeline(
                x=x, params=params, n_runs=1)

    # Drop the columns that are not used as predictors
    x.drop(columns=['labels', 'cogs'], inplace=True)

    # The next line causes the error
    output['models'][0].predict(idata=output['classifiers'][0], data=x, inplace=False)

Throws:

ValueError                                Traceback (most recent call last)
/mnt/mnemo5/sum02dean/sl_projects/handover/STRINGSCORE/src/scripts/nb.ipynb Cell 6' in <cell line: 13>()
      9 x = spoof_data()
     10 output = run_pipeline(
     11             x=x, params=params, n_runs=1)
---> 13 output['models'][0].predict(idata=output['classifiers'][0], data=x, inplace=False)

File ~/miniconda3/envs/string-score-2.0/lib/python3.8/site-packages/bambi/models.py:897, in Model.predict(self, idata, kind, data, draws, inplace)
    892 # 'linear_predictor' is of shape
    893 # * (chain_n, draw_n, obs_n) for univariate models
    894 # * (chain_n, draw_n, response_n, obs_n) for multivariate models
    896 if kind == "mean":
--> 897     idata.posterior = self.family.predict(self, posterior, linear_predictor)
    898 else:
    899     pps_kwargs = {
    900         "model": self,
    901         "posterior": posterior,
   (...)
    904         "draw_n": draw_n,
    905     }

File ~/miniconda3/envs/string-score-2.0/lib/python3.8/site-packages/bambi/families/univariate.py:18, in UnivariateFamily.predict(self, model, posterior, linear_predictor)
     16 # Drop var/dim if already present
     17 if name in posterior.data_vars:
---> 18     posterior = posterior.drop_vars(name).drop_dims(coord_name)
     20 coords = ("chain", "draw", coord_name)
     21 posterior[name] = (coords, mean)

File ~/miniconda3/envs/string-score-2.0/lib/python3.8/site-packages/xarray/core/dataset.py:4602, in Dataset.drop_dims(self, drop_dims, errors)
   4600     missing_dims = drop_dims - set(self.dims)
   4601     if missing_dims:
-> 4602         raise ValueError(
   4603             f"Dataset does not contain the dimensions: {missing_dims}"
   4604         )
   4606 drop_vars = {k for k, v in self._variables.items() if set(v.dims) & drop_dims}
   4607 return self.drop_vars(drop_vars)

ValueError: Dataset does not contain the dimensions: {'y_mean_obs'}

aegonwolf commented 2 years ago

Hi, I have the same issue when I try to follow the example notebooks and adapt my pymc3 GLM. When I call predict with the fitted posterior, Bambi complains that the label is not present in the data.

aegonwolf commented 2 years ago

@Sum02dean try setting kind='pps'; this works for me (I thought of it because of the "mean" in the missing variable name y_mean_obs). Maybe @tomicapretto could explain why?

Edit: OK, I think I have the reason. Somewhere above the code I was running I had a predict call with inplace=True (the default), so the data later on was the posterior predictive.

Edit2: I think something is not right here; this shouldn't be happening. I am also using the Bernoulli family and have a few categorical variables.
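To make the inplace point concrete, here is a minimal sketch of the difference between mutating a stored object and working on a copy. It is a toy stand-in, not Bambi's implementation: `toy_predict` and the nested dict used in place of an InferenceData object are hypothetical names chosen for illustration only.

```python
import copy


def toy_predict(idata, inplace=True):
    """Toy stand-in for a predict call; NOT Bambi's implementation."""
    # Work on the caller's object, or on a deep copy of it
    target = idata if inplace else copy.deepcopy(idata)
    # Add a derived variable to the posterior group, as a prediction step would
    target["posterior"]["y_mean"] = [0.5, 0.5]
    return None if inplace else target


# A nested dict standing in for a fitted InferenceData object
stored = {"posterior": {"Intercept": [0.1, 0.2]}}

# inplace=False: the stored object is left untouched, a new one is returned
new = toy_predict(stored, inplace=False)
unchanged = "y_mean" not in stored["posterior"]

# inplace=True: the stored object itself now carries y_mean
toy_predict(stored, inplace=True)
mutated = "y_mean" in stored["posterior"]

print(unchanged, mutated)
```

If an object kept in a list or dict has already been through an in-place call, any later call sees the extra variable, which is the situation described in the edit above.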

tomicapretto commented 2 years ago

@Sum02dean thanks for sharing the snippet, that's exactly what I needed.

Unfortunately, I couldn't reproduce the issue. The code runs without problems on my side. Could you install the development version of Bambi?

This is the output of watermark

%load_ext watermark
%watermark -n -u -v -iv -w
Last updated: Wed Apr 13 2022

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.29.0

arviz : 0.11.4
bambi : 0.7.1
sys   : 3.8.5 (default, Sep  4 2020, 07:30:14) 
[GCC 7.3.0]
pandas: 1.3.1
numpy : 1.20.0

Watermark: 2.1.0

It says Bambi 0.7.1, but it's actually the development version.
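Since the version string alone cannot tell a release apart from a development install that reports the same number, one way to check is to look at the install metadata. The sketch below uses only the standard library; `describe_install` is a hypothetical helper name, and it assumes the installer recorded a `direct_url.json` file (PEP 610), which pip does for installs from a URL or VCS but which may be absent in other setups.

```python
import json
from importlib import metadata


def describe_install(package):
    """Return (version, source_url) for an installed package, or None if absent.

    source_url is None unless the installer recorded a direct URL
    (PEP 610 direct_url.json), e.g. for `pip install git+https://...`.
    """
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        return None
    # read_text returns None when the metadata file does not exist
    raw = dist.read_text("direct_url.json")
    url = json.loads(raw).get("url") if raw else None
    return dist.version, url


# Prints None if bambi is not installed in the current environment
print(describe_install("bambi"))
```

A release installed from PyPI would typically show no source URL, while an install from the repository would show the repository URL.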

tomicapretto commented 2 years ago

I think I found what's going on. The development version has the following chunk

https://github.com/bambinos/bambi/blob/be8c622eb6530e1d9a5071dfa1b1e90aad40921e/bambi/families/univariate.py#L16-L21

while the version you're using has

https://github.com/bambinos/bambi/blob/7d5a83f0bd8888a6c8136b01101548a9d23ef402/bambi/families/univariate.py#L16-L18

I'm sorry but I didn't recall this fix.

Installing from the main branch in the repository should fix this issue.
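For readers trying to understand the traceback, here is a toy sketch of one way a chained "drop the variable, then drop its dimension" call can fail. It is a guess at the failure mode read from the traceback, not the actual change in the linked commits, and `ToyDataset` is a hypothetical stand-in rather than xarray's implementation. The assumption is that a dimension only exists while some variable still uses it, so a strict drop of that dimension fails once its only variable is gone.

```python
class ToyDataset:
    """Minimal stand-in for a labelled dataset; NOT xarray's implementation."""

    def __init__(self, variables):
        # variables maps a name to the tuple of dimension names it uses
        self.variables = dict(variables)

    @property
    def dims(self):
        # A dimension exists only while some variable still uses it
        return {d for dims in self.variables.values() for d in dims}

    def drop_vars(self, name):
        return ToyDataset({k: v for k, v in self.variables.items() if k != name})

    def drop_dims(self, dim, errors="raise"):
        if dim not in self.dims:
            if errors == "raise":
                raise ValueError(
                    f"Dataset does not contain the dimensions: {{{dim!r}}}"
                )
            return self  # tolerant mode: nothing to drop
        return ToyDataset({k: v for k, v in self.variables.items() if dim not in v})


posterior = ToyDataset({
    "Intercept": ("chain", "draw"),
    "y_mean": ("chain", "draw", "y_mean_obs"),
})

# Strict chained drop: y_mean_obs disappears together with y_mean, so this raises
try:
    posterior.drop_vars("y_mean").drop_dims("y_mean_obs")
    strict_failed = False
except ValueError as err:
    strict_failed = True
    message = str(err)

# Tolerant drop: a missing dimension is simply skipped
tolerant = posterior.drop_vars("y_mean").drop_dims("y_mean_obs", errors="ignore")
print(strict_failed, sorted(tolerant.dims))
```

xarray's own `Dataset.drop_dims` accepts an `errors` argument with the values "raise" and "ignore", which is what the toy method mirrors.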

Sum02dean commented 2 years ago

Thank you @tomicapretto, this solved the issue for me. @aegonwolf, I reinstalled from main as tomicapretto recommended: first uninstalling bambi, then running: pip install -U git+https://github.com/bambinos/bambi.git@main

Many thanks,

Dean