Closed Sum02dean closed 2 years ago
Hi @Sum02dean, thanks for opening this issue!
You have the following within the for loop
# Collect outputs
output_dict = {
'predictions': predictions,
'models': models,
'classifiers': classifiers,
'train_splits': train_splits,
'test_splits': test_splits
}
shouldn't it go outside the for loop?
Also, can you guarantee that the name of the response is always y? (I ask because it's hard-coded in the loop.)
And above all, could you share a minimum reproducible example? It does not necessarily need to be your data. Some random data that looks like your data and the code to reproduce the problem would be ideal.
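To illustrate the first point, here is a minimal sketch of building the output dictionary once, after the loop has finished. The `fit_one` helper is a hypothetical stand-in for the model fitting step and is not part of Bambi.

```python
def fit_one(run_id):
    """Hypothetical stand-in for fitting a model; returns a label per run."""
    return f"model-{run_id}"


def run_all(n_runs):
    models = []
    for run_id in range(n_runs):
        # Only accumulate per-run results inside the loop
        models.append(fit_one(run_id))

    # Build the output dictionary once, after every run has been appended
    output_dict = {"models": models}
    return output_dict


output = run_all(3)
print(output["models"])
```

Because the lists are appended to in place, a dictionary rebuilt on every iteration ends up with the same contents, so this is mostly a matter of clarity and is unlikely to explain the error on its own.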
Hi tomicapretto, Params actually goes outside the function. Here I lazily placed it into the function in the comment for demonstration purposes (sorry).
Yes, the label is always 'y'.
This seems to fix the issue for me: setting inplace=True for the predict() call inside the training loop, then setting inplace=False for the predict() call outside the loop.
E.g:
inside loop
# Get the function formula
f = get_formula(x_train.columns[:-1])
print(f)
model = bmb.Model(f, x_train, family=params['family'])
clf = model.fit(draws=params['draws'], tune=params['tune'],
                chains=params['chains'], init='auto')
# Run predictions
model.predict(clf, data=x_test, inplace=True)
mean_preds = clf.posterior["y_mean"].values
outside loop:
new_idata = output['models'][0].predict(output['classifiers'][0], data=x, inplace=False)
mean_preds = new_idata.posterior["y_mean"].values
It seems there was something strange with how I called predict with the inplace flag. I just don't recall this being an issue before.
As for a minimal example, I can put one together, but I am currently working to a tight deadline. If the above fix makes sense to you, can we close the issue?
ahh! I see the output_dict is in the for loop. Give me a moment. I will see if this fixes the issue. This should NOT be there :)
I fixed this indentation issue, but the issue remained.
@Sum02dean I'm sorry you're on a tight deadline, but the best way for me to help you is to have a reproducible example. Otherwise I need to generate some data, making assumptions and guessing what you're trying to do. Also, there may be other issues in the code that we're not seeing in this chunk.
Yes, you are right, let's see if this helps.
Below is everything you need to simulate my data and the error (minus the bells and whistles)
import os
import pandas as pd
import numpy as np
import arviz as az
import bambi as bmb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import copy
# Functions
def get_formula(feature_names):
    """Generates the formula required for the bambi generalized linear model (GLM)
    :param feature_names: extracted columns names as list of string
    :type feature_names: list
    :return: a string formula containing the GLM functional formulae
    :rtype: string
    """
    template = ['{}'] * (len(feature_names))
    template = " + ".join(template)
    template = template.format(*list(feature_names))
    f = 'y ~ ' + template
    return f
def spoof_data(n_samples=100, n_features=12, scale=1.0):
    # Generate fake classification data (n_features must match len(col_names))
    x_sim, y_sim = make_classification(n_samples=n_samples, n_features=n_features,
                                       scale=scale, shuffle=True, random_state=None)
    col_names = ['neighborhood_transferred', 'fusion', 'cooccurence', 'coexpression',
                 'coexpression_transferred', 'experiments', 'experiments_transferred',
                 'database', 'database_transferred', 'textmining',
                 'textmining_transferred', 'cogs']
    features = pd.DataFrame(x_sim)
    features.columns = col_names
    labels = pd.DataFrame(y_sim)
    features['labels'] = labels.values
    return features
def model_splits(x, y, test_ratio):
    """Splits each x and y set into train and test data respectively (NOT on COGS)
    :param x: x-data with protein names as index
    :type x: pandas.core.DataFrame
    :param y: y-labels
    :type y: iterable e.g list, or pandas.core.Series
    :param test_ratio: proportion of observations for testing
    :type test_ratio: float
    :return: train-test splits for both x-data and y-data
    :rtype: collection of pandas DataFrame objects
    """
    # Make copy
    data = copy.deepcopy(x)
    labels = copy.deepcopy(y)
    # Split the dataset using the scikit-learn implementation
    x_train, x_test, y_train, y_test = train_test_split(
        data, labels, test_size=test_ratio, shuffle=True)
    return x_train, x_test, y_train, y_test
def run_pipeline(x, params, train_ratio=0.8, n_runs=3):
    """Runs the entire modeling process
    :param x: x-data containing 'labels' and 'cogs' columns (which will be replaced later)
    :type x: pandas DataFrame object
    :param params: model hyper-parameter dictionary
    :type params: dict
    :param train_ratio: the proportion of data used for training, defaults to 0.8
    :type train_ratio: float, optional
    :param n_runs: number of random train/test splits to fit, defaults to 3
    :type n_runs: int, optional
    :return: Returns an output dict containing key information.
    :rtype: dict
    """
    print("Beginning pipeline...")
    test_ratio = 1 - train_ratio
    train_splits = []
    test_splits = []
    models = []
    classifiers = []
    predictions = []
    # Pre-allocate the datasets
    for i in range(1, n_runs + 1):
        # Random stratification
        x_train, x_test, y_train, y_test = model_splits(
            x, x.labels, test_ratio=test_ratio)
        # Drop the labels from x-train and x-test
        x_train.drop(columns=['labels', 'cogs'], inplace=True)
        x_test.drop(columns=['labels', 'cogs'], inplace=True)
        # Store all of the unique splits
        train_splits.append([x_train, y_train])
        test_splits.append([x_test, y_test])
    # CML message
    print("Complete with no errors")
    print('Done\n')
    # Train across n-unique subsets of the data
    for i in range(len(train_splits)):
        print("\nComputing predictions for sampling run {}".format(i + 1))
        # Pull out data splits
        x_train, y_train = train_splits[i]
        x_test, y_test = test_splits[i]
        # Run bambi model
        x_train['y'] = y_train.values
        # Get the function formula
        f = get_formula(x_train.columns[:-1])
        model = bmb.Model(f, x_train, family=params['family'])
        clf = model.fit(draws=params['draws'], tune=params['tune'],
                        chains=params['chains'], init='auto')
        # Run predictions
        idata = model.predict(clf, data=x_test, inplace=False)
        mean_preds = idata.posterior["y_mean"].values
        predictions.append(mean_preds)
        # Append models
        models.append(model)
        classifiers.append(clf)
    # Collect outputs (once, after the loop)
    output_dict = {
        'predictions': predictions,
        'models': models,
        'classifiers': classifiers,
        'train_splits': train_splits,
        'test_splits': test_splits
    }
    return output_dict
if __name__ == '__main__':
    # Define model parameters
    params = {
        'family': 'bernoulli',
        'chains': 3,
        'draws': 10,
        'tune': 10}
    # Generate fake data
    x = spoof_data()
    output = run_pipeline(
        x=x, params=params, n_runs=1)
    # Drop the label and 'cogs' columns so x matches the training features
    x.drop(columns=['labels', 'cogs'], inplace=True)
    # The next line causes the error
    output['models'][0].predict(idata=output['classifiers'][0], data=x, inplace=False)
Throws:
ValueError                                Traceback (most recent call last)
/mnt/mnemo5/sum02dean/sl_projects/handover/STRINGSCORE/src/scripts/nb.ipynb Cell 6' in <cell line: 13>()
      9 x = spoof_data()
     10 output = run_pipeline(
     11     x=x, params=params, n_runs=1)
---> 13 output['models'][0].predict(idata=output['classifiers'][0], data=x, inplace=False)

File ~/miniconda3/envs/string-score-2.0/lib/python3.8/site-packages/bambi/models.py:897, in Model.predict(self, idata, kind, data, draws, inplace)
    892 # 'linear_predictor' is of shape
    893 # * (chain_n, draw_n, obs_n) for univariate models
    894 # * (chain_n, draw_n, response_n, obs_n) for multivariate models
    896 if kind == "mean":
--> 897     idata.posterior = self.family.predict(self, posterior, linear_predictor)
    898 else:
    899     pps_kwargs = {
    900         "model": self,
    901         "posterior": posterior,
   (...)
    904         "draw_n": draw_n,
    905     }

File ~/miniconda3/envs/string-score-2.0/lib/python3.8/site-packages/bambi/families/univariate.py:18, in UnivariateFamily.predict(self, model, posterior, linear_predictor)
     16 # Drop var/dim if already present
     17 if name in posterior.data_vars:
---> 18     posterior = posterior.drop_vars(name).drop_dims(coord_name)
     20 coords = ("chain", "draw", coord_name)
     21 posterior[name] = (coords, mean)

File ~/miniconda3/envs/string-score-2.0/lib/python3.8/site-packages/xarray/core/dataset.py:4602, in Dataset.drop_dims(self, drop_dims, errors)
   4600 missing_dims = drop_dims - set(self.dims)
   4601 if missing_dims:
-> 4602     raise ValueError(
   4603         f"Dataset does not contain the dimensions: {missing_dims}"
   4604     )
   4606 drop_vars = {k for k, v in self._variables.items() if set(v.dims) & drop_dims}
   4607 return self.drop_vars(drop_vars)

ValueError: Dataset does not contain the dimensions: {'y_mean_obs'}
Hi, I have the same issue when I try to follow the example notebooks and adapt my pymc3 GLM. When I call predict with the fitted posterior, bambi complains that the label is not present in the data.
@Sum02dean try setting kind='pps'; this works for me. (I thought of it because of the missing variable name y_"mean"_obs.) Maybe @tomicapretto could explain why?
Edit: OK, I think I have the reason. Somewhere above the code I was running, I had a predict call with inplace=True (the default), so the data later on was the posterior predictive.
Edit 2: I think something is not right here; this shouldn't be happening. I am also using the Bernoulli family and have a few categorical variables.
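For readers unsure what the inplace flag changes, here is a toy sketch of the general in-place versus copy pattern. The `toy_predict` function and the dictionary it operates on are hypothetical stand-ins, not Bambi's implementation; the point is only that an object stored in a list is shared by reference, so an in-place call also changes what the list holds.

```python
import copy


def toy_predict(store, new_values, inplace=True):
    """Toy stand-in for a predict-style call; NOT Bambi's implementation.

    Writes `new_values` under the "y_mean" key, either into `store` itself
    (inplace=True) or into a deep copy that is returned (inplace=False).
    """
    target = store if inplace else copy.deepcopy(store)
    target["y_mean"] = list(new_values)
    return None if inplace else target


fitted = {"coefficients": [0.1, 0.2]}
results = [fitted]  # the list holds a reference, not a copy

# inplace=False leaves the stored object untouched
new_store = toy_predict(results[0], [0.3, 0.7], inplace=False)
print("y_mean" in results[0])   # False
print(new_store["y_mean"])      # [0.3, 0.7]

# inplace=True changes the object that the list also refers to
toy_predict(results[0], [0.9], inplace=True)
print(results[0]["y_mean"])     # [0.9]
```

Under this pattern, an earlier in-place call can leave extra variables on a stored object, which a later call then has to handle.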
@Sum02dean thanks for sharing the snippet, that's exactly what I needed.
Unfortunately, I couldn't reproduce the issue. The code runs without problems on my side. Could you install the development version of Bambi?
This is the output of watermark
%load_ext watermark
%watermark -n -u -v -iv -w
Last updated: Wed Apr 13 2022
Python implementation: CPython
Python version : 3.8.5
IPython version : 7.29.0
arviz : 0.11.4
bambi : 0.7.1
sys : 3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0]
pandas: 1.3.1
numpy : 1.20.0
Watermark: 2.1.0
It says Bambi 0.7.1, but it's actually the development version.
I think I found what's going on. The development version has the following chunk
while the version you're using has
I'm sorry but I didn't recall this fix.
Installing from the main branch in the repository should fix this issue.
Thank you @tomicapretto this solved the issue for me. @aegonwolf I reinstalled from main as tomicapretto recommended. First by uninstalling bambi and then running:
pip install -U git+https://github.com/bambinos/bambi.git@main
Many thanks,
Dean
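After reinstalling, a quick way to see which version string Python picks up is sketched below, using only the standard library. The `installed_version` helper is a hypothetical name introduced here. Note that, as mentioned earlier in the thread, the development build also reported 0.7.1, so the version string alone may not tell the two installs apart.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if bambi is installed, otherwise None
print(installed_version("bambi"))
```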
OS: Linux
Bambi: 0.7.1
Python: 3.8
Issue: I am receiving the above issues when using models and idata bambi objects when they are stored inside a dictionary or list. When they are not stored inside a collection, the expected behavior is observed. This is a recent issue since installing bambi 0.7.1 (my code worked previously).
With a single iteration I get a model of:
output:
And an Idata of:
When trying to make a prediction on new data with the exact same column names as the training data: Running script with the following args:
output:
Accessing the bambi specific objects via list indexing or dictionary lookup causes the issue. When not using an iterable or collection, the code works fine:
output: