bambinos / bambi

BAyesian Model-Building Interface (Bambi) in Python.
https://bambinos.github.io/bambi/
MIT License

[Help needed] model/formula for differences in deviation? #233

Open OlafHaag opened 4 years ago

OlafHaag commented 4 years ago

Hi! There's no designated help/community area for bambi, so I'm seeking help here. I hope that, in line with the contribution guidelines, this can be viewed as indirectly improving the documentation (docs) or examples (bambi/examples). I'm new to statistical modeling with Python and PyMC3, and I've repeatedly tried to build a hierarchical model in PyMC3 without making progress, so I hope bambi will simply solve my newbie problems. I'm trying to infer estimates about the deviation in my data across blocks and 2 measurements, while the mean is always 0. What would a formula for this look like, given that I'm not describing the individual data points but their distribution?

I have a 3 x (3) mixed design: 3 conditions as a BETWEEN factor and 3 blocks as a WITHIN factor. Participants (users) are assigned to different conditions, and all conditions have 3 blocks with multiple measurements (trials) in each block. There are 2 dependent variables: parallel & orthogonal. Since the 'parallel' and 'orthogonal' measures have the same unit, they can be viewed as 1 outcome called 'projection' measured in 2 different directions.

My interest here is in the VARIABILITY (e.g. standard deviation or variance) of the parallel & orthogonal outcomes in blocks 1, 2, and 3. The analysis should allow estimates of differences in trial-to-trial deviation of the outcome due to condition, block, direction, or their interaction, e.g. 'Is there a main effect of direction? Is the deviation in the orthogonal direction greater in block 2 than in blocks 1 or 3? Is the difference between orthogonal and parallel deviations greater or smaller in block 2?'

Users may have different intercepts and different slopes, but these are random effects. I also don't care about individual trials, only about overall trial-to-trial variability.

Bonus: Each user has an individual level of experience in the task. Users rate the difficulty of each block they performed.

Neglecting the different conditions, the standard deviation in the different blocks by direction could look like any of these examples (or something else entirely). i stands for intercept, b for slope, and the numbers stand for blocks (1, 2, 3), so i13 means the intercept of blocks 1 and 3.

    Block x Direction v1    Block x Dir. v2       Block x Direction v3
    sigma                   sigma                 sigma          
    ^                       ^                     ^              
    |      / b13            |      / b13          |      / b1    
    |     /                 |     /               |     /        
    |    /                  |    /                |    /         
  i2|___/___ b2             |   /                 |   /          
    |  /                    |  /                  |  /           
    | /                     | /                   | /            
 i13|/                  i123|/______ b2       i123|/______ b23   
    |                       |                     |              
    |---------->            |---------->          |---------->   
    0      1                0      1              0      1       
    ortho  parallel         ortho  parallel       ortho  parallel

The mean of the data for each direction (orthogonal/parallel) in each block is 0, which means that the mean of the squared data equals the variance of the data (see the quick check right below). If I have to square the data to make it work with the formula, I probably also have to log-transform it; in that case I could use Normal distributions to approximate the transformed data points. Otherwise, deviations are obviously positive-only values, so I'd have to change the family of the priors.
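
To make that explicit, here is a quick self-contained check (illustrative only, not my real data) that for zero-mean data the mean of the squares equals the population variance:

import numpy as np

x = np.random.normal(0.0, 2.0, size=10_000)
x = x - x.mean()        # center so the mean is exactly 0
print(np.mean(x**2))    # mean of the squared data ...
print(np.var(x))        # ... equals the (population) variance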

Some mock-up data:

# %%
import numpy as np
import pandas as pd
from bambi import Model, Prior

# %%
n_trials = 30  # Approximate number of trials per block (blocks are assigned at random below).

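# Users 0 and 1 (condition A): parallel and orthogonal deviations of roughly
# similar magnitude, i.e. no strong built-in effect.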
user0 = pd.DataFrame({'user': 0,
                      'condition': 'A',
                      'experience': 0.0,
                      'block': 1 + np.random.randint(3, size=3*n_trials),
                      'parallel': np.random.normal(0.0, 5.5, size=3*n_trials),
                      'orthogonal': np.random.normal(0.0, 5.5, size=3*n_trials)})

user1 = pd.DataFrame({'user': 1,
                      'condition': 'A',
                      'experience': np.nan,
                      'block': 1 + np.random.randint(3, size=3*n_trials),
                      'parallel': np.random.normal(0.0, 4.5, size=3*n_trials),
                      'orthogonal': np.random.normal(0.0, 5.5, size=3*n_trials)})

# %%
# Create some data conforming to a main effect of direction: variance in parallel > orthogonal.
user2 = pd.DataFrame({'user': 2, 
                      'condition': 'B',
                      'experience': 4.0,
                      'block': 1 + np.random.randint(3, size=3*n_trials),
                      'parallel': np.random.normal(0.0, 7.0, size=3*n_trials),
                      'orthogonal': np.random.normal(0.0, 3.3, size=3*n_trials)})

user3 = pd.DataFrame({'user': 3, 
                      'condition': 'B',
                      'experience': 3.0,
                      'block': 1 + np.random.randint(3, size=3*n_trials),
                      'parallel': np.random.normal(0.0, 8.0, size=3*n_trials),
                      'orthogonal': np.random.normal(0.0, 4.5, size=3*n_trials)})

user4 = pd.DataFrame({'user': 4, 
                      'condition': 'B',
                      'experience': 3.0,
                      'block': 1 + np.random.randint(3, size=3*n_trials),
                      'parallel': np.random.normal(0.0, 5.0, size=3*n_trials),
                      'orthogonal': np.random.normal(0.0, 1.5, size=3*n_trials)})

# %%
# Create some data conforming to interaction effect version 1:
# blocks 1&3 have smallest orthogonal deviations, but largest parallel deviations. Strong synergy.
user5 = pd.DataFrame({'user': 5,
                      'condition': 'C',
                      'experience': 0.0,
                      'block': 1 + np.random.randint(3, size=3*n_trials),
                      'parallel': np.random.normal(0.0, 7.0, size=3*n_trials),
                      'orthogonal': np.random.normal(0.0, 3.0, size=3*n_trials)})
# Block 2 has smaller parallel deviations than blocks 1&3, but higher orthogonal deviations than blocks 1&3.
# In this case sigma_block2_parallel - sigma_block2_orthogonal = 0
# No synergy.
user5.loc[user5['block'] == 2, ['parallel', 'orthogonal']] = np.random.normal(0.0, 5.0, size=((user5['block']==2).sum(), 2))

user6 = pd.DataFrame({'user': 6,
                      'experience': 4.0,
                      'condition': 'C',
                      'block': 1 + np.random.randint(3, size=3*n_trials),
                      'parallel': np.random.normal(0.0, 6.5, size=3*n_trials),
                      'orthogonal': np.random.normal(0.0, 2.5, size=3*n_trials)})
user6.loc[user6['block'] == 2, ['parallel', 'orthogonal']] = np.random.normal(0.0, 4.5, size=((user6['block']==2).sum(), 2))
# %%
df = pd.concat((user0, user1, user2, user3, user4, user5, user6), axis='index')
df.sort_values(by=['user', 'block'], inplace=True)
df.reset_index(drop=True, inplace=True)
# Make sure the mean within each user x block is zero, as with the real data.
df[['parallel', 'orthogonal']] = df.groupby(['user', 'block'])[['parallel', 'orthogonal']].transform(lambda x: x - x.mean())

# %%
# Give each user x block combination its own ID.
df['block_id'] = pd.factorize(df['user'].astype(str) + df['block'].astype(str))[0]
# Trials.
df['trial'] = df.groupby(['block_id'])['block'].transform(lambda x: np.arange(len(x)))
# Categorical columns.
df[['user', 'condition', 'block_id', 'block']] = df[['user', 'condition', 'block_id', 'block']].astype('category')
# Give each block_id a rating.
df['rating'] = df['block_id'].map(pd.Series(index=df['block_id'].unique(), data=np.random.randint(5, size=df['block_id'].cat.categories.size)))

# %%
# Thin out rows to simulate missing data.
mask = np.random.choice([*[True]*4, False], size=df.shape[0])
df = df[mask]

# %%
# Reshape data into long format.
samples = df.melt(id_vars=['user', 'experience', 'condition', 'block_id', 'block', 'rating', 'trial'],
                  value_vars=['parallel', 'orthogonal'],
                  var_name='direction', value_name='projection')
samples['direction'] = samples['direction'].astype('category')
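
The per-cell trial-to-trial deviations I'm asking about, and the squared/log-transformed outcome mentioned above, can be computed directly from the reshaped data (illustrative only; the new column names are just placeholders):

# %%
# Empirical deviation per condition x block x direction cell, i.e. the
# quantities I'd like the model to make inferences about.
print(samples.groupby(['condition', 'block', 'direction'])['projection'].std())

# Squared (and log-transformed) outcome as described above; because the
# per-block means are 0, the mean of 'sq_projection' in a cell is that
# cell's variance.
samples['sq_projection'] = samples['projection']**2
samples['log_sq_projection'] = np.log(samples['sq_projection'])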

I believe in R (lme4) it would be similar to \sigma_{projection} \sim (condition * block * direction)/(1|trial) + (1|user), although I'm not sure whether that means only random intercepts for user without random slopes (I do want random slopes by user). The slash means the trials are nested.
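
In Bambi terms, my best guess would be something along these lines. This is a sketch only: it assumes a Bambi version that lets you put a second formula on sigma via bmb.Formula (a distributional model), and the right-hand sides are exactly what I'm unsure about.

# Sketch only: assumes a Bambi release with distributional models, i.e. a
# second formula targeting sigma. Whether the terms below express the
# design correctly is exactly my question.
import bambi as bmb

formula = bmb.Formula(
    "projection ~ 1",  # the mean is ~0 by design, so just an intercept
    "sigma ~ condition * block * direction + (direction | user)",  # trial-to-trial deviation
)
model = bmb.Model(formula, samples, family="gaussian")
idata = model.fit()

Presumably 'experience' and 'rating' could then enter the sigma formula as additional covariates, but I don't know whether this captures the nesting of trials within blocks, or whether the (direction | user) part gives me the random slopes by user that I want.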

Any help is appreciated; I'm not familiar with this.

tomicapretto commented 3 years ago

Hi, sorry for the delayed reply. Were you able to solve this problem? If not, I'll try to help you.

tomicapretto commented 1 year ago

Haven't heard anything about this issue in a while.