BiomedSciAI / causallib

A Python package for modular causal inference analysis and model evaluations
Apache License 2.0

The parameter 'std' keeps decreasing #46

Closed R2Bb1T closed 10 months ago

R2Bb1T commented 2 years ago

Thanks for your amazing work! I tested some data with the HEMM method and the subgroup-prediction results were abnormal. On further inspection, I found that the parameter 'std' of the Gaussian distribution keeps decreasing and falls below zero, whereas it is supposed to converge to a positive value. What causes this, and how can I fix it? Is it related to parameter initialization?

ehudkr commented 2 years ago

Thank you for using causallib and taking the time to report this problem. I'll admit the models in the contrib module come with a limited warranty, but I'll do my best to assist.

First, could you please provide a minimal code example that reproduces the problem?

Second, I'll ping @chiragnagpal (Hi Chirag! 👋 🙃 ), the paper's first author and the person who implemented the model, to see if he can find the time to take a look and see what he can make of it.

R2Bb1T commented 2 years ago

Thanks for your reply and help! Here is the example code:

from causallib.contrib.hemm import HEMM
import causallib.contrib.hemm.gen_synthetic_data as gen_synthetic_data
import causallib.contrib.hemm.hemm_outcome_models as hemm_outcome_models
import numpy as np
import pandas as pd

def generate_traindata():
    d = 100
    X, T, Y, Z, mu1, mu0 = gen_synthetic_data.gen_data(n=50000, d=d)
    HTE = mu1 - mu0

    data = np.column_stack((X, Y, T, HTE, mu1, mu0, Z))

    cols = ['x' + str(i) for i in range(d+1)]

    # column names follow the stacking order above: Y, T, HTE, mu1, mu0, Z
    output_train_data = pd.DataFrame(data, columns=cols + ['Y', 'T', 'HTE', 'Y1', 'Y0', 'Z'])
    return output_train_data

train_data = generate_traindata()
# D_in, K, bc, lamb, mu, std, and response are assumed to be defined above;
# as noted below, K=2 and batch_size=30, and the remaining parameters keep their defaults.
hemm = HEMM(
    D_in=D_in, K=K, bc=bc, lamb=lamb, mu=mu, std=std, response=response,
    metric='AuROC',
    outcome_model=hemm_outcome_models.genMLPModule(D_in=D_in, H=2, out=2),
)
losses = hemm.fit(train_data.iloc[:, :100].values, train_data['T'].values, train_data['Y'].values)

K=2, batch_size=30; the other parameters remain at their defaults. I think it may be an overfitting problem, but I still don't understand why 'std' keeps decreasing and falls below zero.
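To illustrate the overfitting hunch, here is a minimal standalone sketch (plain PyTorch, not the causallib/HEMM code, and only my assumption about the mechanism): when the residuals are driven to roughly zero, the Gaussian negative log-likelihood is dominated by log(std), its gradient 1/std blows up as std shrinks, and an unconstrained std can be stepped straight through zero.

import torch

# "perfectly fitted" data: residuals are exactly zero, so the NLL reduces to log(std)
x = torch.zeros(100)
mu = torch.zeros(100)
std = torch.tensor(1.0, requires_grad=True)  # unconstrained scale parameter

opt = torch.optim.SGD([std], lr=0.05)
for step in range(200):
    opt.zero_grad()
    nll = (torch.log(std) + (x - mu) ** 2 / (2 * std ** 2)).mean()
    nll.backward()
    opt.step()
    if std.item() <= 0:
        print(f"std crossed zero at step {step}: {std.item():.4f}")
        break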

chiragnagpal commented 2 years ago

Hi @ehudkr, thanks for connecting me! It's been good, just trying to finish wrapping up my thesis and move on to newer things :)

@R2Bb1T I looked at the code, and it indeed seems like the model doesn't constrain the std variable to be positive. One simple way around this is to add a ReLU activation on the std parameter.

I can try pushing that fix to the code, but it's probably faster for you to patch it on your end than to wait for it to appear in the next release.
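For anyone hitting this before a new release, here is a minimal sketch of that kind of patch (a toy module, not the actual HEMM class; `raw_std` is a hypothetical name for wherever the unconstrained parameter lives): pass the raw parameter through relu, or softplus if you want it strictly positive, before it enters the likelihood.

import torch
import torch.nn.functional as F

class GaussianComponent(torch.nn.Module):
    """Toy mixture component: stores an unconstrained raw_std and exposes a
    non-negative std for the Gaussian log-likelihood."""
    def __init__(self, dim):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(dim))
        self.raw_std = torch.nn.Parameter(torch.ones(dim))  # unconstrained

    @property
    def std(self):
        return F.relu(self.raw_std)          # the one-line relu constraint
        # return F.softplus(self.raw_std)    # smooth, strictly positive alternative

    def log_prob(self, x):
        std = self.std + 1e-6  # small floor so log/div stay finite if relu outputs 0
        return -torch.log(std) - (x - self.mu) ** 2 / (2 * std ** 2)

The softplus variant avoids the gradient dying when relu clips to exactly zero; either way, the value entering the likelihood can no longer go negative.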

R2Bb1T commented 1 year ago

@chiragnagpal Thanks for your reply! The fix you suggested worked, and the ITE results now look right. However, the predicted subgroups do not match the true subgroups very well. I set d=2 to replicate the experiment in the paper, and I found that even with all hyperparameters kept the same, repeating the run on similarly generated random data gives very different results. The visualization varies a lot: sometimes it doesn't even form a circle, and when it does, the center and radius seem off. I also calculated the AUC between the predicted subgroup 'z' and the true 'Z', and it ranges from 0.5 to 0.7. Did I do something wrong in my procedure, or is this a real problem with the model?
This time I changed d to 2, n to 1000, batch_size to 10, lamb to 0.1, and the threshold of posen and negen to 0.25.
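For reference, the AUC check can be done with scikit-learn; a minimal sketch is below, where `z_proba` is only a placeholder for whatever soft subgroup-membership score the fitted model exposes (the exact retrieval call depends on the HEMM version, so it is not shown), and `Z_true` is the ground-truth subgroup column from the synthetic data.

import numpy as np
from sklearn.metrics import roc_auc_score

Z_true = train_data['Z'].values        # ground-truth subgroup labels from the generator
z_proba = np.random.rand(len(Z_true))  # placeholder: replace with the model's soft subgroup scores

# Mixture-component labels are arbitrary, so a score below 0.5 may just mean the
# components are flipped; reporting max(auc, 1 - auc) accounts for that.
auc = roc_auc_score(Z_true, z_proba)
print(max(auc, 1.0 - auc))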