amarquand / PCNtoolkit

Toolbox for normative modelling and spatial inference of neuroimaging data. https://pcntoolkit.readthedocs.io/en/latest/
GNU General Public License v3.0

Odd metrics from normative.estimate #205

Closed MattFill closed 6 months ago

MattFill commented 6 months ago

Thank you to the PCN Toolkit team for this wonderful normative modeling resource.

I am getting unexpected metric results from my HBR model using pcntoolkit==0.29. I am adapting the HBR FCON tutorial (https://pcntoolkit.readthedocs.io/en/latest/pages/HBR_NormativeModel_FCONdata_Tutorial.html) to apply the HBR approach to some UK Biobank data. However, I am getting abnormally low EXPV estimates and negative Rho estimates from my model, which uses Age and Sex covariates to model volumetric brain phenotypes. My code, adapted directly from the tutorial, follows:

idps = ['Volume of grey matter (normalised for head size)']
X = df_hc[['Age_T2','Sex_T0']].to_numpy(dtype=float)
Y = zscore(df_hc[idps].to_numpy(dtype=float))
batch_effects = df_hc[['site']].to_numpy(dtype=int)

# Split the data into train and test sets
X_train, X_test, Y_train, Y_test, batch_effects_train, batch_effects_test = train_test_split(
    X, Y, batch_effects, test_size=0.2, random_state=42, shuffle=True
)

# Save train files
with open('X_train.pkl', 'wb') as file:
    pickle.dump(pd.DataFrame(X_train), file)
with open('Y_train.pkl', 'wb') as file:
    pickle.dump(pd.DataFrame(Y_train), file)
with open('trbefile.pkl', 'wb') as file:
    pickle.dump(pd.DataFrame(batch_effects_train), file)

# Save test files
with open('X_test.pkl', 'wb') as file:
    pickle.dump(pd.DataFrame(X_test), file)
with open('Y_test.pkl', 'wb') as file:
    pickle.dump(pd.DataFrame(Y_test), file)
with open('tsbefile.pkl', 'wb') as file:
    pickle.dump(pd.DataFrame(batch_effects_test), file)

# a simple function to quickly load pickle files
def ldpkl(filename: str):
    with open(filename, 'rb') as f:
        return pickle.load(f)

respfile = os.path.join(processing_dir, 'Y_train.pkl')        # response measures (e.g. cortical thickness) of the training samples (columns: features/ROIs, rows: observations/subjects)
covfile = os.path.join(processing_dir, 'X_train.pkl')         # covariates (e.g. age) of the training samples (columns: covariates, rows: observations/subjects)

testrespfile_path = os.path.join(processing_dir, 'Y_test.pkl')       # response measures for the test samples
testcovfile_path = os.path.join(processing_dir, 'X_test.pkl')        # covariates for the test samples

trbefile = os.path.join(processing_dir, 'trbefile.pkl')      # training batch-effects file (e.g. scanner_id, gender) (columns: batch effects, rows: observations/subjects)
tsbefile = os.path.join(processing_dir, 'tsbefile.pkl')      # test batch-effects file

output_path = os.path.join(processing_dir, 'Models/')    # output path, where the fitted models will be written
log_dir = os.path.join(processing_dir, 'log/')           # log path, where run logs will be written
if not os.path.isdir(output_path):
    os.mkdir(output_path)
if not os.path.isdir(log_dir):
    os.mkdir(log_dir)

outputsuffix = '_estimate'      # a string appended to the output file names; it is only for your own bookkeeping, so adapt it to your needs

ptk.normative.estimate(
                        # Algorithm
                        alg='hbr',
                        # Paths to data
                        covfile=covfile,
                        respfile=respfile,
                        tsbefile=tsbefile,
                        trbefile=trbefile,
                        # Paths to output
                        output_path=output_path,
                        testcov=testcovfile_path,
                        testresp=testrespfile_path,
                        log_path=log_dir,
                        outputsuffix=outputsuffix,
                        # Additional args
                        binary=True,
                        savemodel=True,
                        standardize=False,
                        cores=4
                      )
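
To inspect the fit, the evaluation metrics can be read back from the files written by estimate(). A minimal sketch, assuming the FCON tutorial's naming convention of <metric> + outputsuffix (pickled here because binary=True) and that the outputs land in the working directory:

# Hypothetical read-back of the evaluation metrics; adjust the paths if your
# outputs are written elsewhere.
EXPV = ldpkl('EXPV' + outputsuffix + '.pkl')   # explained variance, one value per response variable
Rho = ldpkl('Rho' + outputsuffix + '.pkl')     # correlation between predicted and observed responses
print(EXPV, Rho)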

The EXPV of this model is -0.000212 and the Rho is -0.134802. These values are unexpected given the strong association of Age and Sex with brain volume phenotypes in this dataset (R2 of 0.25). Furthermore, I obtain similarly discrepant results using the 'blr' and 'gpr' models.
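
For context, that baseline association can be sanity-checked with an ordinary linear model on the same split (a minimal sketch using scikit-learn; not part of the normative pipeline above):

from sklearn.linear_model import LinearRegression

# Plain least-squares fit of Age/Sex against the (z-scored) response, just to
# confirm the covariates carry signal on this particular train/test split.
baseline = LinearRegression().fit(X_train, Y_train)
print('baseline linear R2 on the test split:', baseline.score(X_test, Y_test))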

Any insight into the cause and fix of these unexpected results would be greatly appreciated.

amarquand commented 6 months ago

Hi, it is hard to help you without more details. I think there must be a problem with the data you are entering (for instance, the rows in the covariate and response files may not align properly).
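
A quick check along these lines can rule out misaligned or missing rows (a minimal sketch reusing the ldpkl helper and file paths defined above):

# Covariates, responses and batch effects must have the same number of rows,
# in the same subject order, with no NaNs (e.g. zscore can propagate NaNs).
X_chk, Y_chk, be_chk = ldpkl(covfile), ldpkl(respfile), ldpkl(trbefile)
print(X_chk.shape, Y_chk.shape, be_chk.shape)                 # row counts must match
print(X_chk.isna().sum().sum(), Y_chk.isna().sum().sum())     # any NaNs here are a red flag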

I assume you are running this against UKB IDP 25005-2.0. I just ran the same test using version 0.29 and I get an explained variance of 0.44, so I don't think this is a problem with the toolbox.

Please carefully check the data you are putting in. I will close this issue for now.

amarquand commented 6 months ago

BTW, I ran this with BLR

MattFill commented 6 months ago

Thanks for the follow-up. Yes, I'm using IDP 25005-2.0. I believe I've found a fix/workaround for my case. If I standardize and round the covariates and response variables, the function returns expected metrics:

X = zscore(df_hc[['Age_T2','Sex_T0']]).round(3)
Y = zscore(df_hc[idps].to_numpy(dtype=float)).round(3)
batch_effects = df_hc[['site']].to_numpy(dtype=int)

Perhaps it was a scale or numerical stability issue?
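
If so, note that the responses were already z-scored in the original snippet, so the change that seems to matter is standardising the covariates (plus the rounding). A quick look at their raw ranges shows the mismatch in magnitude (sketch; assumes Sex_T0 is coded 0/1):

# Raw covariate ranges before standardisation (Age in years, Sex binary),
# compared against the already z-scored response.
print(df_hc[['Age_T2', 'Sex_T0']].describe().loc[['min', 'max', 'mean', 'std']])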