[ENH] Document how multisite harmonization might work

mattcieslak commented 1 year ago

The process of getting data into the h5 format is the same as converting it into the format needed for the multi-site harmonization package. We should consider either writing a function that applies the harmonizer or document/provide a script that shows how to do it

smeisler commented 1 year ago

Here is some code that might be useful:

import nibabel as nib
import numpy as np
import os.path as op
from neuroHarmonize import harmonizationLearn
import pandas as pd
import h5py
from collections import defaultdict

# Specify important paths
bids = '/nese/mit/group/sig/projects/hbn/hbn_bids/'
freesurfer_path = op.join(bids,'derivatives','freesurfer_7.3.2')
cohort_path = op.join(bids,'code','cohort.csv')

# Load dataframe with phenotypic data
# One of these columns must be called "SITE" and contain site numbers/names
cohort = pd.read_csv(cohort_path)
cohort = cohort.set_index('Subject')
subs = np.asarray(cohort.index)

# Extract only covariates important for harmonization
# Besides "SITE" must be all numeric (so no 'M'/'F' sex, for example)
harmonization_keys = ['SITE', 'Age', 'Sex']
dataframe_harmonization = cohort[covar_keys]

# Define data to collect
smoothing = '10' # mm FWHM
metrics = ['area','volume','thickness']

# Collect morphometric data, run harmonization
for metric in metrics:
    print('Metric:',metric)
    print('\t Gathering data across subjects')
    # Make matrix of concatenated lh and rh morphometric values across all subjects (N_subs X N_vertex)
    # This is a long one-liner...
    morph_data_matrix = np.asarray([np.concatenate((nib.MGHImage.from_filename(op.join(freesurfer_path,f"{sub}/surf/lh.{metric}.fwhm{smoothing}.fsaverage.mgh")
                                                   ).get_fdata()[:,0,0],
                                                   (nib.MGHImage.from_filename(op.join(freesurfer_path,f"{sub}/surf/rh.{metric}.fwhm{smoothing}.fsaverage.mgh")
                                                   ).get_fdata()[:,0,0]))) for sub in subs])

    # Remove columns that contain 0s (usually around medial wall)
    columns_without_0s = [0 not in col for col in np.transpose(morph_data_matrix)]
    morph_data_matrix_no0 = morph_data_matrix[:,columns_without_0s]
    # Save out zero-indexes for later
    zero_inds = np.where(np.asarray(columns_without_0s)==False)[0]
    np.save(op.join(bids,'code',f'{metric}_vertex_indexes'), zero_inds)

    # Harmonize the data (smooth term for age)
    print('\t Harmonizing the data')
    model, morph_data_matrix_adj = harmonizationLearn(morph_data_matrix_no0, dataframe_harmonization, smooth_terms=['Age'])

    # Save as H5 for ModelArray
    print('\t Saving harmonized data')
    output_h5_path = op.join(bids,'code', f'{metric}.h5')
    # Initialize H5 inputs
    scalars = defaultdict(list)
    sources_lists = defaultdict(list)
    # Populate H5 file
    scalars[metric] = morph_data_matrix_adj # morphometric data
    sources_lists[metric] = range(len(scalar_data)) # don't know what would count as "source" in this case... so just put index vector
    # Write the output
    f = h5py.File(output_h5_path, "w")
    for scalar_name in scalars.keys(): 
            one_scalar_h5 = f.create_dataset('scalars/{}/values'.format(scalar_name),
                             data=np.row_stack(scalars[scalar_name]))
            one_scalar_h5.attrs['column_names'] = list(sources_lists[scalar_name])  # column names: list of source .mif filenames
    f.close()

Note that before harmonization I removed columns with zeros so harmonization wouldn't crash. I add them back as NaNs after running modelarray such that the length of modelarray outputs match what is expected by the fsaverage surface meshes.

smeisler commented 1 year ago

To write out ModelArray outputs as .mgh (using R, this went in the same script I used to invoke ModelArray)

## Import packages
library(ModelArray) # to run modelarray
library(RcppCNPy) # to load the zero-indexes .npy file from above
library(berryFunctions) # for the InsertRows function, used to add NaNs where the zero-indexes were
library(freesurferformats) # FS I/O functions

# Run model, assume that phenotypes, metric, formula, etc... were defined before hand
mygam <- ModelArray.gam(formula, data = modelarray, phenotypes = phenotypes, scalar = metric, element.subset = NULL,
                          correct.p.value.smoothTerms = c("fdr"),
                          correct.p.value.parametricTerms = c("fdr"),
                          changed.rsq.term.index = c(1),
                          full.outputs = TRUE, 
                        n_cores = 32, pbar = TRUE, verbose = TRUE)

# Get 1-p_values for all p_value columns
one_minus_p <- 1 - mygam[, grep("p.value", names(mygam))]
# Append ".omp" to column name to indicate one minus P
colnames(one_minus_p) <- paste(colnames(one_minus_p),"omp",sep=".")
# Add 1-p columns to original dataframe
df <- cbind(mygam, one_minus_p)
# Sort columns alphabetically
df <- df[ , order(names(df))]

# Add NaN rows back into dataframe, this is important since fsaverage assumes a set number of vertices
# "zero_ind_path" points to a .npy file that contains a vector with the columns that contained zero from the code in the earlier comment.
zero_inds <- npyLoad(zero_ind_path, type='integer', dotranspose=TRUE) + 1 # +1 because python is zero-indexed
df <- insertRows(df, zero_inds , new = NA)

## Write out columns as .mgh format
# Make output directory if not exist
dir.create(file.path(outdir,analysis_name))
# Save each column of dataframe as separate mgh file
for (col in names(df)) {
    df_col <- df[col]
    # Separate into lh and rh (first and second half of rows)
    df_col_lh <- df_col[1:(nrow(df_col)/2),]
    df_col_rh <- df_col[(nrow(df_col)/2 + 1): nrow(df_col),]
    outfile_lh <- file.path(outdir,analysis_name,paste0('lh.',col,'.mgh'))
    outfile_rh <- file.path(outdir,analysis_name,paste0('rh.',col,'.mgh'))
    # Write out lh and rh as morphological files
    write.fs.morph(outfile_lh, df_col_lh)
    write.fs.morph(outfile_rh, df_col_rh)
    }

zhao-cy commented 1 year ago

Thanks Steven!!! This is very clear!

zhao-cy commented 1 year ago

Per @mattcieslak, similar idea could be done for software that generates csv files, and we can convert that as HDF5 format for input for ModelArray. For example, output bundle profile csv from pyafq

zhao-cy commented 1 year ago

Ted (during informatics team meeting on 2/27/2023): will ask Taki re: harmonization tools usually require what kind of input format? We will follow up with Ted regarding this.

smeisler commented 1 year ago

Neuroharmonize has an API for niftis but can accept any numpy array (columns might represent voxels, vertices, or anything else)

zhao-cy commented 1 year ago

Thanks @smeisler ! We are just wondering the input format requirement for other popular harmonization tools, so that we can support more than just Neuroharmonize (i.e., one new converter for several harmonization tools, if possible). But I probably won't have bandwidth for this new feature for now; will re-visit this issue later.

PennLINC / ConFixel

[ENH] Document how multisite harmonization might work #26