tsalo opened this issue 5 years ago
If we're interested in getting z-scores for the betas, then I think the following is an appropriate replacement for computefeats2:
import numpy as np
from scipy import stats


def t_to_z(t_values, dof):
    """Convert t-statistics to z-statistics.

    From Vanessa Sochat's TtoZ package.
    """
    # Select just the nonzero voxels
    nonzero = t_values[t_values != 0]
    # We will store our results here
    z_values = np.zeros(len(nonzero))
    # Split the data into values <= 0 and values > 0
    c = np.zeros(len(nonzero))
    k1 = (nonzero <= c)
    k2 = (nonzero > c)
    t1 = nonzero[k1]
    t2 = nonzero[k2]
    # Calculate p-values for t <= 0 and convert to z-values
    p_values_t1 = stats.t.cdf(t1, df=dof)
    z_values_t1 = stats.norm.ppf(p_values_t1)
    z_values_t1[np.isinf(z_values_t1)] = stats.norm.ppf(1e-16)
    # Calculate p-values for t > 0 and convert to z-values
    p_values_t2 = stats.t.cdf(-t2, df=dof)
    z_values_t2 = -stats.norm.ppf(p_values_t2)
    z_values_t2[np.isinf(z_values_t2)] = -stats.norm.ppf(1e-16)
    z_values[k1] = z_values_t1
    z_values[k2] = z_values_t2
    # Return an array with the same shape as the input
    out = np.zeros(t_values.shape)
    out[t_values != 0] = z_values
    return out
def get_coeffs_and_tstats(X, y):
    """Fit each row of `y` against the design matrix `X` and return the
    regression slopes and corresponding z-statistics.

    From https://stackoverflow.com/a/42677750/2589328
    """
    assert X.ndim == 2
    if y.ndim != 2:
        y = y[None, :]
    assert X.shape[0] == y.shape[-1]
    y = y.T
    # Demean the design and the data (without modifying the inputs in place)
    X = X - np.mean(X, axis=0)
    y = y - np.mean(y, axis=0)
    slopes, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    # Degrees of freedom: n_observations - (n_regressors + intercept)
    df = X.shape[0] - (X.shape[1] + 1)
    pred = np.dot(X, slopes)
    # Mean squared error: one value per dependent variable (column of y)
    MSE = np.sum((y - pred) ** 2, axis=0) / df
    # Coefficient variances: (n_DVs x n_IVs)
    var_b = np.dot(np.atleast_2d(MSE).T,
                   np.atleast_2d(np.linalg.inv(np.dot(X.T, X)).diagonal()))
    sd_b = np.sqrt(var_b)
    slopes = slopes.T
    tstats = slopes / sd_b
    zstats = t_to_z(tstats, df)
    return slopes, zstats
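For context, here is roughly how I'd expect these to be called; the shapes and data are made up purely for illustration, and this snippet assumes the two functions above (plus their imports) are already defined:

import numpy as np

rng = np.random.RandomState(0)
n_trs, n_components, n_voxels = 200, 5, 100

X = rng.randn(n_trs, n_components)   # design / mixing matrix (T x C)
y = rng.randn(n_voxels, n_trs)       # data (S x T)

slopes, zstats = get_coeffs_and_tstats(X, y)
print(slopes.shape, zstats.shape)    # (100, 5) (100, 5): voxels x components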
The problem with my proposed approach above is that test statistics for parameter estimates seem to be meaningless when the degrees of freedom are low. The t-statistics end up being grossly inflated (often in the millions to billions with dof = 1), and the z-statistics then end up being binarized at +/- 8 (the max value).
Here are a couple of possible solutions:
np.arctanh(r) just makes the correlation coefficient normally distributed, but we later compare those values to z-value thresholds, so we definitely need the test statistics. We would need to do np.arctanh(r) * np.sqrt(n_trs - 3) for it to make any sense. Could you elaborate on point 1? If it's good but computationally intensive, we can see if we're clever enough to find a way around that problem.
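To make the np.arctanh(r) * np.sqrt(n_trs - 3) step concrete, it would be something like this (a minimal sketch; fisher_r_to_z_stat is just an illustrative name, not an existing function):

import numpy as np


def fisher_r_to_z_stat(r, n_trs):
    """Convert correlation coefficients to approximate z-statistics.

    np.arctanh(r) is the Fisher r-to-z transform, which is roughly normal
    with standard error 1 / sqrt(n - 3), so multiplying by sqrt(n - 3)
    yields a z-statistic that can be compared against z thresholds.
    """
    r = np.clip(r, -0.999, 0.999)  # avoid infinite values at |r| = 1
    return np.arctanh(r) * np.sqrt(n_trs - 3)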
I believe that it would look something like this [completely untested code]:
import numpy as np
from scipy import stats
from scipy.special import ndtri

# Assumes tedana's existing regression helper is importable
# (defined in tedana/model/fit.py at the time of writing)
from tedana.model.fit import get_coeffs


def computefeats2(data, mmix, mask, n_iters=1000):
    """
    Convert `data` to component space using `mmix`.

    Parameters
    ----------
    data : (S x T) array_like
        Input data
    mmix : (T [x C]) array_like
        Mixing matrix for converting input data to component space, where `C`
        is components and `T` is the same as in `data`
    mask : (S,) array_like
        Boolean mask array
    n_iters : int, optional
        Number of iterations to use to build null distributions. Default is 1000.

    Returns
    -------
    data_Z : (S x C) :obj:`numpy.ndarray`
        Voxel-wise Z-statistics for each component
    """
    if data.ndim != 2:
        raise ValueError('Parameter data should be 2d, not {0}d'.format(data.ndim))
    elif mmix.ndim != 2:
        raise ValueError('Parameter mmix should be 2d, not '
                         '{0}d'.format(mmix.ndim))
    elif mask.ndim != 1:
        raise ValueError('Parameter mask should be 1d, not {0}d'.format(mask.ndim))
    elif data.shape[0] != mask.shape[0]:
        raise ValueError('First dimensions (number of samples) of data ({0}) '
                         'and mask ({1}) do not match.'.format(data.shape[0],
                                                               mask.shape[0]))
    elif data.shape[1] != mmix.shape[0]:
        raise ValueError('Second dimension (number of volumes) of data ({0}) '
                         'and first dimension of mmix ({1}) do not '
                         'match.'.format(data.shape[1], mmix.shape[0]))

    # z-score inputs to regression
    data_z = stats.zscore(data[mask], axis=-1)
    mmix_z = stats.zscore(mmix, axis=0)

    # get betas of `data`~`mmix`
    # (S x C)
    data_betas = get_coeffs(data_z, mmix_z, mask=None)

    # Build null distributions of betas by permuting each component's
    # time series and refitting the model
    betas_null = np.zeros(data_betas.shape + (n_iters,))  # S x C x n_iters
    for i_iter in range(n_iters):
        for j_comp in range(mmix.shape[1]):
            temp_mmix = mmix_z.copy()
            idx = np.random.RandomState(seed=i_iter).permutation(mmix.shape[0])
            temp_mmix[:, j_comp] = temp_mmix[idx, j_comp]
            temp = get_coeffs(data_z, temp_mmix, mask=None)
            betas_null[:, j_comp, i_iter] = temp[:, j_comp]

    # Get z-values for betas using the empirical nulls
    p_values = null_to_p(data_betas, betas_null, tail='two')
    signs = np.sign(data_betas - np.mean(betas_null, axis=-1))
    data_Z = p_to_z(p_values, tail='two')
    data_Z *= signs
    return data_Z
def null_to_p(test_value, null_array, tail='two'):
    """
    Return p-value for a test value against a null distribution.

    Parameters
    ----------
    test_value : :obj:`float`, :obj:`int`, or array_like
        Either a single value to compare to a 1D array or an ND array to
        compare to an N+1D array.
    null_array : array_like
        An array with one more dimension than test_value. The last dimension
        should be the permutations.
    tail : {'two', 'upper', 'lower'}
        Tails to use for the calculation of p-values.

    Returns
    -------
    p_value : :obj:`float` or array_like
        Either a single p-value or an array with the same shape as test_value
    """
    if isinstance(test_value, np.ndarray):
        # Recurse over the flattened array, one value/null pair at a time
        assert null_array.ndim == (test_value.ndim + 1)
        values_1d = np.reshape(test_value, test_value.size)
        nulls_2d = np.reshape(null_array, (test_value.size, null_array.shape[-1]))
        p_values = np.zeros(values_1d.shape)
        for idx in range(values_1d.shape[0]):
            p_values[idx] = null_to_p(values_1d[idx], nulls_2d[idx, :], tail=tail)
        p_values = np.reshape(p_values, test_value.shape)
        return p_values

    if tail == 'two':
        p_value = (50 - np.abs(stats.percentileofscore(
            null_array, test_value) - 50.)) * 2. / 100.
    elif tail == 'upper':
        p_value = 1 - (stats.percentileofscore(null_array, test_value) / 100.)
    elif tail == 'lower':
        p_value = stats.percentileofscore(null_array, test_value) / 100.
    else:
        raise ValueError('Argument "tail" must be one of ["two", "upper", '
                         '"lower"]')
    return p_value
def p_to_z(p, tail='two'):
    """
    Convert p-values to z-values.

    Parameters
    ----------
    p : array_like
        Array of p-values
    tail : {'one', 'two'}
        Tails to use.

    Returns
    -------
    z : array_like
        Array of z-values
    """
    eps = np.spacing(1)
    p = np.array(p)
    p[p < eps] = eps  # floor p-values at machine epsilon to avoid infinite z
    if tail == 'two':
        z = ndtri(1 - (p / 2))
        z = np.array(z)
    elif tail == 'one':
        z = ndtri(1 - p)
        z = np.array(z)
        z[z < 0] = 0
    else:
        raise ValueError('Argument "tail" must be one of ["one", "two"]')

    if z.shape == ():
        z = z[()]  # return a scalar if the input was scalar
    return z
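A quick shape check with random inputs might look like this (untested; it assumes the imports above and tedana's get_coeffs are available, and n_iters is kept tiny purely for illustration):

import numpy as np

rng = np.random.RandomState(42)
n_voxels, n_trs, n_comps = 50, 100, 3

data = rng.randn(n_voxels, n_trs)     # S x T
mmix = rng.randn(n_trs, n_comps)      # T x C
mask = np.ones(n_voxels, dtype=bool)  # S

data_Z = computefeats2(data, mmix, mask, n_iters=10)
print(data_Z.shape)                   # (50, 3): voxels x components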
UPDATE: This takes about 4 minutes per iteration with 160 components. To build a null distribution, I think we'd want at least 1000 iterations.
@smoia You've been digging into MELODIC's code a bit. Do you know how they convert their component maps into z-statistics?
If we just use more aggressive dimensionality reduction in the PCA step prior to metric calculation, then the properly calculated statistics should be valid. We can do this by setting n_components to be a proportion of the variance. I think somewhere between 0.95 and 0.99 should remove enough low-variance components to give us reasonable degrees of freedom.
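To illustrate the idea with scikit-learn (not the actual tedana decomposition code; the shapes are made up):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data = rng.randn(200, 1000)  # e.g., volumes x voxels

# A float n_components in (0, 1) keeps only enough components to explain
# that proportion of the variance, dropping low-variance components.
pca = PCA(n_components=0.95, svd_solver='full')
reduced = pca.fit_transform(data)
print(pca.n_components_)  # number of components retained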
If we want to go with the permutation approach, it looks like nilearn.mass_univariate.permuted_ols does what I was thinking, so it's an issue that's been dealt with already I guess. It would add a lot of time to the workflow, but we could maybe support it as an option.
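If we did support it as an option, the call might look roughly like this (a sketch only; the exact signature and return values should be checked against the installed nilearn version):

import numpy as np
from nilearn.mass_univariate import permuted_ols

rng = np.random.RandomState(0)
n_trs, n_comps, n_voxels = 200, 5, 1000

mmix = rng.randn(n_trs, n_comps)   # tested variables: time x components
data = rng.randn(n_trs, n_voxels)  # target variables: time x voxels

# Older nilearn versions return (-log10 p-values, t-scores, max-stat null);
# newer versions can return a dict instead.
neg_log_pvals, t_scores, h0 = permuted_ols(
    mmix, data, model_intercept=True, n_perm=1000, two_sided_test=True)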
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions to tedana! :tada:
This issue isn't as stale as you think, bot.
This issue should be somewhat resolved by our shift to MA-PCA methods (see https://github.com/ME-ICA/tedana/issues/178#issuecomment-512807879), and we decided to refactor the regression function instead of using a nonparametric approach like permuted_ols.
We still need to do the refactor, which includes converting z-values to z-statistics.
This is coming up as well in the BrainHack Donostia project ica-aroma-reorg, where we need a solid version of computefeats2 to reproduce MELODIC's outputs.
Still need the contribution? I can look into it!
I think that the current solution is available here. @CesarCaballeroGaudes @smoia do you know what, if anything, still needs to be done to get it working?
EDIT: I'm happy to handle merge conflicts and any documentation tuning that needs to happen if the core code is working.
Hello, sorry for the delay! @tsalo I think the correct implementation is here. I can open a PR and see what's left to do before merging. What do you think?
That would be amazing. Thank you!
I believe that computefeats2 tries to calculate correlation coefficients between the input data and the mixing matrix (see below) by variance normalizing the data and using np.linalg.lstsq. It then crops extreme values (< -0.999 and > 0.999) and converts the correlation coefficients to z-values. I assumed that the cropping was to prevent large z-values when correlation coefficients approach 1 or -1. However, it doesn't look like the normalization is doing what we want, and the "correlation coefficients" can in fact end up quite large. I added a couple of lines to print out the max and min values of data_R before cropping, and got values as high as +/- 12 with our five-echo test dataset.

If I'm right, then this is a bug, although I don't recall if computefeats2 is used for anything beyond generating component weight maps.

NOTE: It is used to compute WTS in fitmodels_direct, which can impact computed metrics. The mixing matrix is normalized in tedica, so there is no meaningful impact on ICA component maps or metrics, but this isn't done for tedpca, so there are small differences between metrics and more noticeable differences between component maps. The tedpca component maps almost look binarized because so many voxels are cropped.

https://github.com/ME-ICA/tedana/blob/1bc32e46445f1d8c5b69ba2000280e2f8f591321/tedana/model/fit.py#L310-L318

However, before I try fixing this, I want to make sure that I'm interpreting the function correctly. Is anyone familiar enough with this function to take a look?
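For reference, my reading of the intended calculation amounts to something like the sketch below (simplified and untested, not the actual tedana code; correlation_via_lstsq is just an illustrative name):

import numpy as np
from scipy import stats


def correlation_via_lstsq(data, mmix):
    """Illustrate computing 'correlations' via least squares on z-scored inputs."""
    data_z = stats.zscore(data, axis=-1)  # z-score each voxel's time series
    mmix_z = stats.zscore(mmix, axis=0)   # z-score each component's time series
    # With both sides variance-normalized, the least-squares betas only equal
    # correlation coefficients when the regressors are orthogonal; correlated
    # regressors can produce standardized betas well outside [-1, 1], which may
    # be how values like +/- 12 show up before cropping.
    betas, _, _, _ = np.linalg.lstsq(mmix_z, data_z.T, rcond=None)
    data_R = np.clip(betas.T, -0.999, 0.999)  # crop extreme values
    return np.arctanh(data_R)                 # Fisher z-transform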
BTW, to fix the issue (and get betas equivalent to correlation coefficients), I would do the following: