computefeats2 does not calculate z-statistics accurately

tsalo commented 5 years ago

I believe that computefeats2 tries to calculate correlation coefficients between the input data and the mixing matrix (see below) by variance normalizing the data and using np.linalg.lstsq. It then crops extreme values (<-0.999 and >0.999) and converts the correlation coefficients to z-values. I assumed that the cropping was to prevent large z-values with correlation coefficients approaching 1 and -1. However, it doesn't look like normalization is doing what we want, and the "correlation coefficients" can in fact end up quite large. I added in a couple of lines to print out the max and min values of data_R before cropping, and got values as high as +/- 12 with our five-echo test dataset.

If I'm right, then this is a bug, although I don't recall if computefeats2 is used for anything beyond generating component weight maps.

NOTE: It is used to compute WTS in fitmodels_direct, which can impact computed metrics. The mixing matrix is normalized in tedica, so there is no meaningful impact on ICA component maps or metrics, but this isn't done for tedpca, so there are small differences between metrics and more noticeable differences between component maps. The tedpca component maps almost look binarized because so many voxels are cropped.

https://github.com/ME-ICA/tedana/blob/1bc32e46445f1d8c5b69ba2000280e2f8f591321/tedana/model/fit.py#L310-L318

However, before I try fixing this, I want to make sure that I'm interpreting the function correctly. Is anyone familiar enough with this function to take a look?

BTW, to fix the issue (and get betas equivalent to correlation coefficients), I would do the following:

data_z = stats.zscore(data[mask], axis=-1)
mmix_z = stats.zscore(mmix, axis=0)

# get betas of `data`~`mmix` and limit to range [-0.999, 0.999]
data_R = get_coeffs(data_z, mmix_z, mask=None)

tsalo commented 5 years ago

If we're interested in getting z-scores for the betas, then I think the following is an appropriate replacement for computefeats2:

def t_to_z(t_values, dof):
    """
    From Vanessa Sochat's TtoZ package.
    """
    # Select just the nonzero voxels
    nonzero = t_values[t_values != 0]

    # We will store our results here
    z_values = np.zeros(len(nonzero))

    # Select values less than or == 0, and greater than zero
    c = np.zeros(len(nonzero))
    k1 = (nonzero <= c)
    k2 = (nonzero > c)

    # Subset the data into two sets
    t1 = nonzero[k1]
    t2 = nonzero[k2]

    # Calculate p values for <=0
    p_values_t1 = stats.t.cdf(t1, df=dof)
    z_values_t1 = stats.norm.ppf(p_values_t1)
    z_values_t1[np.isinf(z_values_t1)] = stats.norm.ppf(1e-16)

    # Calculate p values for > 0
    p_values_t2 = stats.t.cdf(-t2, df=dof)
    z_values_t2 = -stats.norm.ppf(p_values_t2)
    z_values_t2[np.isinf(z_values_t2)] = -stats.norm.ppf(1e-16)
    z_values[k1] = z_values_t1
    z_values[k2] = z_values_t2

    # Write new image to file
    out = np.zeros(t_values.shape)
    out[t_values != 0] = z_values
    return out

def get_coeffs_and_tstats(X, y):
    """
    From https://stackoverflow.com/a/42677750/2589328
    """
    assert X.ndim == 2
    if y.ndim != 2:
        y = y[None, :]

    assert X.shape[0] == y.shape[-1]
    y = y.T
    X -= np.mean(X, axis=0)
    y -= np.mean(y, axis=0)

    slopes, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    df = X.shape[0] - (X.shape[1] + 1)
    pred = np.dot(X, slopes)

    # N DVs
    MSE = sum((y - pred) ** 2) / df
    # N IVs
    var_b = np.dot(np.atleast_2d(MSE).T,
                   np.atleast_2d(np.linalg.inv(np.dot(X.T, X)).diagonal()))
    sd_b = np.sqrt(var_b)
    slopes = slopes.T
    tstats = slopes / sd_b
    zstats = t_to_z(tstats, df)
    return slopes, zstats

tsalo commented 5 years ago

The problem with my proposed approach above is that test statistics for parameter estimates seem to be meaningless when the degrees of freedom are low. The t-statistics end up being grossly inflated (often in the millions to billions with dof = 1), and the z-statistics then end up being binarized at +/- 8 (the max value).

Here are a couple of possible solutions:

Nonparametric analysis of parameter estimates.
- We just run a bunch of iterations in which the time series for each component is randomized and then record that component's parameter estimate for each voxel. We then end up with a null distribution of parameter estimates per voxel and per component. The distributions definitely vary by voxel and component, even after z-scoring the mixing matrix and data, so I don't think we can cut corners there.
- This seems like a very defensible method, but is computationally intensive and doesn't seem feasible.
Treat standardized parameter estimates like correlation coefficients and convert them to z-values.
- This is sort of how it's currently implemented, except for a few things. First, both the mixing matrix and data need to be z-scored over time, at which point the parameter estimates should be standardized. Second, np.arctanh(r) just makes the correlation coefficient normally distributed, but we later compare those values to z-value thresholds, so we definitely need the test statistics. We need to do np.arctanh(r) * np.sqrt(n_trs - 3) for it to make any sense.
- I don't think this approach is very defensible. First, we're treating standardized partial regression coefficients like they're standardized regression coefficients from a bivariate regression. Second, we have to crop coefficients more extreme than +/- 0.999, which I don't love.
Use the properly calculated test statistics.
- This is a bad idea, unless we are more aggressive about dimensionality reduction (which will have the effect of increasing degrees of freedom and thus validity of the test statistics).

jbteves commented 5 years ago

Could you elaborate on point 1? If it's good but computationally intense, we can see if we're clever enough to find a way around that problem.

tsalo commented 5 years ago

I believe that it would look something like this [completely untested code]:

def computefeats2(data, mmix, mask, n_iters=1000):
    """
    Converts `data` to component space using `mmix`

    Parameters
    ----------
    data : (S x T) array_like
        Input data
    mmix : (T [x C]) array_like
        Mixing matrix for converting input data to component space, where `C`
        is components and `T` is the same as in `data`
    mask : (S,) array_like
        Boolean mask array
    n_iters : int, optional
        Number of iterations to use to build null distributions. Default is 1000.

    Returns
    -------
    data_Z : (S x C) :obj:`numpy.ndarray`
        Voxel-wise Z-statistics for each component
    """
    if data.ndim != 2:
        raise ValueError('Parameter data should be 2d, not {0}d'.format(data.ndim))
    elif mmix.ndim not in [2]:
        raise ValueError('Parameter mmix should be 2d, not '
                         '{0}d'.format(mmix.ndim))
    elif mask.ndim != 1:
        raise ValueError('Parameter mask should be 1d, not {0}d'.format(mask.ndim))
    elif data.shape[0] != mask.shape[0]:
        raise ValueError('First dimensions (number of samples) of data ({0}) '
                         'and mask ({1}) do not match.'.format(data.shape[0],
                                                               mask.shape[0]))
    elif data.shape[1] != mmix.shape[0]:
        raise ValueError('Second dimensions (number of volumes) of data ({0}) '
                         'and mmix ({1}) do not match.'.format(data.shape[0],
                                                               mmix.shape[0]))

    # z-score inputs to regression
    data_z = stats.zscore(data[mask], axis=-1)
    mmix_z = stats.zscore(mmix, axis=0)

    # get betas of `data`~`mmix` and limit to range [-0.999, 0.999]
    # (S x C)
    data_betas = get_coeffs(data_z, mmix_z, mask=None)
    betas_null = np.zeros(data_betas.shape + (n_iters,))  # S x C x 1000
    for i_iter in range(n_iters):
        for j_comp in range(mmix.shape[1]):
            temp_mmix = mmix_z.copy()
            idx = np.random.RandomState(seed=i_iter).permutation(mmix.shape[0])
            temp_mmix[:, j_comp] = temp_mmix[idx, j_comp]
            temp = get_coeffs(data_z, temp_mmix, mask=None)
            betas_null[:, j_comp, i_iter] = temp[:, j_comp]

    # Get z-values for betas using empirical nulls
    p_values = null_to_p(data_betas, betas_null, tail='two')
    signs = np.sign(data_betas - np.mean(betas_null, axis=-1))
    data_Z = p_to_z(p_values, tail='two')
    data_Z *= signs
    return data_Z

def null_to_p(test_value, null_array, tail='two'):
    """
    Return two-sided p-value for test value against null array.

    Parameters
    ----------
    test_value : :obj:`float`, :obj:`int`, or array_like
        Either a single value to compare to a 1D array or an ND array to
        compare to an N+1D array.
    null_array : array_like
        An array with one more dimension than test_value. The last dimension
        should be the permutations.
    tail : {'two', 'upper', 'lower'}
        Tails to use for the calculation of p-values.

    Returns
    -------
    p_value : :obj:`float` or array_like
        Either a single p-value or an array with the same shape as test_value
    """
    if isinstance(test_value, np.ndarray):
        assert null_array.ndim == (test_value.ndim + 1)
        values_1d = np.reshape(test_value, test_value.size)
        nulls_2d = np.reshape(null_array, (test_value.size, null_array.shape[-1]))
        p_values = np.zeros(values_1d.shape)
        z_values = np.zeros(values_1d.shape)
        for idx in range(values_1d.shape[0]):
            p_values[idx] = null_to_p(values_1d[idx], nulls_2d[idx, :], tail=tail)

        p_values = np.reshape(p_values, test_value.shape)
        return p_values

    if tail == 'two':
        p_value = (50 - np.abs(stats.percentileofscore(
            null_array, test_value) - 50.)) * 2. / 100.
    elif tail == 'upper':
        p_value = 1 - (stats.percentileofscore(null_array, test_value) / 100.)
    elif tail == 'lower':
        p_value = stats.percentileofscore(null_array, test_value) / 100.
    else:
        raise ValueError('Argument "tail" must be one of ["two", "upper", '
                         '"lower"]')
    return p_value

def p_to_z(p, tail='two'):
    """
    Convert p-values to z-values.

    Parameters
    ----------
    p : array_like
        Array of p-values
    tail : {'one', 'two'}
        Tails to use.

    Returns
    -------
    z : array_like
        Array of z-values
    """
    eps = np.spacing(1)
    p = np.array(p)
    p[p < eps] = eps
    if tail == 'two':
        z = ndtri(1 - (p / 2))
        z = np.array(z)
    elif tail == 'one':
        z = ndtri(1 - p)
        z = np.array(z)
        z[z < 0] = 0
    else:
        raise ValueError('Argument "tail" must be one of ["one", "two"]')

    if z.shape == ():
        z = z[()]
    return z

UPDATE: This takes about 4 minutes per iteration with 160 components. To build a null distribution, I think we'd want at least 1000 iterations.

tsalo commented 5 years ago

@smoia You've been digging into MELODIC's code a bit. Do you know how they convert their component maps into z-statistics?

tsalo commented 5 years ago

If we just use more aggressive dimensionality reduction in the PCA step prior to metric calculation, then the properly calculated statistics should be valid. We can do this by setting n_components to be a proportion of the variance. I think somewhere between 0.95 and 0.99 should remove enough low-variance components to give us reasonable degrees of freedom.

tsalo commented 5 years ago

If we want to go with the permutation approach, it looks like nilearn.mass_univariate.permuted_ols does what I was thinking, so it's an issue that's been dealt with already I guess. It would add a lot of time to the workflow, but we could maybe support it as an option.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions to tedana:tada: !

jbteves commented 4 years ago

This issue isn't as stale as you think, bot.

tsalo commented 4 years ago

This issue should be somewhat resolved by our shift to MA-PCA methods (see https://github.com/ME-ICA/tedana/issues/178#issuecomment-512807879), and we decided to refactor the regression function instead of using a nonparametric approach like permuted_ols.

We still need to do the refactor, which includes converting z-values to z-statistics.

tsalo commented 4 years ago

This is coming up as well in the BrainHack Donostia project ica-aroma-reorg, where we need a solid version of computefeats2 to reproduce MELODIC's outputs.

smoia commented 4 years ago

Still need the contribution? I can look into it!

tsalo commented 3 years ago

I think that the current solution is available here. @CesarCaballeroGaudes @smoia do you know what, if anything, still needs to be done to get it working?

EDIT: I'm happy to handle merge conflicts and any documentation tuning that needs to happen if the core code is working.

smoia commented 3 years ago

Hello, sorry for the delay! @tsalo I think the correct implementation is here. I can open a PR and see what's left to do to merge, what do you think?

tsalo commented 3 years ago

That would be amazing. Thank you!

ME-ICA / tedana

computefeats2 does not calculate z-statistics accurately #178