NBCLab / power-replication

A replication and extension of Power et al. (2018)
https://www.overleaf.com/read/swgjxcjqytxg
Apache License 2.0

DDMRA results are dissimilar to original paper's findings #17

Open · tsalo opened this issue 2 years ago

tsalo commented 2 years ago

The DDMRA results I get are pretty dissimilar to the original paper's findings. While I expected some differences driven by our different preprocessing and denoising methods, these go beyond my expectations.

From the replication manuscript, our predictions were:

Based on the findings of the original paper, intercepts and slopes were predicted to be statistically significant for the OC data and for the MEDN Noise data. Intercepts (but not slopes) were predicted to be significant for the MEDN data. Finally, neither intercepts nor slopes were predicted to be significant for the MEDN+GODEC and MEDN+GSR data.

We also hypothesized that the extended derivatives included in the analyses would follow similar patterns. Namely, we predicted that the intercepts, but not the slopes, would be statistically significant for the MEDN+GODEC Noise, MEDN+GSR Noise, MEDN+dGSR Noise, MEDN+aCompCor Noise, and MEDN+MIR Noise data. We also predicted that neither the intercepts nor the slopes would be significant for the MEDN+dGSR, MEDN+aCompCor, and MEDN+MIR data.

Analysis of Cambridge, CamCAN, and DuPre datasets

Here are the results using the Cambridge, CamCAN, and DuPre datasets, dropping only subjects with missing or zero-variance data in one or more ROIs (as well as subjects that had already been flagged in other steps). The table below shows the p-values for the analyses.

| Data | Term | QC:RSFC | High-low | Scrubbing |
|---|---|---|---|---|
| OC | int. | 0.9999 | 0.9990 | 0.0000 |
| OC | slope | 0.6006 | 0.3664 | 0.0000 |
| MEDN | int. | 1.0000 | 0.9991 | 0.0000 |
| MEDN | slope | 0.8616 | 0.2138 | 0.0000 |
| MEDN Noise | int. | 0.9992 | 0.9988 | 0.0000 |
| MEDN Noise | slope | 0.9605 | 0.9649 | 1.0000 |
| MEDN+aCompCor | int. | 0.0000 | 0.0000 | 0.0000 |
| MEDN+aCompCor | slope | 0.0000 | 0.0000 | 0.0000 |
| MEDN+aCompCor Noise | int. | 1.0000 | 1.0000 | 1.0000 |
| MEDN+aCompCor Noise | slope | 1.0000 | 0.9987 | 1.0000 |
| MEDN+GODEC | int. | 0.0000 | 0.0000 | 0.0000 |
| MEDN+GODEC | slope | 0.0014 | 0.0006 | 0.0000 |
| MEDN+GODEC Noise | int. | 1.0000 | 1.0000 | 1.0000 |
| MEDN+GODEC Noise | slope | 1.0000 | 1.0000 | 1.0000 |
| MEDN+GSR | int. | 0.0786 | 0.0184 | 0.0000 |
| MEDN+GSR | slope | 0.1146 | 0.0154 | 0.0000 |
| MEDN+GSR Noise | int. | 1.0000 | 1.0000 | 1.0000 |
| MEDN+GSR Noise | slope | 1.0000 | 1.0000 | 1.0000 |
| MEDN+MIR | int. | 0.3915 | 0.0014 | 0.0000 |
| MEDN+MIR | slope | 0.0474 | 0.0000 | 0.0000 |
| MEDN+MIR Noise | int. | 1.0000 | 1.0000 | 0.9741 |
| MEDN+MIR Noise | slope | 0.9999 | 1.0000 | 1.0000 |
| MEDN+dGSR | int. | 0.4102 | 0.2016 | 0.0000 |
| MEDN+dGSR | slope | 0.2810 | 0.3201 | 0.0000 |
| MEDN+dGSR Noise | int. | 1.0000 | 0.9780 | 0.9834 |
| MEDN+dGSR Noise | slope | 0.9823 | 0.7149 | 0.8673 |

The code used to perform the analysis is this script, and the version of the DDMRA package I used is at https://github.com/tsalo/ddmra/tree/f9be4687d6465c09aae158509566083d456f2567.

Replication on just the Cambridge dataset

In case the problem was due to intersite differences (as mentioned in Power et al., 2017), I ran the analyses using only the Cambridge dataset, which should have produced results very similar to the original paper's.

Note that the scrubbing analysis results still look terrible, but also that the MEDN intercepts were nowhere near significant and the MEDN+GODEC QC:RSFC intercept was significant (with a nearly significant slope).

| Data | Term | QC:RSFC | High-low | Scrubbing |
|---|---|---|---|---|
| OC | int. | 0.0026 | 0.0390 | 0.0000 |
| OC | slope | 0.0036 | 0.0055 | 0.0007 |
| MEDN | int. | 0.7697 | 0.6309 | 0.0000 |
| MEDN | slope | 0.6040 | 0.3776 | 0.0000 |
| MEDN Noise | int. | 0.0022 | 0.0166 | 0.0000 |
| MEDN Noise | slope | 0.0071 | 0.0018 | 1.0000 |
| MEDN+aCompCor | int. | 0.0172 | 0.0697 | 0.0000 |
| MEDN+aCompCor | slope | 0.1502 | 0.2623 | 0.0000 |
| MEDN+aCompCor Noise | int. | 0.9806 | 0.9503 | 1.0000 |
| MEDN+aCompCor Noise | slope | 0.6464 | 0.3469 | 1.0000 |
| MEDN+GODEC | int. | 0.0004 | 0.1383 | 0.0000 |
| MEDN+GODEC | slope | 0.0906 | 0.5315 | 0.0000 |
| MEDN+GODEC Noise | int. | 0.9972 | 0.9665 | 0.9776 |
| MEDN+GODEC Noise | slope | 0.8941 | 0.4836 | 0.0350 |
| MEDN+GSR | int. | 0.2597 | 0.3426 | 0.0000 |
| MEDN+GSR | slope | 0.2942 | 0.3766 | 0.0001 |
| MEDN+GSR Noise | int. | 0.9842 | 0.9733 | 1.0000 |
| MEDN+GSR Noise | slope | 0.6605 | 0.3891 | 1.0000 |
| MEDN+MIR | int. | 0.2531 | 0.3150 | 0.0000 |
| MEDN+MIR | slope | 0.3228 | 0.3305 | 0.0000 |
| MEDN+MIR Noise | int. | 0.0767 | 0.2449 | 0.0273 |
| MEDN+MIR Noise | slope | 0.0249 | 0.0278 | 1.0000 |
| MEDN+dGSR | int. | 0.5025 | 0.2794 | 0.0000 |
| MEDN+dGSR | slope | 0.4939 | 0.4604 | 0.0000 |
| MEDN+dGSR Noise | int. | 0.9688 | 0.8504 | 0.4810 |
| MEDN+dGSR Noise | slope | 0.4127 | 0.3522 | 0.3476 |

Replication on subjects with mean FD < 0.2 mm

Per Power et al., 2017, DDMRA analyses of multiple datasets can be corrupted by differences in baseline levels of motion, as well as outliers.

Specifically, what it says is:

A third way in which motion artifact can be overlooked is if a handful of outlying datasets are allowed to influence the scrubbing or QC:RSFC correlations calculated across many subjects; correlations are sensitive to outlying values and a few scans with marked abnormalities can obscure relationships present across most other datasets.

Also, in the original paper's supplement, they re-ran the analyses using only subjects with mean FD < 0.3 mm and < 0.2 mm, separately, and reported that the results were basically the same. I was concerned that the CamCAN dataset, which has higher levels of motion (probably because it includes older participants), might be causing problems in the main analysis, so I ran the supplement's analysis using only subjects with mean FD < 0.2 mm. However, the results are still quite dissimilar from the original paper's.

I don't know if there is a problem with the analyses/dataset or if the CamCAN dataset is simply driving different, but still valid, results.

| Data | Term | QC:RSFC | High-low | Scrubbing |
|---|---|---|---|---|
| OC | int. | 0.9980 | 0.9973 | 0.0000 |
| OC | slope | 0.2761 | 0.5359 | 0.0000 |
| MEDN | int. | 0.8211 | 0.7266 | 0.0000 |
| MEDN | slope | 0.0272 | 0.1078 | 0.0000 |
| MEDN Noise | int. | 0.9998 | 1.0000 | 0.0526 |
| MEDN Noise | slope | 0.9406 | 0.9796 | 1.0000 |
| MEDN+aCompCor | int. | 0.0171 | 0.0153 | 0.0000 |
| MEDN+aCompCor | slope | 0.0067 | 0.0181 | 0.0000 |
| MEDN+aCompCor Noise | int. | 0.9993 | 0.9985 | 1.0000 |
| MEDN+aCompCor Noise | slope | 0.4379 | 0.8967 | 1.0000 |
| MEDN+GODEC | int. | 0.0000 | 0.0001 | 0.0000 |
| MEDN+GODEC | slope | 0.0239 | 0.0565 | 0.0000 |
| MEDN+GODEC Noise | int. | 1.0000 | 1.0000 | 1.0000 |
| MEDN+GODEC Noise | slope | 0.9495 | 0.9789 | 1.0000 |
| MEDN+GSR | int. | 0.0160 | 0.0641 | 0.0000 |
| MEDN+GSR | slope | 0.0086 | 0.0432 | 0.0000 |
| MEDN+GSR Noise | int. | 1.0000 | 1.0000 | 1.0000 |
| MEDN+GSR Noise | slope | 0.8618 | 0.9847 | 1.0000 |
| MEDN+MIR | int. | 0.0002 | 0.0006 | 0.0000 |
| MEDN+MIR | slope | 0.0000 | 0.0010 | 0.0000 |
| MEDN+MIR Noise | int. | 1.0000 | 1.0000 | 1.0000 |
| MEDN+MIR Noise | slope | 0.9946 | 0.9985 | 1.0000 |
| MEDN+dGSR | int. | 0.1122 | 0.0267 | 0.0000 |
| MEDN+dGSR | slope | 0.2781 | 0.2620 | 0.0000 |
| MEDN+dGSR Noise | int. | 0.6137 | 0.6661 | 0.1692 |
| MEDN+dGSR Noise | slope | 0.0455 | 0.5208 | 0.1339 |

tsalo commented 2 years ago

Based on a conversation with Jonathan Power, here are some ideas about what could be causing the issues:

  1. Different datasets may exhibit different motion patterns, since motion parameters are influenced by sampling rate.
    • This is probably not the case here, because CamCAN and Cambridge have the same TR and are not MB accelerated. They also have very similar mean FD distributions.
  2. Runs exhibiting more respiration-driven motion may be affecting the results, so it's a good idea to try to exclude these runs from the analyses. Respiration-driven motion is characterized by continuous-looking framewise displacement time series, rather than the more erratic patterns found in purely head motion-driven movement parameters.
    • One idea was that median FD would reflect respiration-driven motion more than mean FD, since framewise displacement would always be oscillating around medium levels, rather than staying near zero and spiking with head motions. Therefore, I could look for "humps" in the distribution of median FD, separately for each dataset, to determine appropriate dataset-specific thresholds for the low-motion version of the analyses.
  3. Outliers might be driving the results. These might be apparent in the distributions of ROI-to-ROI correlation coefficients, for each edge, across runs. The z-transformed coefficients should be normally distributed, so the issue would be identifying outliers in ~35k normally distributed sets of ~700 data points, which isn't a typical multivariate outlier detection issue, since the number of variables is so much higher than the number of data points.
tsalo commented 2 years ago

I've been trying to figure out how to identify outlier scans across the edges. Mahalanobis distance, which is probably the optimal choice for data with more samples than variables, won't work in this case because there are definitely more variables than samples. If no one is able to provide any recommendations, I'm leaning toward switching to a different distance measure (e.g., Euclidean) and flagging any subjects that are >2 SD from the mean distance. @handwerkerd does that make sense to you?
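
To make the idea concrete, here is a minimal sketch of what I have in mind, assuming the z-transformed coefficients are stacked into a hypothetical `corr_z` array of shape (n_subjects, n_edges):

```python
import numpy as np


def flag_euclidean_outliers(corr_z, n_std=2.0):
    """Flag subjects whose Euclidean distance from the mean edge vector is
    more than `n_std` standard deviations above the mean distance.

    corr_z is a hypothetical (n_subjects, n_edges) array of Fisher
    z-transformed ROI-to-ROI correlation coefficients.
    """
    mean_vector = corr_z.mean(axis=0)
    distances = np.linalg.norm(corr_z - mean_vector, axis=1)
    threshold = distances.mean() + n_std * distances.std()
    return distances > threshold  # boolean mask of flagged subjects
```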

The other idea, identifying dataset-specific thresholds from median motion, hasn't really panned out. I have plotted median and mean motion across datasets, and I don't see any clearly identifiable peaks on the right side of the median FD distributions. Below is a figure with KDE plots of the two measures across the three datasets we use in Experiment 2.

[Figure: motion_distributions. KDE plots of mean and median FD for each dataset.]
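
For reference, the plots above were made along these lines (a sketch, not the exact plotting code; `motion_summary.tsv`, with one row per run and columns for dataset, mean FD, and median FD, is a hypothetical file):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical table with one row per run: dataset name, mean FD, median FD
motion_df = pd.read_table("motion_summary.tsv")

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, column in zip(axes, ["mean_fd", "median_fd"]):
    # One KDE per dataset, normalized separately so the shapes are comparable
    sns.kdeplot(data=motion_df, x=column, hue="dataset", common_norm=False, ax=ax)
    ax.set_xlabel(f"{column} (mm)")

fig.tight_layout()
fig.savefig("motion_distributions.png")
```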

tsalo commented 2 years ago

Unfortunately, I never heard back from the person I emailed, so I'm thinking I'll use a combination of PCA and Mahalanobis distance to identify outliers. The idea is to use PCA to reduce the 700x35000 array to 700x5 (or something else), and then calculate distance on that, as in this example. After that, I can apply a p < 0.001 threshold based on the chi2 distribution to identify outliers.

Does that sound reasonable?

EDIT: The problem I'm noticing is that the first 5-10 components only explain about 5-6% of the variance, but trying to use the 500 or so that explain 90-95% of the variance leads to a bunch of warnings during the Mahalanobis calculation. Also, even using a threshold of p < 0.001 tends to leave me with a lot of outliers (80-140).
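
For concreteness, this is roughly what I mean (a sketch only, using the same hypothetical `corr_z` subjects-by-edges array as before):

```python
import numpy as np
from scipy import stats
from sklearn.covariance import EmpiricalCovariance
from sklearn.decomposition import PCA


def flag_pca_mahalanobis_outliers(corr_z, n_components=5, alpha=0.001):
    """Reduce the subjects-by-edges array with PCA, then flag subjects whose
    squared Mahalanobis distance in component space exceeds the chi-squared
    threshold at the given alpha."""
    scores = PCA(n_components=n_components).fit_transform(corr_z)
    # sklearn's covariance estimators return *squared* Mahalanobis distances
    squared_distances = EmpiricalCovariance().fit(scores).mahalanobis(scores)
    threshold = stats.chi2.ppf(1 - alpha, df=n_components)
    return squared_distances > threshold  # boolean mask of flagged subjects
```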

handwerkerd commented 2 years ago

Sorry for a bit of a delay in responding. You write, "The z-transformed coefficients should be normally distributed." Is this true? I'm thinking of Fig. 4 of Murphy et al. (2009). Those are correlations to a seed rather than whole-brain correlations, but the distributions are very much not normally distributed until you do a regularization step, like global signal regression.

That said, something like Mahalanobis distance would be good for plotting the distance between each dataset and a normal distribution, which seems to be what you're planning to do.

I don't completely follow what your samples and variables are in this specific analysis, but, instead of fitting to a model based on all your conditions, could you just calculate the distance between all (or many) pairs of distributions and cluster each dataset based on those distances? Then outliers might appear as clusters that are farther from the other datasets than expected.

Hope this helps.

tsalo commented 2 years ago

You write, "The z-transformed coefficients should be normally distributed." Is this true?

I should have said that Fisher's z transform enforces a roughly normal distribution for random correlation coefficients, rather than that the transformed z values will themselves be normal. Still, I think it's safe to work under the assumption that the ROI-to-ROI coefficients will be approximately normal after being transformed.
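
To illustrate the point with null data (just a sanity check, not part of the analysis): for correlations between independent time series, the Fisher-transformed values are approximately normal with standard deviation close to 1/sqrt(n - 3).

```python
import numpy as np

rng = np.random.default_rng(0)
n_timepoints, n_pairs = 200, 5000

# Correlations between pairs of independent time series (null data)
r = np.array([
    np.corrcoef(
        rng.standard_normal(n_timepoints), rng.standard_normal(n_timepoints)
    )[0, 1]
    for _ in range(n_pairs)
])
z = np.arctanh(r)  # Fisher's z transform

# Empirical SD should be close to the theoretical 1 / sqrt(n - 3)
print(z.std(), 1 / np.sqrt(n_timepoints - 3))
```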

I don't completely follow what your samples and variables are in this specific analysis, but, instead of fitting to a model based on all your conditions, could you just calculate the distance between all (or many) pairs of distributions and cluster each dataset based on those distances? Then outliers might appear as clusters that are farther from the other datasets than expected.

In this case, the samples are all of the subjects going into the DDMRA analyses, while the variables are the ROI-to-ROI correlation coefficients for all of the Power ROI pairs. The DDMRA analyses are performed across subjects for each ROI-to-ROI edge, and then a sliding-window average is taken across those edges, ordered by the distance between the ROIs, to get the smoothing curves we see in the DDMRA figures, like this one.
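
The smoothing-curve step looks roughly like this (a sketch of the idea, not the ddmra package's actual implementation; the window and step sizes are arbitrary):

```python
import numpy as np


def smoothing_curve(edge_values, edge_distances, window=1000, step=100):
    """Sliding-window average of per-edge values (e.g., per-edge QC:RSFC
    correlations), with edges sorted by the distance between their ROIs."""
    order = np.argsort(edge_distances)
    values = np.asarray(edge_values)[order]
    distances = np.asarray(edge_distances)[order]

    starts = np.arange(0, values.size - window + 1, step)
    curve = np.array([values[i:i + window].mean() for i in starts])
    curve_distances = np.array([distances[i:i + window].mean() for i in starts])
    return curve_distances, curve
```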

I'm not sure if finding outliers using subsets of the ROI pairs will work unless I have some way of aggregating those findings across all of the pairs (e.g., subjects who are outliers in X of the ~35,000 edges would be removed from the analysis). I think using dimensionality reduction via PCA makes sense, and it has some precedent, though I'm less sure about the thresholds and number of components to use.

handwerkerd commented 2 years ago

It might not be of direct relevance, but Javier and I used PCA to reduce connectivity matrices in https://www.pnas.org/content/112/28/8762#F11, and Supplementary Fig. 5 shows how much cognitive-state-relevant information survives for different levels of variance retained by the PCA.

I haven't personally calculated DDMRA measures so this might be my own confusion, but sometimes it's better to identify outliers a step or two away from your measure of interest. I'm sticking with the idea that you want to identify weird distributions compared to other subjects. One possible way to reduce that space would be to calculate the average connectivity matrix for each dataset. Then, for each subject, you'd be calculating the distance from that average. That wouldn't let you cluster groups of subjects with similar patterns, but it would highlight any subjects that are particularly far from the mean.

tsalo commented 2 years ago

I haven't personally calculated DDMRA measures so this might be my own confusion, but sometimes it's better to identify outliers a step or two away from your measure of interest.

I agree. In this case, the measure of interest is generally going to be a correlation of correlation coefficients against mean framewise displacement, a difference of correlation coefficients between high- and low-motion subjects, or a difference of correlation coefficients between the raw time series and the censored version. What I tried to reject outliers from is the correlation coefficients going into each of those.
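
As a concrete example of the first of those, the per-edge QC:RSFC values are just across-subject correlations, along these lines (a sketch, not necessarily how the ddmra package computes it; `corr_z` and `mean_fd` are hypothetical arrays of shape (n_subjects, n_edges) and (n_subjects,)):

```python
import numpy as np


def qcrsfc_per_edge(corr_z, mean_fd):
    """Correlate each edge's z-transformed correlation coefficients with
    subjects' mean framewise displacement, across subjects."""
    edges_centered = corr_z - corr_z.mean(axis=0)
    fd_centered = mean_fd - mean_fd.mean()
    numerator = edges_centered.T @ fd_centered
    denominator = np.sqrt((edges_centered ** 2).sum(axis=0)) * np.sqrt(
        (fd_centered ** 2).sum()
    )
    return numerator / denominator  # one QC:RSFC value per edge
```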

One possible way to reduce that space would be to calculate the average connectivity matrix for each dataset. Then, for each subject, you'd be calculating the distance from that average.

That's sort of what the Mahalanobis distance was doing, but it's for multivariate normally distributed data. I was thinking of Euclidean distance from the average matrix because Mahalanobis doesn't work directly when the number of variables > number of samples, but it seemed like PCA + Mahalanobis was more established since I could find examples of its use.

However... the results were still extremely significant/nonsignificant when I removed outliers, and I'm totally lost as to what to do at this point. Outlier correlation coefficients don't seem to be driving the significant findings, so I wonder if there's just a problem with how the significance tests are done.

emdupre commented 2 years ago

Have you re-generated these for individual datasets, rather than collapsed across datasets, @tsalo? I believe you mentioned that, for the Cambridge-only analysis, the DDMRA findings were similar to the original work. Is that still true?

tsalo commented 2 years ago

I ran the dataset-specific analyses, but I haven't reviewed them yet. I'll update this comment when I have something to show.