Allow for correlations with regressors in decision tree

tsalo commented 4 months ago

Closes #1009. This is an alternative approach to #1008.

I think we need to support both hardcoded regressor-metric names and wildcard versions (i.e., "*_correlation"). The former allows users to set different thresholds for different regressors, while the latter prevents unnecessary duplication in the decision trees if users want to use a consistent threshold across all regressors.

Changes proposed in this pull request:

Create a new module, metrics.external, for calculating metrics from external regressors.
- Incorporate the external regressor correlations into the metric collection function.
Add a new parameter to the tedana CLI, --external, to pass in a TSV with external regressors.
Draft decision trees that incorporate external regressor correlation metrics.

To do:

[ ] Update an integration test to cover the new behavior.
[ ] Add unit tests to cover the new functions.
[ ] The function that checks if necessary metrics were used doesn't account for regular expressions.
[ ] Autocorrelation-adjusted p-values?
[ ] Summary metric from motion parameters?

codecov[bot] commented 4 months ago

Codecov Report

Attention: 20 lines in your changes are missing coverage. Please review.

Comparison is base (d5512f7) 89.55% compared to head (2327120) 89.18%. Report is 1 commits behind head on main.

Files	Patch %	Lines
tedana/selection/selection_utils.py	57.77%	15 Missing and 4 partials :warning:
tedana/workflows/tedana.py	80.00%	0 Missing and 1 partial :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #1021 +/- ## ========================================== - Coverage 89.55% 89.18% -0.38% ========================================== Files 26 27 +1 Lines 3466 3531 +65 Branches 634 655 +21 ========================================== + Hits 3104 3149 +45 - Misses 210 225 +15 - Partials 152 157 +5 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

handwerkerd commented 4 months ago

Thank you for getting this set up! At first glance, this looks good except for two things I want to discuss/suggest:

What numbers do we want to be stored as metrics?

Currently, this is storing the correlation coefficients. We might want to also store the p values. In my earlier testing, both were potentially useful in decision trees. Since we can feed the number of time points into the decision tree, we could get a decent p value from the correlation coefficients within the decision tree code, but we'd need to write something to make that internal calculation easy.

Do we want statistics for combinations of regressors?

In most practical use cases, I think we do. That is, we have 6 or 12 head motion regressors. We don't want a decision tree to take all of those as separate inputs. We want a single metric to test whether a component time series correlates to head motion. This is what I set up in my stand-alone testing for this, and it's not too hard to do. The big issue is that one needs a full model for all regressors along with added regressors for low frequency drifts (also good even for correlation).

The main quirk is that a flexible approach to combining regressors requires a column naming structure to show how groups of regressors should be combined. In my code each column label had a prefix and suffix and regress_dict parses how they should be defined. That could work, but might be a bit ugly. If TSV files could have two column labels that could work. Otherwise, a separate json file that defines and labels groups of columns be a good solution. Thoughts?

tsalo commented 4 months ago

We might want to also store the p values. In my earlier testing, both were potentially useful in decision trees. Since we can feed the number of time points into the decision tree, we could get a decent p value from the correlation coefficients within the decision tree code, but we'd need to write something to make that internal calculation easy.

P-values could be useful, but if we want meaningful ones we should probably account for autocorrelation in the time series (e.g., by using xDF).

We want a single metric to test whether a component time series correlates to head motion.

In that scenario I think we just need to bite the bullet and expect users to provide their own regressors. If there are specific metrics we can pull from the literature, then it might make sense to implement them in tedana, but I would prefer to avoid trying to support infinite flexibility within tedana. We can also incorporate more advanced external metrics in something like fMRIPost-tedana, rather than trying to put them in tedana.

handwerkerd commented 4 months ago

I don't think this is that much more complex to what you have. Right now someone can provide a tsv file with 16 regressors. The added layer is to provide stats for combined groups of regressors. I already have code that does this and it's not particularly long or complex. The only issue is how to define groups of regressors.

I do like the idea for adding something to account for temporal autocorrelation. In my previous work on this, too many things were statistically significant and the overestimation of degrees of freedom almost definitely played a role. I haven't used xDF before, but it might be good to find something fairly light even if it's not as robust. I'll look a bit more into xDF and other options and might ask the AFNI team.

tsalo commented 4 months ago

I don't think this is that much more complex to what you have. Right now someone can provide a tsv file with 16 regressors. The added layer is to provide stats for combined groups of regressors. I already have code that does this and it's not particularly long or complex. The only issue is how to define groups of regressors.

I have to imagine there are many ways to do this though. E.g., calculating a simple mean vs. sum of squares vs. a formula like FD. Can we commit to adding one option and requiring users to calculate their own regressors if they want anything different?

I'll look a bit more into xDF and other options and might ask the AFNI team.

Thanks! Ted and I have actually been talking with Tom Nichols and Taki Shinohara about extending xDF to work on discontinuous time series (i.e., censored time series), so there should be some forward momentum on maintaining the tool in the near future.

handwerkerd commented 4 months ago

I spoke to Gang Chen from the AFNI team about this degrees of freedom discussion. His perspective, which I agree with, is, for this application, estimating autocorrelation and getting an accurate value for degrees of freedom isn't worth the effort. We are creating semi-arbitrary thresholds to estimate whether a component should be rejected. Most runs will have over 150 time points. If a DOF correct method says p<0.05 is r>0.159 (150 DOF) vs r>0.195 (100 DOF) it's not a huge difference. Given most runs are now more than 150 volumes and a DOF correction for non-temporally smoothed data won't have us losing 1/3 of our DOF, the statistical threshold different will be really small. For F values, the DOF difference between 120 and infinity is around 0.04.

How we create our decision criteria and select our arbitrary threshold will dwarf any effect of having the DOF estimate slightly too large.

handwerkerd commented 4 months ago

I would like to switch from correlation to T and F values which would make polynomial detrending possible. This is critical since we don't want a slow drift to mean a head motion line doesn't have a high correlation. An F stat would make it possible to combine regressors into a single stat (pending an agreement on how to parse which regressors to combine).

I will try to block off some time to work on this tomorrow & then see what you think.

tsalo commented 4 months ago

I've changed the target branch from ME-ICA@main to ME-ICA@enh/external-regressors. We can merge this PR whenever we want and then open new PRs to that new branch to make further changes.

handwerkerd commented 3 months ago

I think #1064 replaces this and that one is a branch of /ME-ICA/tedana so it will be a bit easier for several of us to add stuff to if. @tsalo If you agree, then close this PR.

BTW, looking at this PR, it seems like I'm given the option to squash and merge with no passing reviews and no warnings. I'm not sure if something got off with those settings or it's just a one-off oddity.

eurunuela commented 3 months ago

BTW, looking at this PR, it seems like I'm given the option to squash and merge with no passing reviews and no warnings. I'm not sure if something got off with those settings or it's just a one-off oddity.

You're right, it does allow me to squash and merge too...

Edit: ah, this is because you're trying to merge into ME-ICA:enh/external-regressors. The branch protection rules only apply to main.

ME-ICA / tedana

Allow for correlations with regressors in decision tree #1021

Codecov Report