ME-ICA / tedana

TE-dependent analysis of multi-echo fMRI
https://tedana.readthedocs.io
GNU Lesser General Public License v2.1

Audit workflow arguments #162

Closed: tsalo closed this issue 5 years ago

tsalo commented 5 years ago

In addition to the arguments we're considering modifying or adding in #161, I think we should audit several of the other arguments in the workflow.

FastICA parameters

Now that we're using sklearn for the ICA, we get a warning in cases of non-convergence instead of an error. Overall, I think this is a good thing, but it also obscures potential problems. The two relevant arguments to sklearn.decomposition.FastICA are max_iter (maximum number of iterations) and tol (convergence tolerance). sklearn's default for tol is 0.0001, while our default is 0.000025, set by the conv argument. My questions here are: is conv something anyone ever plays with (i.e., do we actually need the argument), and is there a reason to prefer our much stricter threshold over sklearn's default? If not, then I propose that we drop the conv argument from tedana and let sklearn use its default in tedica.
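
For concreteness, a minimal sketch of the two configurations being compared (the `n_components` value is a placeholder; this is not the actual tedica code):

```python
from sklearn.decomposition import FastICA

n_components = 50  # placeholder; tedana derives this from the PCA step

# Current behavior: conv=2.5e-5 is passed through as sklearn's tol
ica_current = FastICA(n_components=n_components, tol=2.5e-5)

# Proposed behavior: drop conv and accept sklearn's default (tol=1e-4)
ica_proposed = FastICA(n_components=n_components)
```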

On a related note, sklearn's default for the maximum number of iterations is 200, while mdp's was 5000, IIRC. The problem is that FastICA is failing to converge for both of our test datasets (Cornell 3-echo rest and NIH 5-echo task). I think we should increase max_iter to at least 1000 when we call FastICA. This will probably increase the amount of time tedica takes, but will probably still be faster than mdp.
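
If we do raise max_iter, we could also surface the non-convergence warning explicitly so it doesn't get lost in the logs. A sketch using sklearn's ConvergenceWarning (the data array here is random filler, not real fMRI data):

```python
import warnings

import numpy as np
from sklearn.decomposition import FastICA
from sklearn.exceptions import ConvergenceWarning

data = np.random.random((1000, 64))  # filler for a (timepoints x voxels) array

ica = FastICA(n_components=20, max_iter=1000)  # raised from sklearn's default of 200
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always', ConvergenceWarning)
    sources = ica.fit_transform(data)

if any(issubclass(w.category, ConvergenceWarning) for w in caught):
    print('FastICA did not converge; results may be unstable.')
```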

Other workflow parameters

I think that --filecsdata and --strict are currently unused. Unless we plan to re-implement them, I think we can drop them.

Does anyone use --sourceTEs, --denoiseTEs, or --stabilize?

handwerkerd commented 5 years ago

Just glancing at the code, it looks like --filecsdata and --strict were post-meica v2.5 additions, so whatever they did may have been removed when the code was revised back towards v2.5. I don't have time right now to dig into what information was saved with --filecsdata, but if we're rethinking how to record component selection information, I concur that it's better to focus on doing this right than on keeping a specific option label.

I haven't used --sourceTEs, but depending how it's implemented, I could see it being a useful option. For something like BIDS compliant data where you don't need to give it every echo's file name, it would be good to have an option that lets you run the algorithm using just a subset of the acquired echoes. I don't know how this option is implemented in practice, but I think it's a reasonable bit of functionality to keep.

Given that --denoiseTEs and --stabilize both have opaque functionality based on their names and seem to be partially subsumed by other options, I don't have any particular attachment to them.

tsalo commented 5 years ago

@handwerkerd Thanks for your input. I've opened a new PR with relevant changes (#163), but will keep --sourceTEs untouched.

Also, given that we want to make MLE the default component selection method for TEDPCA, we should revisit kdaw and rdaw as well, since they're only used for the decision tree approach. I don't know if we want arguments for parameters of a non-default method that only power users will touch.

tsalo commented 5 years ago

It actually seems like --sourceTEs is not equivalent to inputting a subset of the echoes to the full pipeline. The first time it's used is in tedpca, so all echoes are used for T2* estimation and optimal combination, as well as to fit dependence models in component selection. When --sourceTEs is used, tedpca is performed on the concatenated data for just the selected echoes, and returns a dimensionally reduced version of that concatenated data. Then, tedica seems to be performed on the dimensionally reduced concatenated data, rather than on a dimensionally reduced version of the optimally combined data (default) or of the full concatenated data (when --sourceTEs is set to -1).
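
To make the branching concrete, here is a schematic of that data flow as I understand it (the function and variable names are made up for illustration, not the actual tedana internals):

```python
import numpy as np

def select_decomposition_input(data_echoes, optcom, source_tes=None):
    """Schematic of the --sourceTEs behavior described above.

    data_echoes : list of (voxels x time) arrays, one per echo
    optcom      : (voxels x time) optimally combined data
    source_tes  : None (default: use optcom), -1 (concatenate all echoes),
                  or a list of 1-based echo indices to concatenate
    """
    if source_tes is None:
        return optcom
    if source_tes == -1:
        return np.vstack(data_echoes)
    return np.vstack([data_echoes[i - 1] for i in source_tes])
```

tedpca then reduces whichever array this returns, and tedica decomposes that reduced array rather than the reduced optimally combined data.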

It would be nice to know what the effect of running tedpca and tedica on concatenated data, rather than on optimally combined data, actually is. Does anyone know the conceptual basis for the former (given that the latter is the default)?


On another note, in terms of revisiting rdaw and kdaw, how important are these arguments when using the decision tree version of TEDPCA? @dowdlelt and @handwerkerd, I believe that you two know the most about these arguments.

If removing the arguments is no good, what about combining them into a single, comma-separated argument? At minimum, we can place those arguments in their own section of the CLI argument documentation, to make it clear that they're only used if the decision tree is used.

dowdlelt commented 5 years ago

In my experience, kdaw has a particularly dramatic impact on convergence. I primarily experimented with kdaw, given that it was easily modifiable when calling the function. In cases where the default kdaw resulted in far too many components (my data consist of >450 timepoints), I could lower the value to reduce the number of components. Sometimes the problem was the inverse. I did not like this fiddliness. The big reason is that I don't have a good understanding of what the parameters do, exactly. I also don't like subject-specific, arbitrary choices in data processing.

Is it correct that they determine the cutoff for thresholding the PCA maps, i.e., for determining the number of PCA components to keep for ICA? I base that on these commit comments from bitbucket here:

- "Reduced default kdaw to 5, and rdaw to 0"
- "New rdaw setting of 0, which is a fixed F(1,ne-1) value corresponding to p<0.05"

I am very much looking forward to not using them. That said, I think a single comma-separated parameter would be fine for anyone going to all of the trouble of using the decision-tree TEDPCA instead of the new defaults.

emdupre commented 5 years ago

Sorry for being so delayed in getting to this thread, and thanks for starting it @tsalo !! A few thoughts:

tsalo commented 5 years ago

@dowdlelt Yes, I think that they operate like that. I believe that increasing kdaw (rdaw) should decrease the Kappa (Rho) threshold, which in turn will increase the number of significant components from the PCA to keep for ICA. This is based on the following:

https://github.com/ME-ICA/tedana/blob/6e21c5de51b5cd0b1b11cc1c63fe91bed3ba97a0/tedana/decomposition/eigendecomp.py#L251-L252

First, the three threshold sources (fmin, the Kappa elbow, and fmid) are sorted in ascending order, so the weight for averaging from kdaw corresponds to the smallest of the three. If kdaw is greater than 1, then the smallest value will have the highest weight, leading to a lower Kappa threshold.
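
Paraphrasing those lines, the threshold computation is roughly as follows (illustrative values and simplified variable names, not the verbatim source):

```python
import numpy as np

kdaw = 10.0
fmin, kappa_elbow, fmid = 5.0, 20.0, 40.0  # illustrative threshold candidates

# The candidates are sorted ascending, so kdaw weights the smallest one.
kappa_thresh = np.average(sorted([fmin, kappa_elbow, fmid]),
                          weights=[kdaw, 1.0, 1.0])

# With kdaw=10: (10*5 + 20 + 40) / 12 ≈ 9.2, versus an unweighted mean of
# ~21.7, so the threshold drops and more components survive into the ICA.
```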

Of course, -1 is a special value for both kdaw and rdaw, but I don't quite understand the math that's used in that case. I also noticed that using negative values for weights in numpy.average gives nonsensical results (as discussed here), so we should probably check the argument range in the function.
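
The negative-weight problem is easy to demonstrate, and a range check (hypothetical, not in the current code) would be cheap:

```python
import numpy as np

vals = [5.0, 20.0, 40.0]

# A negative weight can push the weighted mean outside the data range entirely:
print(np.average(vals, weights=[-1.0, 1.0, 1.0]))  # 55.0, above max(vals)

def check_daw(daw, name):
    """Hypothetical guard: allow non-negative values plus the special -1."""
    if daw < 0 and daw != -1:
        raise ValueError(f'{name} must be non-negative or exactly -1, got {daw}')
```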

Perhaps just collecting kdaw and rdaw in a decision tree-specific subgroup is enough.


@emdupre

tsalo commented 5 years ago

I think that everything discussed here so far has been handled by #163 and #164. The only other thing I was thinking we could work on is simplifying how manually accepted components are handled. For starters, I propose that we could move --manacc, --ctab, and --mix into a subgroup like "Arguments for re-running tedana" or something.

A more aggressive change (that I'm somewhat in favor of) would be to drop --ctab and --mix. Since the output files follow a naming convention, we know what the component table and mixing matrix files will be named when someone runs tedana twice on the same data. In cases where a user provides --manacc, tedana could just grab the component table and mixing matrix from the output directory, as long as they exist. I guess it just depends on what workflow people prefer. Would they rather run tedana on their input data with a label like "initial", manually select components, and then re-run with a label like "final" (which will generate some, but not all, of the tedana outputs), or would they rather just run tedana twice with the same label? We can always add relevant information to the logger to make it clear that files are being re-generated, so I don't see much of a cost to the latter approach.
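
For the aggressive option, the lookup would amount to something like the sketch below (the file names are placeholders for whatever the naming convention actually produces):

```python
import os.path as op

def find_previous_outputs(out_dir):
    """Locate a prior run's component table and mixing matrix.

    The file names here are placeholders, not guaranteed tedana outputs.
    """
    ctab = op.join(out_dir, 'comp_table.txt')
    mmix = op.join(out_dir, 'meica_mix.1D')
    missing = [f for f in (ctab, mmix) if not op.isfile(f)]
    if missing:
        raise FileNotFoundError(
            'Cannot honor --manacc without a prior run; missing: '
            + ', '.join(missing))
    return ctab, mmix
```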

jbteves commented 5 years ago

Hi all, I'm trying to track some of the changes made between the BitBucket 3.2.2 version and this one. Is the current version robust to kdaw and rdaw values? If not, see https://github.com/ME-ICA/me-ica/issues/10; at some point the default kdaw and rdaw were changed between there and here. (This seemed like the best place to put this comment; I'll move it if it's wrong.)

emdupre commented 5 years ago

Thanks for confirming ! We've removed the kdaw and rdaw flags in https://github.com/ME-ICA/tedana/commit/69624dc91416f4046fbc1f34eff89d00faa0c243, in line with our Transparent and Reproducible Processing milestone (since it was difficult for users to form opinions on how / when to adjust those values).

tsalo commented 5 years ago

The default values of 10 and 1 for kdaw and rdaw, respectively, are now hardcoded here when users use the kundu option for --tedpca (which is no longer the default). Those values appear to be the same as the ones used in the most recent version of ME-ICA, which was updated about one year ago.

Our rationale for hardcoding these values is, as @emdupre said, that they're difficult for users to understand and that we've changed the default tedpca method to MLE dimensionality estimation. That said, power users can go into the code and edit these values as needed.
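
For anyone who does want to tweak them, the relevant logic amounts to something like this (a paraphrase, not the verbatim source):

```python
def get_daw_values(tedpca='mle'):
    """Paraphrase of the hardcoding described above."""
    if tedpca == 'kundu':
        return 10.0, 1.0  # kdaw, rdaw for the decision-tree method
    return None, None     # the MLE method ignores both
```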

jbteves commented 5 years ago

Ah, I see: the values I have are from a branch other than master (v3.2, seen here). Has the project decided to retain only the v3 behavior? I realize that the project has decided not to retain backwards compatibility with ME-ICA, but I thought it would be good to make a record of why this divergence exists.

tsalo commented 5 years ago

We don't support v3, although we may merge it back in at some point in the future (as one option, not as the only method). We have shifted to supporting only v2.5 of the component selection algorithm. That information, and why we shifted, is here in the FAQ section of the documentation site, although it's a fairly new addition.

jbteves commented 5 years ago

Okay, I see, sorry about all that.

emdupre commented 5 years ago

Thanks for checking on it ! It's good to have more people thinking through these choices, too :)

tsalo commented 5 years ago

I think that we can close this issue now. We may ultimately want to discuss removing or reorganizing --manacc, --ctab, and --mix, but I think that that would be better dealt with in a separate issue in order to prevent bloat.