ME-ICA / tedana

TE-dependent analysis of multi-echo fMRI
https://tedana.readthedocs.io
GNU Lesser General Public License v2.1

Audit workflow arguments #162

Closed: tsalo closed this issue 5 years ago

tsalo commented 5 years ago

In addition to the arguments we're considering modifying or adding in #161, I think we should audit several of the other arguments in the workflow.

FastICA parameters

Now that we're using sklearn for the ICA, we get a warning in cases of non-convergence instead of an error. Overall, I think this is a good thing, but it also obscures potential problems. The two relevant arguments to sklearn.decomposition.FastICA are max_iter (maximum number of iterations) and tol (convergence tolerance). sklearn's default for tol is 0.0001, while our default is 0.000025, set by the conv argument. My questions here are: is conv something anyone ever plays with (i.e., do we actually need the argument), and is there a reason to prefer our much stricter threshold over sklearn's default? If not, then I propose that we drop the conv argument from tedana and let sklearn use its default in tedica.
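
For concreteness, a minimal sketch of the two configurations being compared (the `n_components` value is a placeholder; this is not the actual tedica code):

```python
from sklearn.decomposition import FastICA

n_components = 50  # placeholder; tedana derives this from the PCA step

# Current behavior: conv=2.5e-5 is passed through as sklearn's tol
ica_current = FastICA(n_components=n_components, tol=2.5e-5)

# Proposed behavior: drop conv and accept sklearn's default (tol=1e-4)
ica_proposed = FastICA(n_components=n_components)
```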

On a related note, sklearn's default for the maximum number of iterations is 200, while mdp's was 5000, IIRC. The problem is that FastICA is failing to converge for both of our test datasets (Cornell 3-echo rest and NIH 5-echo task). I think we should increase max_iter to at least 1000 when we call FastICA. This will probably increase the amount of time tedica takes, but will probably still be faster than mdp.
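
If we do raise max_iter, we could also surface the non-convergence warning explicitly so it doesn't get lost in the logs. A sketch using sklearn's ConvergenceWarning (the data array here is random filler, not real fMRI data):

```python
import warnings

import numpy as np
from sklearn.decomposition import FastICA
from sklearn.exceptions import ConvergenceWarning

data = np.random.random((1000, 64))  # filler for a (timepoints x voxels) array

ica = FastICA(n_components=20, max_iter=1000)  # raised from sklearn's default of 200
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always', ConvergenceWarning)
    sources = ica.fit_transform(data)

if any(issubclass(w.category, ConvergenceWarning) for w in caught):
    print('FastICA did not converge; results may be unstable.')
```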

Other workflow parameters

I think that --filecsdata and --strict are currently unused. Unless we plan to re-implement them, I think we can drop them.

Does anyone use --sourceTEs, --denoiseTEs, or --stabilize?

handwerkerd commented 5 years ago

Just glancing at the code, it looks like --filecsdata and --strict were post-meica v2.5 additions, so whatever they did may have been removed when the code was revised back towards v2.5. I don't have time right now to dig into what information was saved with --filecsdata, but if we're rethinking how to record component selection information, I concur that it's better to focus on doing this right than on keeping a specific option label.

I haven't used --sourceTEs, but depending how it's implemented, I could see it being a useful option. For something like BIDS compliant data where you don't need to give it every echo's file name, it would be good to have an option that lets you run the algorithm using just a subset of the acquired echoes. I don't know how this option is implemented in practice, but I think it's a reasonable bit of functionality to keep.

Given that --denoiseTEs and --stabilize both have opaque functionality based on their names and seem to be partially subsumed by other options, I don't have any particular attachment to them.

tsalo commented 5 years ago

@handwerkerd Thanks for your input. I've opened a new PR with relevant changes (#163), but will keep --sourceTEs untouched.

Also, given that we want to make MLE the default component selection method for TEDPCA, we should revisit kdaw and rdaw as well, since they're only used for the decision tree approach. I don't know if we want arguments for parameters of a non-default method that only power users will touch.

tsalo commented 5 years ago

It actually seems like --sourceTEs is not equivalent to inputting a subset of the echoes to the full pipeline. The first time it's used is in tedpca, so all echoes are used for T2* estimation and optimal combination, as well as to fit dependence models in component selection. When --sourceTEs is used, tedpca is performed on the concatenated data for just the selected echoes, and returns a dimensionally reduced version of that concatenated data. Then, tedica seems to be performed on the dimensionally reduced concatenated data, rather than on a dimensionally reduced version of the optimally combined data (default) or of the full concatenated data (when --sourceTEs is set to -1).
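
To make the branching concrete, here is a schematic of that data flow as I understand it (the function and variable names are made up for illustration, not the actual tedana internals):

```python
import numpy as np

def select_decomposition_input(data_echoes, optcom, source_tes=None):
    """Schematic of the --sourceTEs behavior described above.

    data_echoes : list of (voxels x time) arrays, one per echo
    optcom      : (voxels x time) optimally combined data
    source_tes  : None (default: use optcom), -1 (concatenate all echoes),
                  or a list of 1-based echo indices to concatenate
    """
    if source_tes is None:
        return optcom
    if source_tes == -1:
        return np.vstack(data_echoes)
    return np.vstack([data_echoes[i - 1] for i in source_tes])
```

tedpca then reduces whichever array this returns, and tedica decomposes that reduced array rather than the reduced optimally combined data.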

It would be nice to know what the effect of running tedpca and tedica on concatenated data, rather than on optimally combined data, actually is. Does anyone know the conceptual basis for the former (given that the latter is the default)?


On another note, in terms of revisiting rdaw and kdaw, how important are these arguments when using the decision tree version of TEDPCA? @dowdlelt and @handwerkerd, I believe that you two know the most about these arguments.

If removing the arguments is no good, what about combining them into a single, comma-separated argument? At minimum, we can place those arguments in their own section of the CLI argument documentation, to make it clear that they're only used if the decision tree is used.

dowdlelt commented 5 years ago

In my experience, kdaw has a particularly dramatic impact on convergence. I primarily experimented with kdaw, given that it was easily modifiable when calling the function. In cases where the default kdaw resulted in far too many components (my data consist of >450 timepoints), I could lower the value to reduce the number of components. Sometimes the problem was the inverse. I did not like this fiddliness. The big reason is that I don't have a good understanding of what the parameters do, exactly. I also don't like subject-specific, arbitrary choices in data processing.

Is it correct that they determine the cutoff for thresholding the PCA maps, i.e., for determining the number of PCA components to keep for ICA? I base that on these commit comments from bitbucket here:

- "Reduced default kdaw to 5, and rdaw to 0"
- "New rdaw setting of 0, which is a fixed F(1,ne-1) value corresponding to p<0.05"

I am very much looking forward to not using them. That said, I think a single comma-separated parameter would be fine for anyone going to all of the trouble of using the decision-tree TEDPCA instead of the new defaults.

emdupre commented 5 years ago

Sorry for being so delayed in getting to this thread, and thanks for starting it @tsalo !! A few thoughts:

tsalo commented 5 years ago

@dowdlelt Yes, I think that they operate like that. I believe that increasing kdaw (rdaw) should decrease the Kappa (Rho) threshold, which in turn will increase the number of significant components from the PCA to keep for ICA. This is based on the following:

https://github.com/ME-ICA/tedana/blob/6e21c5de51b5cd0b1b11cc1c63fe91bed3ba97a0/tedana/decomposition/eigendecomp.py#L251-L252

First, the three threshold sources (fmin, the Kappa elbow, and fmid) are sorted in ascending order, so the weight for averaging from kdaw corresponds to the smallest of the three. If kdaw is greater than 1, then the smallest value will have the highest weight, leading to a lower Kappa threshold.
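
Paraphrasing those lines, the threshold computation is roughly as follows (illustrative values and simplified variable names, not the verbatim source):

```python
import numpy as np

kdaw = 10.0
fmin, kappa_elbow, fmid = 5.0, 20.0, 40.0  # illustrative threshold candidates

# The candidates are sorted ascending, so kdaw weights the smallest one.
kappa_thresh = np.average(sorted([fmin, kappa_elbow, fmid]),
                          weights=[kdaw, 1.0, 1.0])

# With kdaw=10: (10*5 + 20 + 40) / 12 ≈ 9.2, versus an unweighted mean of
# ~21.7, so the threshold drops and more components survive into the ICA.
```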

Of course, -1 is a special value for both kdaw and rdaw, but I don't quite understand the math that's used in that case. I also noticed that using negative values for weights in numpy.average gives nonsensical results (as discussed here), so we should probably check the argument range in the function.
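
The negative-weight problem is easy to demonstrate, and a range check (hypothetical, not in the current code) would be cheap:

```python
import numpy as np

vals = [5.0, 20.0, 40.0]

# A negative weight can push the weighted mean outside the data range entirely:
print(np.average(vals, weights=[-1.0, 1.0, 1.0]))  # 55.0, above max(vals)

def check_daw(daw, name):
    """Hypothetical guard: allow non-negative values plus the special -1."""
    if daw < 0 and daw != -1:
        raise ValueError(f'{name} must be non-negative or exactly -1, got {daw}')
```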

Perhaps just collecting kdaw and rdaw in a decision tree-specific subgroup is enough.


@emdupre

tsalo commented 5 years ago

I think that everything discussed here so far has been handled by #163 and #164. The only other thing I was thinking we could work on is simplifying how manually accepted components are handled. For starters, I propose that we could move --manacc, --ctab, and --mix into a subgroup like "Arguments for re-running tedana" or something.

A more aggressive change (that I'm somewhat in favor of) would be to drop --ctab and --mix. Since the output files follow a naming convention, we know what the component table and mixing matrix files will be named when someone runs tedana twice on the same data. In cases where a user provides --manacc, tedana could just grab the component table and mixing matrix from the output directory, as long as they exist. I guess it just depends on what workflow people prefer. Would they rather run tedana on their input data with a label like "initial", manually select components, and then re-run with a label like "final" (which will generate some, but not all, of the tedana outputs), or would they rather just run tedana twice with the same label? We can always add relevant information to the logger to make it clear that files are being re-generated, so I don't see much of a cost to the latter approach.
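
For the aggressive option, the lookup would amount to something like the sketch below (the file names are placeholders for whatever the naming convention actually produces):

```python
import os.path as op

def find_previous_outputs(out_dir):
    """Locate a prior run's component table and mixing matrix.

    The file names here are placeholders, not guaranteed tedana outputs.
    """
    ctab = op.join(out_dir, 'comp_table.txt')
    mmix = op.join(out_dir, 'meica_mix.1D')
    missing = [f for f in (ctab, mmix) if not op.isfile(f)]
    if missing:
        raise FileNotFoundError(
            'Cannot honor --manacc without a prior run; missing: '
            + ', '.join(missing))
    return ctab, mmix
```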

jbteves commented 5 years ago

Hi all, I'm trying to track some of the changes made between the BitBucket 3.2.2 version and this one. Is the current version robust to kdaw and rdaw values? If not, see https://github.com/ME-ICA/me-ica/issues/10; at some point the default kdaw and rdaw were changed between there and here. (This seemed like the best place to put this comment; I'll move it if it's wrong.)

emdupre commented 5 years ago

Thanks for confirming ! We've removed the kdaw and rdaw flags in https://github.com/ME-ICA/tedana/commit/69624dc91416f4046fbc1f34eff89d00faa0c243, in line with our Transparent and Reproducible Processing milestone (since it was difficult for users to form opinions on how / when to adjust those values).

tsalo commented 5 years ago

The default values of 10 and 1 for kdaw and rdaw, respectively, are now hardcoded here when users use the kundu option for --tedpca (which is no longer the default). Those values appear to be the same as the ones used in the most recent version of ME-ICA, which was updated about one year ago.

Our rationale for hardcoding these values is, as @emdupre said, that they're difficult for users to understand and that we've changed the default tedpca method to MLE dimensionality estimation. That said, power users can go into the code and edit these values as needed.
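
For anyone who does want to tweak them, the relevant logic amounts to something like this (a paraphrase, not the verbatim source):

```python
def get_daw_values(tedpca='mle'):
    """Paraphrase of the hardcoding described above."""
    if tedpca == 'kundu':
        return 10.0, 1.0  # kdaw, rdaw for the decision-tree method
    return None, None     # the MLE method ignores both
```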

jbteves commented 5 years ago

Ah, I see: the values I have are from a branch other than master (v3.2, seen here). Has the project decided to retain only the v3 behavior? I realize that the project has decided not to retain backwards compatibility with ME-ICA, but I thought it would be good to make a record of why this divergence exists.

tsalo commented 5 years ago

We don't support v3, although we may merge it back in at some point in the future (as one option, not as the only method). We have shifted to supporting only v2.5 of the component selection algorithm. That information, and why we shifted, is here in the FAQ section of the documentation site, although it's a fairly new addition.

jbteves commented 5 years ago

Okay, I see, sorry about all that.

emdupre commented 5 years ago

Thanks for checking on it ! It's good to have more people thinking through these choices, too :)

tsalo commented 5 years ago

I think that we can close this issue now. We may ultimately want to discuss removing or reorganizing --manacc, --ctab, and --mix, but I think that that would be better dealt with in a separate issue in order to prevent bloat.