Closed: tsalo closed this issue 5 years ago.
Just glancing at the code, it looks like `--filecsdata` and `--strict` were post-meica v2.5 additions, so it seems like whatever they did may have been removed when the code was revised back toward v2.5. I don't have time right now to dig into what information was saved with `--filecsdata`, but if we're rethinking how to record component selection information, I concur that it's better to focus on doing this right than keeping a specific option label.
I haven't used --sourceTEs, but depending how it's implemented, I could see it being a useful option. For something like BIDS compliant data where you don't need to give it every echo's file name, it would be good to have an option that lets you run the algorithm using just a subset of the acquired echoes. I don't know how this option is implemented in practice, but I think it's a reasonable bit of functionality to keep.
Given that `denoiseTEs` and `stabilize` both have opaque functionality based on their names and seem to be partially subsumed by other options, I don't have any particular attachment to them.
@handwerkerd Thanks for your input. I've opened a new PR with relevant changes (#163), but will keep `--sourceTEs` untouched.
Also, given that we want to make MLE the default component selection method for TEDPCA, we should revisit `kdaw` and `rdaw` as well, since they're only used for the decision tree approach. I don't know if we want command-line arguments for parameters of a non-default method that only power users will use.
It actually seems like `--sourceTEs` is not equivalent to inputting a subset of the echoes to the full pipeline. The first time it's used is in `tedpca`, so all echoes are used for T2* estimation and optimal combination, as well as to fit dependence models in component selection. When `--sourceTEs` is used, `tedpca` is performed on the concatenated data for just the selected echoes, and returns a dimensionally reduced version of that concatenated data. Then, `tedica` seems to be performed on the dimensionally reduced concatenated data, rather than on a dimensionally reduced version of the optimally combined data (default) or of the full concatenated data (when `--sourceTEs` is set to -1).
It would be nice to know what the effect of running `tedpca` and `tedica` on concatenated data rather than optimally combined data is. Does anyone know what the conceptual basis is for the former (given that the latter is the default)?
On another note, in terms of revisiting `rdaw` and `kdaw`, how important are these arguments when using the decision tree version of TEDPCA? @dowdlelt and @handwerkerd, I believe that you two know the most about these arguments.
If removing the arguments is no good, what about combining them into a single, comma-separated argument? At minimum, we can place those arguments in their own section of the CLI argument documentation to make it clear that they're only used if the decision tree is used.
In my experience, `kdaw` has a particularly dramatic impact on convergence. I primarily experimented with `kdaw`, given that it was easily modifiable when calling the function. In cases where the default `kdaw` resulted in far too many components (my data consist of >450 timepoints), I could reduce the default value to reduce the number of components. Sometimes the problem was the inverse. I did not like this fiddliness. The big reason is that I don't have a good understanding of what the parameters do, exactly. I also don't like subject-specific, arbitrary choices in data processing.
Is it correct that they determine the cutoff for thresholding the PCA maps, for determining the number of PC components to keep for ICA? I base that on commit comments from Bitbucket here:

> -Reduced default kdaw to 5, and rdaw to 0 -New rdaw setting of 0, which is fixed F(1,ne-1) value corresponding to p<0.05
I am very much looking forward to not using them. That said, I think a single comma-separated parameter would be reasonable to keep for anyone going to all of the trouble of using TEDPCA instead of the new defaults.
Sorry for being so long-delayed to this thread, and thanks for starting it @tsalo !! A few thoughts:
I am a big fan of cleaning up the workflow arguments, especially since many of them do not align with our milestone of transparency (as evidenced by the confusion even in these discussions). To recap what I think has been agreed on so far:
- Removing the `conv` argument and using `sklearn` defaults
- Removing the `--strict` argument, since we've reverted to the v2.5 decision tree
- Removing the `--filecsdata` argument, which would explicitly write out the component selection data. The previous default was False, but I believe our stance is that the default should be True, so having both this argument and verbose is confusing

It also sounds like we're converging on:

- Removing the `--kdaw` and `--rdaw` flags, both of which were dimensionality augmentation ratings for the two parameters and very difficult to interpret in a systematic fashion

I'm revisiting the discussion in #101 and now I'm not sure that I'd like to keep the decision tree :laughing: But assuming we do, I think I'd rather look at why `--stabilize` is used, since it only seems to add a few more rejection criteria. Personally, I'm not sure it would merit its own tree... I'd like to see it instead incorporated into an improved TEDPCA decision tree. Here, of course, the question is how to test those decisions :)
Re `--sourceTEs`, my understanding is something like this:
This brings up the question of why you'd be interested in dimensions shared across some echos and not others. I've never used this option myself, so I'd have to think on it a bit longer, but not sure if someone who has experience with the option (or a better intuition for it) wants to chime in !
@dowdlelt Yes, I think that they operate like that. I believe that increasing `kdaw` (`rdaw`) should decrease the Kappa (Rho) threshold, which in turn will increase the number of significant components from the PCA to keep for ICA. This is based on the following:
First, the three threshold sources (`fmin`, the Kappa elbow, and `fmid`) are sorted in ascending order, so the weight for averaging from `kdaw` corresponds to the smallest of the three. If `kdaw` is greater than 1, then the smallest value will have the highest weight, leading to a lower Kappa threshold.
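To illustrate the weighting logic as I understand it, here is a toy reconstruction (the function name, candidate values, and weight placement are my assumptions, not the actual tedana code):

```python
import numpy as np

def kappa_threshold(fmin, kappa_elbow, fmid, kdaw):
    """Toy version of the thresholding described above (not the real code):
    sort the three candidate thresholds ascending and weight the smallest
    one by kdaw when averaging."""
    lo, mid, hi = sorted([fmin, kappa_elbow, fmid])
    return np.average([lo, mid, hi], weights=[kdaw, 1.0, 1.0])

# With kdaw > 1, the smallest candidate dominates, pulling the threshold
# down (so more components survive into the ICA).
print(kappa_threshold(2.0, 20.0, 50.0, kdaw=1.0))   # plain mean: 24.0
print(kappa_threshold(2.0, 20.0, 50.0, kdaw=10.0))  # 7.5
```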
Of course, -1 is a special value for both `kdaw` and `rdaw`, but I don't quite understand the math that's used in that case. I also noticed that using negative values for weights in `numpy.average` gives nonsensical results (as discussed here), so we should probably check the argument range in the function.
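A quick demonstration of the negative-weights issue with `numpy.average`:

```python
import numpy as np

# Negative weights are accepted silently, but they can push the "average"
# outside the range of the data, which is why checking the argument range
# (e.g., requiring non-negative weights, with -1 special-cased upstream)
# matters.
val = np.average([1.0, 2.0, 3.0], weights=[-1.0, 1.0, 1.0])
print(val)  # 4.0 -- larger than every input value
```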
Perhaps just collecting `kdaw` and `rdaw` in a decision tree-specific subgroup is enough.
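For what it's worth, the subgroup idea could look something like this with `argparse` (defaults here are the 10 and 1 mentioned elsewhere in the thread; the real tedana CLI may organize this differently):

```python
import argparse

# Sketch: collect the decision-tree-only parameters in their own group so
# the --help output makes their limited scope clear.
parser = argparse.ArgumentParser()
tree_group = parser.add_argument_group(
    'decision-tree options',
    'Only used with the decision tree (kundu) component selection.')
tree_group.add_argument('--kdaw', type=float, default=10.0,
                        help='Dimensionality augmentation weight for Kappa.')
tree_group.add_argument('--rdaw', type=float, default=1.0,
                        help='Dimensionality augmentation weight for Rho.')

args = parser.parse_args([])
print(args.kdaw, args.rdaw)
```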
@emdupre
- … `tedana` as the development sandbox we want it to be. Besides, in the (albeit limited) test datasets we're using, MLE doesn't do anything. The original argument in #101 is that the decision tree is returning too many components, but I'm seeing the opposite.
- … `tedana-comparison` repository (possibly associated with a paper corresponding to a major release of `tedana`) to drop the decision tree, since the decision tree is at least citable, unlike v3.2 of TEDICA.
- … `stabilize` option.
- `sourceTEs` is going over my head a bit. Do you know what one might get from using the concatenated data vs. the optimally combined data there (totally ignoring the possibility of selecting a subset of the echoes)?

I think that everything discussed here so far has been handled by #163 and #164. The only other thing I was thinking we could work on is simplifying how manually accepted components are handled. For starters, I propose that we could move `--manacc`, `--ctab`, and `--mix` into a subgroup like "Arguments for re-running tedana" or something.
A more aggressive change (that I'm somewhat in favor of) would be to drop `--ctab` and `--mix`. Since the output files follow a naming convention, we know what the component table and mixing matrix files will be named when someone runs `tedana` twice on the same data. In cases where a user provides `--manacc`, `tedana` could just grab the component table and mixing matrix from the output directory, as long as they exist. I guess it just depends on what workflow people prefer. Would they rather run `tedana` on their input data with a label like "initial", manually select components, and then re-run with a label like "final" (which will generate some, but not all, of the `tedana` outputs), or would they rather just run `tedana` twice with the same label? We can always add relevant information to the logger to make it clear that files are being re-generated, so I don't see much of a cost to the latter approach.
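A sketch of the "grab from the output directory" idea; the helper and the file names below are placeholders I made up, not tedana's actual naming convention:

```python
from pathlib import Path

def find_rerun_inputs(out_dir, ctab_name='comp_table_ica.txt',
                      mix_name='meica_mix.1D'):
    """Hypothetical helper: if a prior run's component table and mixing
    matrix exist in out_dir, return their paths so that --ctab and --mix
    would not need to be passed explicitly alongside --manacc."""
    out_dir = Path(out_dir)
    ctab, mix = out_dir / ctab_name, out_dir / mix_name
    if ctab.exists() and mix.exists():
        return ctab, mix
    raise FileNotFoundError(
        'No prior outputs found in {}; run the workflow once first'.format(out_dir))
```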
Hi all, trying to track some of the changes made between the BitBucket 3.2.2 version and this one. Is the current version robust to `kdaw` and `rdaw` values? And if not, see https://github.com/ME-ICA/me-ica/issues/10 -- at some point the default `kdaw` and `rdaw` were changed from there to here. (Seemed like the best place to put this comment -- will move if wrong.)
Thanks for confirming ! We've removed the `kdaw` and `rdaw` flags in https://github.com/ME-ICA/tedana/commit/69624dc91416f4046fbc1f34eff89d00faa0c243, in line with our Transparent and Reproducible Processing milestone (since it was difficult for users to form opinions on how / when to adjust those values).
The default values of 10 and 1 for `kdaw` and `rdaw`, respectively, are now hardcoded here when users use the `kundu` option for `--tedpca` (which is no longer the default). Those values appear to be the same as the ones used in the most recent version of ME-ICA, which was updated about one year ago.
Our rationale for hardcoding these values is, as @emdupre said, that they're difficult for users to understand and that we've changed the default `tedpca` method to MLE dimensionality estimation. That said, power users could go into the code and edit these values as needed.
Ah, I see, the values I have are from a different branch than master, v3.2 seen here. Has the project decided to retain only behavior from v3? I realize that the project has decided to not retain backwards compatibility for ME-ICA, but I thought it would be good to make a record of why this divergence exists.
We don't support v3, although we may merge that back in in the future (as one option, not as the only method). We have shifted to only supporting v2.5 of the component selection algorithm. We have that information, and why we shifted, here in the FAQ section of the documentation site, although it's a fairly new addition.
Okay, I see, sorry about all that.
Thanks for checking on it ! It's good to have more people thinking through these choices, too :)
I think that we can close this issue now. We may ultimately want to discuss removing or reorganizing `--manacc`, `--ctab`, and `--mix`, but I think that would be better dealt with in a separate issue in order to prevent bloat.
In addition to the arguments we're considering modifying or adding in #161, I think we should audit several of the other arguments in the workflow.
FastICA parameters
Now that we're using `sklearn` for the ICA, we get a warning in cases of non-convergence instead of an error. Overall, I think this is a good thing, but it also obscures potential problems. The two arguments to `sklearn.decomposition.FastICA` that are relevant are `max_iter` (maximum number of iterations) and `tol` (convergence tolerance). The default for `tol` is 0.0001, while our default is 0.000025, set by the `conv` argument. My questions here are: is `conv` something anyone ever plays with (i.e., do we actually need an argument), and is there a reason to use the much stricter default threshold we have over the default from `sklearn`? If not, then I propose that we drop the `conv` argument from `tedana` and let `sklearn` use its default in `tedica`.

On a related note, the default for the maximum number of iterations is 200, while `mdp`'s was 5000, IIRC. The problem is that `FastICA` is failing to converge for both of our test datasets (Cornell 3-echo rest and NIH 5-echo task). I think we should probably increase `max_iter` when we call `FastICA` to at least 1000, which will probably increase the amount of time `tedica` takes, but will probably also still be faster than `mdp`.
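For reference, a minimal sketch of calling `FastICA` with a raised `max_iter` and the stricter tolerance mentioned above (the toy data and the warning-capture pattern are mine, not tedana's actual code):

```python
import warnings

import numpy as np
from sklearn.decomposition import FastICA
from sklearn.exceptions import ConvergenceWarning

# Toy data: 3 non-Gaussian sources mixed into 6 observed channels.
rng = np.random.RandomState(0)
sources = rng.laplace(size=(1000, 3))
mixed = sources @ rng.standard_normal((3, 6))

# Stricter tolerance and higher iteration cap than sklearn's defaults
# (tol=1e-4, max_iter=200), matching the values discussed above.
ica = FastICA(n_components=3, max_iter=1000, tol=2.5e-5, random_state=0)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    recovered = ica.fit_transform(mixed)

# sklearn only warns (ConvergenceWarning) on non-convergence, so we have
# to check for it explicitly instead of relying on an exception.
converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)
print(recovered.shape, converged)
```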
Other workflow parameters
I think that `--filecsdata` and `--strict` are currently unused. Unless we plan to re-implement them, I think we can drop them.

Does anyone use `--sourceTEs`, `--denoiseTEs`, or `--stabilize`?

- … `sourceTEs` or `denoiseTEs`, but neither seems very useful, unless I'm misunderstanding them.
- `sourceTEs`: While I maybe understand switching between using `catd` vs `optcom` for fitting models (though I've never tried the former), if we wanted to fit to just some of the echoes, why not just feed those echoes into the workflow?
- `denoiseTEs`: This just determines whether additional files are written out. Unless they're regularly used, I propose that we incorporate this switch into a general `verbose` argument, as we've been discussing in #17.
- … `stabilize` won't have an impact on the default pipeline. I also don't know if it has any meaningful impact on the results in most cases when the decision tree is used. I propose that we simply treat it as a variant of the decision tree. So, when we do split the decision tree from MLEPCA, we will have some argument like `--tedpca` that can take values of `mle`, `kundu`, and `kundu-stabilize`, with `mle` being the default. Does that sound good?
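As a sketch, the proposed `--tedpca` argument might look like this (the choice names come from the proposal above; the actual implementation may differ):

```python
import argparse

# Sketch of the proposed --tedpca argument, which would fold the old
# --stabilize flag in as a decision-tree variant.
parser = argparse.ArgumentParser()
parser.add_argument('--tedpca',
                    choices=['mle', 'kundu', 'kundu-stabilize'],
                    default='mle',
                    help='Method for PCA-based dimensionality reduction. '
                         'kundu-stabilize is the decision tree with the '
                         'extra stabilization criteria.')

args = parser.parse_args(['--tedpca', 'kundu-stabilize'])
print(args.tedpca)
```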