Changes to --tedpca option

Lestropie commented 3 months ago

Closes #956.

Related to #1013.

@bahmantahayori I hijacked your fork rather than creating another one; hope you don't mind.

The primary goal here is to permit the use of "--tedpca 1.0" to preclude the PCA from removing any components, rather than using the unclean and potentially threshold-sensitive eg. "--tecpca 0.999". I tried to fix up a few related things as I was doing that.

The (now deleted) line: if floatarg != int(floatarg): yields False for input "1.0", which resulted in interpretation as an integer despite very explicitly being a float.
Documentation and code was inconsistent regarding whether an integer of 1 was or was not a permissible value. I've tried to make selection of a single component permissible; it's hard to justify permitting 2 but not 1. While there's the issue with interpretation of "1" vs. "1.0", I'm hoping that the tweaks to help page and docstrings will help convey the distinction. I didn't want to make the help string even longer than it already is, so focused instead on upfront assertion of different interpretation based on type.
The input to function check_tedpca_value() was called "string", even though that function is (and is clearly designed to be) invoked outside of a command line parsing context where the type of that argument may not be a str.

To progress from draft:

[ ] Check empirical behaviour when --tedpca 1.0 is specified Ideally what I think should happen is that all non-empty components should be selected. @bahmantahayori: you indicated that with an upstream PCA denoising, TEDANA's PCA can yield zero-filled components. Can you confirm with real data that the behaviour here will be as intended; ie. those are excluded rather than "100% of variance" being interpreted as "keep every single component regardless of content"? I can sit down with you and refine code if it turns out to be necessary.

BahmanTahayori commented 3 months ago

Thanks @Lestropie. No problem. I will check the behaviour whentedpca=1.0 and will keep you posted.

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 89.80%. Comparing base (41dafe5) to head (f5a3683). Report is 7 commits behind head on main.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #1066 +/- ## ========================================== + Coverage 89.76% 89.80% +0.04% ========================================== Files 26 26 Lines 3506 3520 +14 Branches 620 621 +1 ========================================== + Hits 3147 3161 +14 Misses 211 211 Partials 148 148 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

handwerkerd commented 2 months ago

The code looks fine to me. I don't love that 1.0 (float) will output 100% variance and 1 (int) will output 1 component, but the documentation is clear and anyone who enters 1 should realize fairly quickly that's not what they want.

My only code critique is that check_tedpca_value is confusing, but you just expanded our already confusing structure. It's not necessary, but it would a bit cleaner if, after checking against valid_options it checked if valid was a string, then checked if it the string could be converted into an int or float and could then call check_float or check_int only once.

Once @BahmanTahayori confirms this works on your data, make it ready for review and I'll be fine approving.

Lestropie commented 2 months ago

I agree that the --tedpca 1 vs. --tedpca 1.0 behaviour is far from ideal. Having considered a number of alternatives I think it's the least bad of the lot. But I'm not wed to it, and am happy to take on a different interface change if there's a superior alternative.

One aspect I've thought about in this regard is having the user feedback on what the PCA is doing be a bit more upfront. I don't get my hands dirty with TEDANA myself---I'm rather just supporting @BahmanTahayori's changes---so am not familiar with the user feedback, but in my own software, wherever I see a reasonable risk of a user input being interpreted in a manner different to what they intended, I'd escalate it to a console terminal notification. This would reduce the risk of such an error being lost in lengthy log files that not all users would diligently check.

Regarding check_tedpca_value(), it's possible that the code structure is being misinterpreted. When you say:

after checking against valid_options it checked if valid was a string

, my code is already doing just the opposite of this: if it's already an integer or a float, then it does the native checking of the value, otherwise it proceeds as though it is a string. Either way, the relevant check_*() function is only ever called once per invocation of check_tedpca_value(). My best guess is that you have misinterpreted that one or both of these functions is being invoked twice, when in fact it's just that those check_*() functions are defined within the scope of the function. For any call to check_tedpca_value(), only one check_*() should be invoked, only once. So when you request:

... then checked if it the string could be converted into an int or float and could then call check_float or check_int only once.

, I think that should already be satisfied? There's just a subtle difference in that the string->float and string->integer conversions are done separately, to avoid one of the prior ambiguous behaviours.

handwerkerd commented 2 months ago

Every time I look at the code, I think of ways to do things better, but I think this PR is fine. What's happening is logged fairly clearing in the report.txt file, but is less clear in the terminal output log. The main relevant line is https://github.com/BahmanTahayori/tedana_florey/blob/55aee076ea711de86990e8a4cc09bc87c46c0eae/tedana/decomposition/pca.py#L203-L206 It will just output the number. If you wanted to distinguish between % variance explained vs number of components or add a warning if someone requests 1 component, that's where it would be added.

Otherwise, if @BahmanTahayori is fine with how this is running on your data, mark it as "Ready for review" and I'll approve.

Lestropie commented 2 months ago

The PR as it currently stands, as it turns out, does not provide the desired result.

Error message as reported by @BahmanTahayori:

File "/.../tedana/decomposition/pca.py", line 318, in tedpca
    ppca.fit(data_z)
  File "/.../lib/python3.10/site-packages/sklearn/base.py", line 1467, in wrapper
    estimator._validate_params()
  File "/.../lib/python3.10/site-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/.../lib/python3.10/site-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'n_components' parameter of PCA must be an int in the range [0, inf), a float in the range (0.0, 1.0), a str among {'mle'} or None. Got 1.0 instead.

So the downstream requirement is specifically (0.0, 1.0) in being non-inclusive of 1.0.

We would nevertheless like for there to be a convenient mechanism by which to instruct the PCA to retain all components.

Also, I had misunderstood what was happening in the TEDANA PCA when a prior PCA denoising step had already been applied: the later components can be very close to zero power, but aren't actually exactly zero. So my desire to "select all components for which the variance explained is non-zero" doesn't actually make sense.

Here are the possible alternatives that have come to mind (not necessarily mutually exclusive):

If the user specifies --tedpca 1.0, then the value provided to the PCA for variance explained will be modified to instead be (1.0-epsilon). This should always produce the desired result of yielding all components. It may however look unusual in eg. message logs to see that the requested variance is something like 0.999999239846013285.
If the user specifies --tedpca 0, then this should be interpreted as using a number of components equal to the number of input volumes. There is some precedent in software for an argument having some alternative behaviour when a value of 0 is specified instead of a positive integer; but there may nevertheless be some dislike for the opposite difference in behaviour between --tedpca 0 and --tedpca 1. Indeed the error message quoted above indicates that an integer value of 0 is an acceptable input, in which case it might already achieve this; @BahmanTahayori can you check what behaviour that produces?
If the user wishes for the PCA to behave in this way, then they must themselves find out the number of volumes in the dataset, and then provide this as the input to --tedpca.
Alongside the existing string options ("mdl", "aic", "kic", "kundu", "kundu-stabilize"), add a new option, maybe called "all", which simply preserves all components.

I believe that my own preference would be:

If it's the case that --tedpca 0 results in preservation of all components:
- Make it possible for TEDANA to feed that input from the user to the PCA.
If not:
- Adopt 4.
- Keep some of the changes currently in this PR in terms of interpretation of int vs. float inputs (see eg. first bullet point in OP)
- Have TEDANA immediately error out if 100% of variance is requested, since it's known that the third-party PCA function will reject it; but state in that error message that --tedpca all can instead be used to achieve that purpose.

handwerkerd commented 2 months ago

I don't think your idea for --tedpca 0 or --tedpca all would work. If there are 200 volumes and you send 200 components to the ICA step, it would never converge. Particularly if you're already done a denoising step, you'll effectively have a bunch of PCA components with almost 0 variance and ICA won't be able to reliably transform them into 200 independent components. (I'm not actually trying this, so, if I'm wrong, let me know)

My understanding is the whole reason for this proposed change is, if there are 200 volumes, after denoising, it should be possible to model 100% of the variance using only X PCA components where X<<200. As such, to replicate the functionality you're going for here, your option 1 would be the best approach, but I'm not sure we need a special case to convert 1.0 to 1-epsilon The current code allows a user to input --tedpca 0.9999 which would both be clear from the command line and clear in the log. I think this is what @BahmanTahayori is currently doing. It's not ideal, but I'm not sure if there's a better option.

Thoughts?

Lestropie commented 2 months ago

Splitting PR intended software changes from intended scientific usage of such:

You're probably right in that the raison d'etre of the proposal needs to be re-thought. If it's not the case that some number of components are actually zero due to an upstream denoising step, then we probably need to start fresh on the question "if upstream denoising is performed, is one of the existing PCA rank selection mechanisms appropriate, or would some other heuristic be more appropriate". @BahmanTahayori we might need to generate data on what the various rank selection processes look like for data with & without upstream denoising. Also from our last discussion I was under the impression that you were preserving all PCA components in your own analyses, and ICA was still doing something sensible, which would be contrary to @handwerkerd's intuition. Though I could somewhat naively hypothesize that preservation of a large number of low-variance components might lead to a decrease in IQ.
Regarding the option parsing itself, I do think there's justification for some changes here, but IMO it'd be best done from scratch in a second PR. Specifically, if shooting for consistency with the intended data being passed to sklearn without any nuanced TEDANA handling:
1. --tedpca 1.0 should produce an error stating that 100% variance is not a permissible selection (as opposed to current behaviour where it's interpreted as an int)
2. If a value of (int)0 is reasonably interpreted by the PCA function, then TEDANA should permit specification of such.

Lestropie commented 2 months ago

Have had a bit more discussion with @BahmanTahayori today. I think we've got a plan for how to progress.

I will close this PR, and replace it with an alternative that does not propose any enhancement of functionality, it just resolves ambiguities with the existing code / documentation around the --redpca option. I will make --tedpca 1.0 immediately return an error given that the PCA function would itself reject that value. I will ensure that --tedpca 0 remains rejected, since that appears to yield 0 components and is therefore not a sensible use case.
@BahmanTahayori will create a PR, separate to #1013, where we propose the addition of a new PCA rank estimation heuristic to live alongside the existing string options. This will simply choose the number of components as the maximum: n_components == min(n_samples, n_features) - 1 This was expected to be the behaviour of the PCA function when n_components == None, but @BahmanTahayori has observed strange circumstances where this wasn't the case. We have observed exceptionally poor convergence of FastICA and poor results yielded when the number of ICA components is equal to the number of volumes. However, simply decreasing the number of components by 1 results in a drastic shift in ICA behaviour to being more like that when a much smaller number of components is used. This is entirely consistent with documentation on ICA usage. So we are hoping that by adding a PCA rank heuristic that simply sets the number of components to be one less than the number of volumes before ICA is run, concerns around poor ICA performance won't manifest; but are obviously open to evidence to the contrary. We will also there present evidence on suitability of different rank estimation heuristics to different data depending on how they have been pre-processed.
The RobustICA addition in #1013 should be able to proceed independently of both of the PRs described above. While it may turn out to be the case that our own usage advice would be to use RobustICA in conjunction with the rank estimation method described in point 2, we believe it should be possible to assess these two proposals independently.

handwerkerd commented 2 months ago

I'm a bit confused on wanting n_components == min(n_samples, n_features) - 1 Whenever I've ended up with that ICA has serious trouble converging since there are too many components. Even in testing #1013, when I inputted a very large number of components, many iterations of FastICA failed to converge and the robust ICA output looked wrong. I think this is because it was trying to parse the variance into too many components and the number of result components that were stable was much lower. Perhaps this works after the preceeding denoising step you're using?

Lestropie commented 2 months ago

The upstream denoising step has a pretty drastic effect on the PCA decomposition. So it's quite probable that there's a concomitant effect on ICA efficacy with large rank. Can have a more informed discussion once @BahmanTahayori presents some relevant data.

BahmanTahayori commented 2 months ago

I have analysed over 240 subjects from our centre with MPPCA applied beforehand and when n_components = n_vols-1 there is no issue with ICA convergence. Over the weekend, I analysed a few subjects without MPPCA applied beforehand with n_components = n_vols-1 and I did not observe an issue with convergence. Only for one subject, the seed was updated a few times, but it converged at the end with reasonable results when I looked at the accepted/rejected components. I can look at more subjects to see how it behaves. This is the results of our dataset preprocessed by fMRIPrep. I cannot generalise it, I can try it with other datasets and come up with a more rigorous statement. Will keep you posted.

Lestropie commented 2 months ago

On looking at the code again, I decided to refine this PR rather than creating a new one. The intent is nevertheless consistent with point 1 from comment above: to fix inconsistencies between documentation and implementation, and to ensure that --tedpca 1.0 no longer gets erroneously interpreted as requesting 1 component.

Lestropie commented 2 months ago

PR is complete and ready for review. Content if you would prefer to squash commits.

ME-ICA / tedana

Changes to --tedpca option #1066

Codecov Report