0todd0000 / spm1d

One-Dimensional Statistical Parametric Mapping in Python
GNU General Public License v3.0

SPM1D Error Zero variance detected #174

Closed bpuladi closed 2 years ago

bpuladi commented 3 years ago

First of all many thanks for this great Python and MATLAB package! Really first class!

While using both the MATLAB and Python versions, I noticed that regression, and also some other statistical methods, raise an error in the Python version only: SPM1DError: Zero variance detected

This is not the case in the MATLAB version.

bpuladi commented 3 years ago

Addendum: In https://github.com/0todd0000/spm1d/blob/master/spm1d/stats/_datachecks.py I commented out line 144, and it now works as in MATLAB. Do nodes with zero variance matter at all in linear regression?

0todd0000 commented 3 years ago

Hello, thank you for reporting this issue. That message will appear if all values at a single time point are the same, usually when all values are zeros. For example, in Python this would generate that error:

import numpy as np
import spm1d

y       = np.random.randn(8, 101)
y[:,50] = 1   # this makes the 51st node have zero variance
t       = spm1d.stats.ttest(y)   # raises SPM1DError: Zero variance detected

Thorough data checks, including zero-variance checks, are currently implemented only in the Python version. If you use the same data in the MATLAB version, this error won't be generated, but the test statistic will be undefined (probably with "nan" values) at the zero-variance locations.
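
The offending nodes can be located before calling spm1d by checking, for each node, whether all observations share the same value. The following is a minimal NumPy sketch, not spm1d's internal check; the helper name `constant_nodes` is hypothetical.

```python
import numpy as np

def constant_nodes(y):
    """Return indices of nodes (columns) whose values are identical across observations."""
    y = np.asarray(y, dtype=float)
    # A column has zero variance exactly when its maximum equals its minimum.
    return np.flatnonzero(y.max(axis=0) == y.min(axis=0))

rng = np.random.default_rng(0)
y = rng.standard_normal((8, 101))
y[:, 50] = 1                      # the 51st node is constant across observations
bad = constant_nodes(y)
print(bad)
```

Comparing the maximum with the minimum avoids the rounding error that a computed variance can pick up when a constant column holds a value that is not exactly representable.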

0todd0000 commented 3 years ago

Just saw your addendum:

> Do nodes with zero variance matter at all in linear regression?

Yes, this will generate the same error:

import numpy as np
import spm1d

x       = np.random.randn(8)
y       = np.random.randn(8, 101)
y[:,50] = 1   # this makes the 51st node have zero variance
t       = spm1d.stats.regress(y, x)   # raises SPM1DError: Zero variance detected

bpuladi commented 3 years ago

Thanks for your quick reply! My first solution was to remove the nodes that had a zero variance from the raw data. However, the problem is that this shifted all the spm1d results and the clusters no longer match the raw data. It would be nice if checking for zero variance could be optional.
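
The index shift described here can be undone by keeping the original indices of the retained nodes. This is only a bookkeeping sketch on synthetic data, with a hypothetical cluster extent; it does not address whether deleting nodes is statistically appropriate.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.standard_normal((8, 101))
y[:, 50] = 1                                   # one constant node

# Original indices of the nodes that are kept (non-constant columns)
keep = np.flatnonzero(y.max(axis=0) != y.min(axis=0))
y_reduced = y[:, keep]                         # constant node removed

# Hypothetical cluster found on the reduced data, spanning reduced nodes 48..52
reduced_start, reduced_end = 48, 52

# Map the cluster endpoints back to node indices in the raw data
orig_start, orig_end = int(keep[reduced_start]), int(keep[reduced_end])
print(orig_start, orig_end)
```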

0todd0000 commented 3 years ago

Zero-variance nodes can cause a variety of problems, especially if there are lots of them.

If there are many zero variance nodes, I am concerned that spm1d might actually not be suitable for analyzing your data...

Could you share the data you are analyzing? If you are unable to share a data file, can you attach a figure or two which depict the data and the region(s) of zero variance?

0todd0000 commented 3 years ago

(The following discussion may not be related directly to your data, but should be relevant in general to this zero-variance issue.)

Consider the ground reaction force (GRF) for one foot over the full gait cycle. GRF is greater than zero only when the foot is in contact with the ground, so for a large portion of the gait cycle GRF is zero by physical definition. It is not scientifically meaningful to ask questions about GRF during the non-contact phase, because by physical definition there can be no experimental effects during non-contact. By extension, statistical tests should not be conducted on GRF during this non-contact phase. Non-contact GRF regions must be excluded from analyses.

To exclude these zero-variance regions from analysis, there are two basic options:

  1. Segment the data: temporally segment the data at the stance endpoints, as defined for example by the instants where GRF exceeds some small threshold like 20 N. There are a variety of other, more complex segmentation algorithms in the literature.
  2. Define a region of interest (ROI): see this article which explains how to exclude non-ROI regions from analysis.

One must choose either (1) or (2), because including these data will corrupt a variety of SPM quantities, especially temporal length and temporal smoothness.

So please exclude all GRF=0 regions (and all zero variance regions) from analysis.
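
Option (1) can be sketched as follows for a single observation, assuming a simple threshold rule with the 20 N value mentioned above. The function name `stance_segment` and the synthetic trial are hypothetical, and real data may need a more robust algorithm.

```python
import numpy as np

def stance_segment(grf, threshold=20.0):
    """Return one GRF trajectory cropped from its first to its last sample above threshold."""
    grf = np.asarray(grf, dtype=float)
    above = np.flatnonzero(grf > threshold)
    if above.size == 0:
        raise ValueError("GRF never exceeds the threshold")
    return grf[above[0]:above[-1] + 1]

# Synthetic trial: 30 non-contact samples, 60 stance samples, 30 non-contact samples
stance = 800.0 * np.sin(np.linspace(0, np.pi, 62)[1:-1])
trial = np.concatenate([np.zeros(30), stance, np.zeros(30)])

seg = stance_segment(trial)
print(len(trial), len(seg))
```

Applied to each observation separately, this yields segments of different lengths, which is why registration is needed before analysis.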

bpuladi commented 3 years ago

Thank you very much!

I have now proceeded as follows. First I checked for each node if zero variance was not present. Next I added a mean threshold of 5 N to select only nodes with a minimum activation. Afterwards I used AND to combine both Boolean arrays and passed them as ROI. It must be said that this does not work without deactivating the fixed zero variance check, because the zero variance check is also done outside the ROI.
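
The procedure described above might look roughly like this, using synthetic data in place of the real GRF measurements. The variable names are illustrative, and the commented-out call assumes the installed spm1d version accepts the `roi` keyword.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for registered GRF data: 8 observations x 101 nodes
y = np.zeros((8, 101))
y[:, 10:70] = 500.0 + 50.0 * rng.standard_normal((8, 60))   # loaded region
y[:, 70:75] = 2.0 + 0.1 * rng.standard_normal((8, 5))       # noisy region below 5 N

has_variance = y.max(axis=0) != y.min(axis=0)   # True where values differ across observations
is_active = y.mean(axis=0) > 5.0                # True where the mean exceeds 5 N
roi = np.logical_and(has_variance, is_active)   # Boolean ROI, one entry per node

print(roi.sum(), np.flatnonzero(roi)[[0, -1]])

# Assumed usage (not executed here):
# t = spm1d.stats.regress(y, x, roi=roi)
```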

bpuladi commented 3 years ago

I still have one question, in the booklet it is mentioned that the SPM.r can also be reported. Would you also include this in the values per node or as an absolute value?

0todd0000 commented 3 years ago

> I have now proceeded as follows. First I checked for each node if zero variance was not present. Next I added a mean threshold of 5 N to select only nodes with a minimum activation. Afterwards I used AND to combine both Boolean arrays and passed them as ROI. It must be said that this does not work without deactivating the fixed zero variance check, because the zero variance check is also done outside the ROI.

Thank you for reporting the zero variance check. Relevant to this point, and this entire thread: spm1d.stats procedures require that all data are both (i) segmented and (ii) registered prior to analysis. The Python checks for zero variance are meant to alert the user to potentially inappropriate segmentation, but this is just a convenience check for users. Appropriate segmentation is a prerequisite for analysis.

For your dataset it sounds like segmentation was conducted, but that it extracted full gait cycles rather than just stance phase (or another sub-phase). If this is true, it would imply that the applied segmentation procedure is inappropriate, and by extension that registration may also be inappropriate. The ROI procedure you described is like a pseudo-segmentation in that it operates on pre-registered data and applies a constant segment to all observations; segmentation should actually operate on individual observations.

Overall this may or may not affect the final results. Regardless, when using spm1d.stats in the future, please ensure that all observations are (i) segmented, then (ii) registered, then (iii) analyzed.
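
Step (ii) can be illustrated with linear time normalization, which resamples each already-segmented observation onto a common number of nodes. This is a minimal sketch of the simplest form of registration; `register_linear` is a hypothetical helper and the segments are synthetic.

```python
import numpy as np

def register_linear(segments, n_nodes=101):
    """Resample variable-length 1D segments onto n_nodes equally spaced points (0 to 100% time)."""
    q = np.linspace(0.0, 1.0, n_nodes)
    rows = []
    for seg in segments:
        seg = np.asarray(seg, dtype=float)
        t = np.linspace(0.0, 1.0, seg.size)     # normalized time for this segment
        rows.append(np.interp(q, t, seg))       # linear interpolation onto the common grid
    return np.vstack(rows)

# Three already-segmented observations of different lengths (synthetic)
segments = [np.sin(np.linspace(0, np.pi, n)) for n in (57, 60, 64)]
y = register_linear(segments)
print(y.shape)
```

The resulting array has one row per observation and one column per node, which is the layout used in the `ttest` and `regress` examples earlier in this thread.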



> I still have one question, in the booklet it is mentioned that the SPM.r can also be reported. Would you also include this in the values per node or as an absolute value?

It may be better to report the values per node. Positive and negative r values indicate positive and negative correlation, respectively, so using the absolute value will hide the correlation direction. In my opinion the t statistic (SPM.z) is more useful to report because its range is (-inf, +inf), so large values are easy to perceive. The r statistic follows the identical pattern to the t statistic, but is compressed to the range [-1, +1], and it is generally difficult to perceive systematic changes in r when |r| > 0.9. A second problem with r is that conventional correlation strength interpretations (e.g. r=0.8 implies "strong correlation") are inappropriate for 1D data.
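
The compression can be seen from the standard relation between r and t in simple linear regression, t = r * sqrt((n - 2) / (1 - r^2)). The sketch below assumes this textbook relation holds at each node; the sample size n = 10 is arbitrary.

```python
import numpy as np

def r_to_t(r, n):
    """Convert a correlation coefficient to the equivalent t statistic (df = n - 2)."""
    r = np.asarray(r, dtype=float)
    return r * np.sqrt((n - 2) / (1.0 - r**2))

r = np.array([0.5, 0.9, 0.95, 0.99])
t = r_to_t(r, n=10)
for ri, ti in zip(r, t):
    print(f"r = {ri:.2f}  ->  t = {ti:.2f}")
```

Small changes in r near 1 correspond to large changes in t, which is why systematic changes are easier to see in the t trajectory.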

bpuladi commented 3 years ago

Thank you very much for the feedback. :-)

The data were segmented by gait cycle per leg, registered using interpolation, and then analyzed by regression within a defined ROI (mean > 5 N and nonzero variance). I have also read your publication on ROIs: Pataky, Todd C.; Robinson, Mark A.; Vanrenterghem, Jos (2016): Region-of-interest analyses of one-dimensional biomechanical trajectories: bridging 0D and 1D theory, augmenting statistical power. In: PeerJ 4, e2652. DOI: 10.7717/peerj.2652.

However, it is not clear to me why only the segmentation of subphases is allowed and not the segmentation of whole gait cycles by means of the mentioned segmentation procedure.

0todd0000 commented 3 years ago

> The data was segmented based on gait cycle

This segmentation is appropriate only for variables / effects which are defined for the entire gait cycle. Variables like joint angles are fine because they are non-null for the whole gait cycle. Variables like GRF are not OK because they are null for a substantial portion of the gait cycle.

> However, it is not clear to me why only the segmentation of subphases is allowed and not the segmentation of whole gait cycles by means of the mentioned segmentation procedure.

Because this segmentation procedure results in a discontinuous domain, or equivalently: a piecewise continuous domain. In this case "domain" = time, and "piecewise continuous domain" implies that a given variable (e.g. GRF) varies continuously across specific subdomains, but not across the entire domain. Within domain discontinuities, effects are undefined, so it is not possible to experimentally assess effects within discontinuities.

spm1d handles domain discontinuities only through manual ROI definition. Other procedures (e.g. from the FDA literature) are designed to algorithmically handle domain discontinuities.