0todd0000 / spm1d

One-Dimensional Statistical Parametric Mapping in Python
GNU General Public License v3.0

Alternative(s) to multiple linear regression #81

Closed (depierie closed this issue 5 years ago)

depierie commented 6 years ago

Hi Mark, Todd, and Jos,

I wanted to ask your opinion on the best statistical test to use for my data within spm1d. I'm analyzing the magnitude of the hip contact forces (so one dependent variable) for a cohort of 150 patients, whom we would like to stratify by different parameters (age, BMI, and gait speed: all numerical, not categorical). Initially I wanted to use a three-way ANOVA, but realized that the data did not fit into a balanced design, and I saw that the multiple regression option has not been implemented yet within spm1d.

I guess the simplest solution would be to separate the analysis for the three parameters (either ANOVAs or, even better, simple regressions) and adjust the significance level with a correction factor for multiple testing, but I don't know how legitimate this approach would be. I wouldn't get any information on interactions among the three parameters, but I would still get an estimate of the global effect of each parameter on the contact forces. Additionally, I feel this would be the most intuitive way to present the results. What's your opinion about the validity of this approach? Or would you have any other suggestion?

Thank you very much for the help! Enrico

0todd0000 commented 6 years ago

Hi Enrico,

One approach would be to use random subsets to force ANOVA balance. You could randomly remove some subjects so that there is balance, run ANOVA, then repeat for many different random subsets. Provided the results don't change qualitatively, I think it would be fine to report all F statistics, or even just one representative result, and then describe the random-subset approach in the text. However, since the variables are continuous, it would probably be best to use ANOVA only if there are clearly distinguishable groups of age, BMI and speed.
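
The random-subset idea above can be sketched as follows, assuming subjects have already been binned into discrete groups (the grouping and the `balanced_subset` helper are illustrative, not part of spm1d):

```python
import numpy as np

def balanced_subset(labels, rng):
    """Return subject indices forming a balanced subset: every group is
    randomly subsampled to the size of the smallest group."""
    labels = np.asarray(labels)
    groups = np.unique(labels)
    n_min  = min(int((labels == g).sum()) for g in groups)
    idx    = np.concatenate([
        rng.choice(np.flatnonzero(labels == g), size=n_min, replace=False)
        for g in groups
    ])
    return np.sort(idx)

# hypothetical unbalanced grouping (e.g. subjects binned by speed)
labels = np.array([0]*60 + [1]*50 + [2]*40)
rng    = np.random.default_rng(0)
idx    = balanced_subset(labels, rng)   # 40 subjects per group
# repeat with different seeds, run ANOVA on each subset, and check that
# the F statistics are qualitatively stable across subsets
```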

Another option is multiple regression, which can be implemented using a general linear model (spm1d.stats.glm), like this:

import numpy as np
import spm1d

X      = np.zeros((150, 4))  #empty design matrix
X[:,0] = 1     #intercept
X[:,1] = age   #age, bmi and speed are 150-element vectors containing floats
X[:,2] = bmi
X[:,3] = speed

c      = np.array( [0,0,0,1] )  #contrast vector (for speed)

t      = spm1d.stats.glm(Y, X, c)  #Y is a (150 x Q) data array, where Q is the number of continuum nodes
ti     = t.inference(0.05)

This will test for linear effects of speed, after accounting for linear effects of age and BMI, and also after including an intercept. Depending on the ranges of age, BMI, and speed, it is quite possible that one or more effects are better modeled as nonlinear. For an example of implementing nonlinear effects see the example file: ./spm1d/examples/stats1d/ex_glm.py
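
As a sketch of one way to include a nonlinear term (an illustration under my own assumptions, not the contents of ex_glm.py), a centered quadratic speed regressor can simply be appended as an extra design-matrix column:

```python
import numpy as np

# placeholder predictor vectors; in practice these are the measured values
rng   = np.random.default_rng(1)
age   = rng.uniform(20, 80, 150)
bmi   = rng.uniform(18, 35, 150)
speed = rng.uniform(0.8, 1.8, 150)

X      = np.zeros((150, 5))
X[:,0] = 1                           #intercept
X[:,1] = age
X[:,2] = bmi
X[:,3] = speed                       #linear speed effect
X[:,4] = (speed - speed.mean())**2   #quadratic speed effect (centered)

c = np.array([0, 0, 0, 0, 1])        #contrast for the quadratic term
# t  = spm1d.stats.glm(Y, X, c)      #as above; Y is a (150 x Q) array
# ti = t.inference(0.05)
```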

Todd

depierie commented 6 years ago

Hi Todd, Thank you so much for your quick answer!

I was looking at the GLM option before and I managed to set up the design matrix and run the test in Python, similarly to the example, but I didn't quite understand how to set up the contrasts in order to obtain information on the influence of every single independent variable.

Should I run multiple tests with the contrast vectors respectively defined as [0,1,0,0], [0,0,1,0], [0,0,0,1]? If so, would I need to apply a correction factor to alpha for multiple testing?

And how would these results differ from three simple linear regressions? Would the interaction between variables already be accounted for and "removed" from the quantification of the effect of the single variable of interest?

Thank you again! Cheers, Enrico

0todd0000 commented 6 years ago

Hi Enrico,

> I was looking at the GLM option before and I managed to set up the design matrix and run the test in Python, similarly to the example, but I didn't quite understand how to set up the contrasts in order to obtain information on the influence of every single independent variable.

spm1d.stats.glm only supports simple t contrasts like the one above. If you're interested in setting up and testing arbitrarily complex contrasts, please refer to the literature, and/or consider using software that directly supports arbitrary GLM contrasts, like SPM12 (http://www.fil.ion.ucl.ac.uk/spm/software/spm12/). Here are two useful resources discussing GLM contrasts:

http://brainvoyager.com/bvqx/doc/UsersGuide/StatisticalAnalysis/TheGeneralLinearModel.html
http://psych.colorado.edu/~carey/Courses/PSYC7291/handouts/glmtheory.pdf

> Should I run multiple tests with the contrast vectors respectively defined as [0,1,0,0], [0,0,1,0], [0,0,0,1]? If so, would I need to apply a correction factor to alpha for multiple testing?

It depends on your hypothesis or hypotheses. If you are primarily interested in speed, then you needn't test directly for the other factors, and no corrections to the critical threshold are needed. If your hypotheses are more complex than that, including, for example, no interaction amongst the factors, then you'll need to set up and test more complex contrasts.
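
For instance, if all three effects were hypothesized a priori, one conservative option (my suggestion, not an spm1d requirement) would be a Bonferroni-adjusted threshold across the three t contrasts:

```python
import numpy as np

contrasts = [np.array([0, 1, 0, 0]),   #age
             np.array([0, 0, 1, 0]),   #bmi
             np.array([0, 0, 0, 1])]   #speed

alpha_adj = 0.05 / len(contrasts)      #Bonferroni-adjusted alpha
# for c in contrasts:                  #X and Y as defined above
#     ti = spm1d.stats.glm(Y, X, c).inference(alpha_adj)
```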

> And how would these results differ from three simple linear regressions? Would the interaction between variables already be accounted for and "removed" from the quantification of the effect of the single variable of interest?

Multiple regression differs from simple regression because it models multiple linear effects simultaneously. Thus effects associated with speed, for example, are those that remain after linear effects of age and BMI are removed. Interaction analysis requires proper contrast specification.
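
A 0D (scalar) numpy sketch of this difference, with synthetic data in which only speed truly drives the response while speed itself is correlated with age (all numbers here are invented for illustration):

```python
import numpy as np

rng   = np.random.default_rng(2)
n     = 150
age   = rng.normal(50, 10, n)
speed = 2.0 - 0.01*age + rng.normal(0, 0.05, n)   #speed correlated with age
y     = 3.0*speed + rng.normal(0, 0.1, n)         #only speed affects y

# simple regression on age alone picks up a spurious (indirect) effect...
b_age_simple = np.polyfit(age, y, 1)[0]

# ...whereas multiple regression attributes the effect to speed and
# leaves essentially no residual linear age effect
X = np.column_stack([np.ones(n), age, speed])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)      #[intercept, age, speed]
```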

Todd

depierie commented 6 years ago

Hi Todd, Thanks again for the help.

The links you sent were actually quite useful for understanding the theory behind formulating contrasts, but they seem to focus more on comparing different levels within the same independent variable/regressor, while I'm still not sure how to evaluate the influence of different regressors. These slides (www.fil.ion.ucl.ac.uk/mfd_archive/2009/1stlevel.ppt) go in that direction, with definitions of t- and F-contrasts, but I still have some doubts.

Maybe I should reformulate my research question more clearly. The hypotheses we would like to test are the following:

  1. Does age have an influence (linear effect) on HCF?
  2. Does BMI have an influence (linear effect) on HCF?
  3. Does gait speed have an influence (linear effect) on HCF?

So the third hypothesis would be tested with the example you suggested and a contrast vector of [0,0,0,1]. I was wondering if I could run similar additional separate t-contrasts for age and BMI (with the contrast vectors [0,1,0,0] and [0,0,1,0] that I mentioned before). Do you think this approach would be correct or wrong? Or at least partially correct, providing a solution with some limited validity of interpretation?

I've quickly run these three separate t-contrasts for the GLM defined above in spm1d, and it shows a clear influence of speed, a small and limited influence of BMI, and no influence of age, which I believe would be very interesting results for the biomech community, if they could be reported at least in qualitative terms while stating the limitations of the assumptions made.

(Figures omitted: t-statistic continua for speed, BMI, and age.)

Thank you again! Cheers, Enrico

0todd0000 commented 6 years ago

The results look good, but it's difficult to say whether this approach is appropriate, because there are a variety of potential problems.

I'm sorry I cannot give a definitive answer, but the question spans beyond the realm of what spm1d currently supports.

With apologies, Todd

depierie commented 6 years ago

Hi Todd, Thank you for all the clarifications and input. I will try to clear my head a bit now based on all the information I've gathered and try to come up with an appropriate solution. I will keep you updated, especially in case I decide to test spm1d with some 0D sample data, as I guess this information could also be somehow useful for you for future testing of the package.

Cheers, Enrico

limbicode214 commented 4 years ago

Hi there

We are performing an SPM1D analysis (38 subjects) with lumbar curvature angle during a lifting maneuver as the dependent variable and a questionnaire score as the independent variable. Because age, gender, size, etc. might influence the outcome, we would like to include these as nuisance variables in a multiple regression model. I've adapted the model mentioned above:

X      = np.zeros((38,8))     #empty design matrix
X[:,0] = regressor.iloc[:,0]  #regressor of interest
X[:,1] = 1                    #intercept
X[:,2] = nuisance.iloc[:,0]   #age
X[:,3] = nuisance.iloc[:,1]   #male
X[:,4] = nuisance.iloc[:,2]   #female
X[:,5] = nuisance.iloc[:,3]   #weight
X[:,6] = nuisance.iloc[:,4]   #bendROM
X[:,7] = nuisance.iloc[:,5]   #ratio leg/size

c = np.array( [1,0,0,0,0,0,0,0] ) #contrast vector (for score of interest)

t = spm1d.stats.glm(Y, X, c)
ti = t.inference(0.05)
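
One detail worth flagging in the design above (my observation, not something from the thread or spm1d's documentation): with an intercept column present, separate male and female indicator columns sum to the intercept, making X rank-deficient; a single sex column avoids this. A minimal sketch with placeholder values:

```python
import numpy as np

n   = 38
rng = np.random.default_rng(3)
sex = np.tile([0, 1], n // 2)        #1 = male, 0 = female (one column only)

X      = np.zeros((n, 7))
X[:,0] = rng.normal(size=n)          #regressor of interest (placeholder)
X[:,1] = 1                           #intercept
X[:,2] = rng.normal(size=n)          #age (placeholder)
X[:,3] = sex                         #single sex dummy, not male AND female
X[:,4] = rng.normal(size=n)          #weight (placeholder)
X[:,5] = rng.normal(size=n)          #bendROM (placeholder)
X[:,6] = rng.normal(size=n)          #ratio leg/size (placeholder)

c = np.array([1, 0, 0, 0, 0, 0, 0])  #contrast for the score of interest
```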

I was wondering how the independent variables are included in the model, as I did not find any info about it. Does it use the "ENTER" method, or does the order of the independent variables have any influence (e.g. stepwise)?

Thanks for your help, michael

0todd0000 commented 4 years ago

Replied in #109