RfastOfficial / Rfast

A collection of Rfast functions for data analysis. Note 1: The vast majority of the functions accept matrices only, not data.frames. Note 2: Do not pass matrices or vectors with missing data (i.e. NAs). We do not check for them, and C++ internally transforms them into zeros (0), so you may get wrong results. Note 3: In general, make sure you give the correct input in order to get the correct output. We do no checks, and this is one of the many reasons we are fast.

lmfit cannot deal with ill-conditioned design matrices #37

Closed frederikziebell closed 3 years ago

frederikziebell commented 3 years ago

Consider the example

n <- 50
x <- seq(1,500,len=50)
X <- cbind(1,x,x^2,x^3)
colnames(X) <- c("Intercept","x","x2","x3")
beta <- matrix(c(1,1,1,1),4,1)
set.seed(1)
y <- X%*%beta+rnorm(n,sd=1)
Rfast::lmfit(X,y)

which gives an error that the system is computationally singular. It would be better to use a QR decomposition rather than directly inverting crossprod(X), i.e. to replace solve(crossprod(X), crossprod(X, y)) with solve(qr(X), y). If there is a vector of weights w, or in general a weights matrix W, one first has to convert the linear model to a homoscedastic one via an eigendecomposition, as done in MASS::lm.gls.
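A minimal sketch of the QR-based fit on the example above (the lm.fit comparison is only a sanity check, not part of Rfast; the weighted variant uses the simple row-scaling reduction for a diagonal weights vector w, which is an assumption for illustration, not the full lm.gls eigendecomposition):

```r
# Ill-conditioned cubic design from the example above
set.seed(1)
n <- 50
x <- seq(1, 500, len = n)
X <- cbind(1, x, x^2, x^3)
y <- X %*% rep(1, 4) + rnorm(n, sd = 1)

# Normal equations fail here: crossprod(X) is numerically singular
# solve(crossprod(X), crossprod(X, y))  # error

# QR decomposition solves the least-squares problem without forming X'X
beta_qr <- solve(qr(X), y)

# Agrees with base R's QR-based reference implementation
all.equal(as.numeric(beta_qr), unname(coef(lm.fit(X, y))))

# With a weights vector w, scaling rows by sqrt(w) reduces weighted least
# squares to the homoscedastic case before applying the same QR solve
w <- runif(n, 0.5, 2)
beta_w <- solve(qr(X * sqrt(w)), y * sqrt(w))
```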

statlink commented 3 years ago

Hi frederikziebell.

If we did it via a QR decomposition we would not achieve such a high speed. You can look at the code and recognize that it does not handle cases like the one you mention. The purpose of Rfast is to make R fast. We let the users do the checks on their data. We provide something that is bare bones and fast.

Michail

frederikziebell commented 3 years ago

It is also slower:

library("microbenchmark")
n <- 400
set.seed(1)
X <- matrix(rnorm(n^2),ncol=n)
beta <- 1:n
Y <- X%*%beta
microbenchmark(
  Rfast::lmfit(X,Y),
  solve(qr(X),Y),
  stats::lm.fit(X,Y)
)

[microbenchmark results screenshot]

statlink commented 3 years ago

Yes, I will agree with you. I also tested it on an old laptop.

But what kind of an example is this? Is it realistic? Who would fit a linear regression model with 400 variables if they had only 400 observations? Statistically it does not make much sense. What if you try the same example with 50 variables? Is lmfit still slower? I think not.

But, if this example suits you, then fine by me.
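The 50-variable case suggested above can be checked with a sketch along these lines (timings are machine-dependent and base system.time is used instead of microbenchmark to keep it self-contained; Rfast is only timed if installed):

```r
# Re-run of the benchmark with n = 400 observations but only p = 50
# predictors, closer to a typical regression setting
set.seed(1)
n <- 400
p <- 50
X <- matrix(rnorm(n * p), ncol = p)
Y <- X %*% (1:p)

# Coarse base-R timings over repeated fits
t_qr <- system.time(for (i in 1:200) solve(qr(X), Y))["elapsed"]
t_lm <- system.time(for (i in 1:200) stats::lm.fit(X, Y))["elapsed"]

# Time Rfast::lmfit only when the package is available
if (requireNamespace("Rfast", quietly = TRUE)) {
  t_rf <- system.time(for (i in 1:200) Rfast::lmfit(X, Y))["elapsed"]
}
```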