SzymonNowakowski / DMRnet

This is a development version of DMRnet — Delete or Merge Regressors Algorithms for Linear and Logistic Model Selection and High-Dimensional Data.

Apparently quadratic time complexity in the number of samples (actually it is linear) #60

Closed SzymonNowakowski closed 1 year ago

SzymonNowakowski commented 1 year ago

Problem statement

DMRnet and the other algorithms appear to scale quadratically in the number of samples, $n$.

Conclusion

The computational complexity is a function of $n$ (the sample size) and $p$ (the number of variables). When one increases $n$ only, the computational complexity scales linearly, as it should.

But if $p$ is a function of $n$, $p=p(n)$, as in the Insurance dataset and some others (Airbnb is another prime example), then increasing $n$ also increases $p$ and, as a consequence, the computational complexity grows polynomially in $n$ with a power higher than 1.
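To make the $p=p(n)$ effect concrete, here is a minimal sketch with simulated data (the variable name x and the 579-level, Zipf-like factor are assumptions for illustration, not the Insurance data): the width of the one-hot encoded design matrix grows with the subsample size, because larger subsamples observe more levels of a high-cardinality categorical variable.

# Minimal sketch: one categorical variable with 579 nominal levels, drawn with a
# skewed (Zipf-like) distribution so that small subsamples observe only a few levels.
set.seed(0)
x <- factor(sample(1:579, size = 59381, replace = TRUE, prob = 1 / (1:579)^2))

design_width <- function(n) {
  sub <- droplevels(x[1:n])          # keep only the levels actually observed
  ncol(model.matrix(~ sub))          # intercept + one-hot columns for observed levels
}

sapply(c(1500, 3000, 6000), design_width)
# the effective p grows with n, even though the nominal number of levels is fixed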

Profiling experiments which back up the stated conclusion

I ran some experiments to shed more light on the seemingly quadratic time complexity of DMRnet and the other algorithms in the number of samples.

The experiments were run with version 0.3.4.9001 (testing_branch).

library(devtools)
library(profvis)
load_all()

load("data/NA_insurance/Glamer_NaN.RData")

dim(insurance.train.10percent.x)

# [1] 5938  113

profvis({
    set.seed(0)   # to make the tests reproducible
    mod <- DMRnet(insurance.train.10percent.x, insurance.train.10percent.y, algorithm="glamer")
})

profvis({
    set.seed(0)   # to make the tests reproducible
    mod <- DMRnet(insurance.train.10percent.x[1:3000,], insurance.train.10percent.y[1:3000], algorithm="glamer")
})

profvis({
    set.seed(0)   # to make the tests reproducible
    mod <- DMRnet(insurance.train.10percent.x[1:1500,], insurance.train.10percent.y[1:1500], algorithm="glamer")
})

The other computation cycles are mostly overhead related to grpreg, but the fragment that dominates and scales almost exactly quadratically is the one related to stats::lm.fit. What is wrong? It would seem that stats::lm.fit should scale linearly with the number of samples.
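As a quick sanity check of that last claim, one can time stats::lm.fit on simulated matrices with a fixed $p$ and a growing $n$; this micro-benchmark sketch (with made-up dimensions, unrelated to DMRnet internals) should show roughly linear scaling:

# Rough micro-benchmark sketch: stats::lm.fit with fixed p and growing n.
set.seed(0)
p <- 100
for (n in c(1500, 3000, 6000)) {
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  y <- rnorm(n)
  print(system.time(stats::lm.fit(X, y))["elapsed"])
}
# elapsed time should roughly double when n doubles, i.e. linear in n for fixed p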

To understand this, let's count how many times part2beta_help.R (and consequently stats::lm.fit) gets called in each of the runs above. After some debugging we can inspect these call counts.
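Since part2beta_help.R is an internal file, one way to get such a count from the console, without editing the package code, is to attach a counting tracer to stats::lm.fit itself. This is only a sketch: it assumes DMRnet resolves lm.fit through the stats namespace (e.g. via a fully qualified stats::lm.fit call); otherwise the tracer may need to be attached elsewhere.

# Sketch: count calls to stats::lm.fit during one DMRnet run by attaching a tracer.
lm_fit_calls <- 0
trace("lm.fit", where = asNamespace("stats"),
      tracer = function() lm_fit_calls <<- lm_fit_calls + 1,
      print = FALSE)

mod <- DMRnet(insurance.train.10percent.x[1:1500, ],
              insurance.train.10percent.y[1:1500],
              algorithm = "glamer")

untrace("lm.fit", where = asNamespace("stats"))
lm_fit_calls   # number of lm.fit calls observed in this run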

The answer to what is happening is the increase of $p$ (the number of variables) with increasing $n$ (the number of samples). Indeed, in the Insurance dataset, the initial preprocessing resulted in $n=59,381$ observations with 5 continuous and 108 categorical variables (e.g. one of the variables concerning the medical history of an individual has 579 levels), giving $p=823$. After subsampling 10% of this dataset we get a sample with 5938 observations, as we could see above. But a crucial point to note is that in this sample the most numerous categorical variable has only about 60 observed levels, or equivalently: about 60 one-hot encoded columns in the design matrix (plus the one-hot encoded columns of the less numerous categorical variables). If one halves the number of observations, one automatically reduces not only the sample size, but the problem size as well.
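One can check this directly on the subsamples used above. The sketch below assumes insurance.train.10percent.x is a data frame whose categorical columns are stored as factors (the helper name observed_p is hypothetical); it counts the effective design-matrix width implied by the levels actually observed:

# Sketch: effective one-hot encoded problem size p for a given subsample,
# counting only the factor levels that are actually observed in it.
observed_p <- function(X) {
  1 + sum(sapply(X, function(col) {
    if (is.factor(col)) nlevels(droplevels(col)) - 1 else 1
  }))                                # intercept + dummy columns + continuous columns
}

observed_p(insurance.train.10percent.x[1:1500, ])   # smaller subsample, smaller p
observed_p(insurance.train.10percent.x)             # full 10% subsample, larger p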

To confirm that, let's scale only $n$ and keep $p$ constant. The following experiment is an example of this: we keep $p$ as in the last run with 1500 observations, but artificially increase the number of observations to $n=6000$ by replicating each row 4 times. Note that part2beta_help.R (and consequently stats::lm.fit) gets called 306 times in the experiment below.

profvis({
    set.seed(0)   # to make the tests reproducible
    mod <- DMRnet(insurance.train.10percent.x[rep(1:1500, each=4),], insurance.train.10percent.y[rep(1:1500, each=4)], algorithm="glamer")
})

One can notice that the computation time spent in stats::lm.fit now scales linearly. Including the additional overhead of grpreg and other computations, we can see that the overall complexity of our algorithm is linear in $n$, as expected.
