Problem statement
There is a seemingly quadratic time complexity of DMRnet and other algorithms in the number of samples, $n$.
Conclusion
The computational complexity is a function of $n$ (the sample size) and $p$ (the number of variables). When one increases $n$ only, the computational complexity scales linearly, as it should.
But if $p$ is a function of $n$, $p=p(n)$, as in the Insurance dataset and some others (Airbnb is another prime example), then it may happen that increasing $n$ also increases $p$ and, as a consequence, the computational complexity grows in $n$ polynomially with a power higher than 1.
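As a toy illustration of $p=p(n)$ (a hypothetical sketch, not the Insurance data): for a skewed, high-cardinality categorical variable, the number of levels actually observed in a sample of size $n$, and hence the number of one-hot columns in the design matrix, grows with $n$.
# Toy sketch: a categorical variable with 500 potential levels and a skewed
# (Zipf-like) level distribution; the count of levels observed in the first n
# rows -- i.e. the one-hot columns that variable contributes -- grows with n.
set.seed(0)
pool <- factor(sample(1:500, size = 60000, replace = TRUE, prob = 1 / (1:500)^2))
for (n in c(1500, 3000, 6000)) {
  cat(sprintf("n = %4d  ->  observed levels: %d\n", n, nlevels(droplevels(pool[1:n]))))
}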
Profiling experiments which back up the stated conclusion
I ran some experiments to shed more light on the seemingly quadratic time complexity of DMRnet and other algorithms in the number of samples.
The experiments were run on version 0.3.4.9001, testing_branch.
library(devtools)
library(profvis)
load_all()  # load the development version of DMRnet
load("data/NA_insurance/Glamer_NaN.RData")  # the 10% Insurance subsample used below
dim(insurance.train.10percent.x)
# [1] 5938 113
profvis({
  set.seed(0)  # to make the tests reproducible
  mod <- DMRnet(insurance.train.10percent.x, insurance.train.10percent.y, algorithm="glamer")
})
profvis({
  set.seed(0)  # to make the tests reproducible
  mod <- DMRnet(insurance.train.10percent.x[1:3000,], insurance.train.10percent.y[1:3000], algorithm="glamer")
})
profvis({
  set.seed(0)  # to make the tests reproducible
  mod <- DMRnet(insurance.train.10percent.x[1:1500,], insurance.train.10percent.y[1:1500], algorithm="glamer")
})
~6000 rows: ~300,000 computation cycles, spent mostly in part2beta_help.R, and specifically in stats::lm.fit called from part2beta_help.R (~200,000 cycles)
3000 rows: ~100,000 computation cycles, spent mostly in part2beta_help.R, and specifically in stats::lm.fit called from part2beta_help.R (~50,000 cycles)
1500 rows: ~44,000 computation cycles, spent mostly in part2beta_help.R, and specifically in stats::lm.fit called from part2beta_help.R (~14,000 cycles)
The other computation cycles are mostly overhead related to grpreg, but the fragment that dominates and scales almost exactly quadratically is the one related to stats::lm.fit. What is going on? One would expect stats::lm.fit to scale linearly with the number of samples.
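To make the "almost exactly quadratically" claim concrete, compare the stats::lm.fit cycle counts across the two doublings of $n$: 50,000 / 14,000 ≈ 3.6 and 200,000 / 50,000 = 4, i.e. each doubling of $n$ roughly quadruples the time spent there, which is the signature of $O(n^2)$ growth.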
To understand this, let's count the number of times that the code in part2beta_help.R (and consequently stats::lm.fit) gets called. After some debugging (one way to do it is sketched after the counts below), the numbers are the following:
~6000 rows: 439 times
3000 rows: 360 times
1500 rows: 306 times
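One way to reproduce such counts is to wrap the helper with trace() after load_all(). This is only a sketch: it assumes the function defined in part2beta_help.R is itself named part2beta_help (a guess based on the file name) and that the loaded namespace is called DMRnet.
calls <- 0L
trace("part2beta_help", where = asNamespace("DMRnet"),
      tracer = function() calls <<- calls + 1L, print = FALSE)
set.seed(0)
mod <- DMRnet(insurance.train.10percent.x[1:1500,], insurance.train.10percent.y[1:1500], algorithm="glamer")
untrace("part2beta_help", where = asNamespace("DMRnet"))
calls  # should print a count comparable to the 306 reported above for 1500 rows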
The answer to what is happening is the increase of $p$ (the number of variables) with increasing $n$ (the number of samples). Indeed,
in the Insurance dataset, the initial preprocessing resulted in $n=59,381$ observations with 5 continuous and 108 categorical variables (e.g., one of the variables concerning the medical history of an individual has 579 levels), giving $p=823$. After subsampling 10% of this dataset we get a sample with 5938 observations, as seen above. The crucial point is that in such a sample only approximately 60 levels of the most numerous categorical variable are actually observed; equivalently, the design matrix contains only ~60 one-hot encoded columns for that variable (plus the one-hot encoded columns of the less numerous categorical variables). If one halves the number of observations, one automatically reduces not only the sample size but the problem size as well.
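To check the effect directly on the data, one can count how many design-matrix columns a subsample would produce. The sketch below assumes insurance.train.10percent.x is a data frame with factor columns for the categorical variables and counts one one-hot column per observed level (plus one column per continuous variable).
# Rough size of the one-hot design matrix for a given subset of rows
effective_p <- function(X, rows) {
  Xs <- droplevels(X[rows, , drop = FALSE])  # drop levels unobserved in the subsample
  sum(sapply(Xs, function(col) if (is.factor(col)) nlevels(col) else 1L))
}
effective_p(insurance.train.10percent.x, 1:1500)
effective_p(insurance.train.10percent.x, 1:3000)
effective_p(insurance.train.10percent.x, 1:5938)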
To confirm this, let's scale only $n$ and keep $p$ constant. The following experiment does exactly that: we keep $p$ as in the last example with 1500 observations, but artificially increase the number of observations to $n=6000$ by repeating each row four times. You'll notice that part2beta_help.R (and consequently stats::lm.fit) gets called 306 times in the experiment below, exactly as for 1500 observations.
profvis({
  set.seed(0)  # to make the tests reproducible
  mod <- DMRnet(insurance.train.10percent.x[rep(1:1500, each=4),], insurance.train.10percent.y[rep(1:1500, each=4)], algorithm="glamer")
})
One can see that the computation time spent in stats::lm.fit now scales linearly. Including the additional overhead of grpreg and other computations, the overall complexity of our algorithm is linear in $n$ (for fixed $p$), as expected.
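As an independent sanity check of the last statement (a sketch on random data, unrelated to DMRnet): timing stats::lm.fit with a fixed $p$ and a growing $n$ shows the expected roughly linear growth.
# Fixed p, growing n: lm.fit on random design matrices should scale ~linearly in n
p <- 200
for (n in c(1500, 3000, 6000)) {
  X <- matrix(rnorm(n * p), n, p)
  y <- rnorm(n)
  cat(sprintf("n = %4d: %.3f s elapsed in lm.fit\n", n,
              system.time(stats::lm.fit(X, y))["elapsed"]))
}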