adamsardar / stoneTrees

Integrating 'omics data with biological networks by solving Steiner Tree problems
Mozilla Public License 2.0
3 stars 3 forks source link

Include CBC as a supported solver #9

Closed adamsardar closed 4 years ago

adamsardar commented 4 years ago

I always understood that SYMPHONY and CBC were essentially the same thing. I was wrong:

https://list.coin-or.org/pipermail/cbc/2012-January/000755.html

Given that @JulieBorgel has noticed some shortcomings of GLPK, it would be interesting to support CBC and compare performance.

adamsardar commented 4 years ago

MIPCL also looks promising,

http://plato.asu.edu/ftp/milp.html

Perhaps I could cook up an Rcpp wrapper at some stage. It's lesser GNU licensed, so why not?

adamsardar commented 4 years ago

Actually, MIPCL looks really promising

http://plato.asu.edu/talks/euro2019.pdf

I'm fairy surprised that there is no R binding yet

adamsardar commented 4 years ago

Unfortunately, MIPCL does not actually release the source code - there are some precompiled .so files and that's about it. I suspect that this why we see so little uptake.

https://github.com/onebitbrain/MIPCL

adamsardar commented 4 years ago

MIPCL comes as a precompiled .so file and on windows it requires an explicit installation from a MSI. I can't recommend installation of packages in such an insecure fashion, so will concentrate on CBC.

adamsardar commented 4 years ago

rcbc looks promising

adamsardar commented 4 years ago

Bad news! It looks the CBC is considerably slower than GLPK on my laptop.

See below for a benchmark of repeatedly solving repeated instances of the Steiner Forest problem (the solver was attacking the exact same problem each time).

This is very frustrating, but it shows why you need to benchmark.

image

Even more frustrating! CBC often finds better solutions than GLPK (although at colossal expense!)

solDetailsDT[,.N,by = .(RCBCvcount <= RGLPKvcount)] (i.e. number of times that the CBC solution is smaller or equal size to GLPK)

RCBCvcount N
TRUE 65
FALSE 1

solDetailsDT[,.N,by = .(RCBCvcount >= RGLPKvcount)] (i.e. number of times that the GLPK solution is smaller or equal size to CBC)

RCBCvcount N
TRUE 59
FALSE 7

There are 7 occassions where CBC has better (smaller) module solutions and 1 where GLPK trumps CBC.

Raw results (only 66 examples ran overnight):

RCBCtime RCBCvcount RGLPKtime RGLPKvcount nFixedTerminals trial
11.614 55 2.562 55 33 1
63.622 60 162.976 60 33 2
36.094 54 42.996 54 33 3
18.221 58 12.017 58 34 4
45.737 56 50.730 56 34 5
15.391 57 4.558 57 32 6
16.098 60 3.366 60 37 7
22.082 58 4.040 58 35 8
16.979 60 7.191 60 37 9
56.859 57 3.510 57 34 10
24.934 55 15.572 55 32 11
19.679 55 5.054 55 37 12
290.190 54 19.738 54 33 13
52.809 49 16.664 49 31 14
389.285 66 129.810 66 40 15
46.422 57 11.529 57 33 16
78.082 57 5.076 57 35 17
98.783 72 104.074 72 43 18
98.279 64 5.858 64 39 19
126.270 74 91.341 74 43 20
78.224 53 57.001 53 31 21
507.897 54 115.411 54 31 22
66.557 58 404.906 58 35 23
94.955 59 7.455 59 37 24
142.105 54 60.317 54 32 25
68.722 44 9.241 44 26 26
84.844 46 11.043 46 28 27
95.870 59 380.100 59 31 28
319.403 60 20.774 60 40 29
574.199 61 56.500 61 40 30
505.806 55 87.754 55 34 31
183.985 56 850.606 56 30 32
93.381 66 12.255 66 36 33
307.693 48 164.126 48 27 34
141.956 57 9.900 57 36 35
100.401 60 52.435 60 34 36
100.475 61 12.671 61 37 37
101.982 59 16.258 59 35 38
100.246 63 11.733 63 40 39
579.275 60 513.917 60 35 40
367.616 50 97.376 50 30 41
553.308 54 45.439 54 32 42
145.731 59 16.594 59 36 43
585.326 55 33.491 55 33 44
190.230 56 43.462 56 33 45
576.630 56 24.820 53 34 46
132.229 62 46.506 62 36 47
131.538 62 13.625 62 38 48
408.818 65 108.427 65 39 49
136.814 64 13.863 64 41 50
135.638 56 28.548 56 34 51
137.032 68 10.229 68 40 52
281.978 62 66.125 62 37 53
139.237 49 28.501 49 30 54
429.285 59 10.245 59 35 55
290.578 50 11.296 50 32 56
738.847 52 13.533 52 32 57
5119.158 48 21.417 57 33 58
190.331 66 10.763 66 41 59
189.824 52 12.383 52 33 60
6107.653 54 16.670 63 38 61
4883.079 41 12.458 51 29 62
4995.475 48 86.101 57 32 63
4893.707 56 31.585 62 40 64
8644.265 44 105.102 57 32 65
6660.619 53 15.883 64 40 66
library(stoneTrees)
library(igraph)
library(microbenchmark)

fixedTerminalLymphomaGraph <- lymphomaGraph
V(fixedTerminalLymphomaGraph)$isTerminal <- FALSE
V(fixedTerminalLymphomaGraph)[nodeScore > 0]$isTerminal <- TRUE
V(fixedTerminalLymphomaGraph)$nodeScore <- -1

# I notice that the Steiner forest is actually much harder than MWCS to solve - probably the symmetry of having no node-weights

testSteinFor <- nodeCentricSteinerForestProblem$new(fixedTerminalLymphomaGraph, verbose = FALSE, solverChoice = "rcbc")

set.seed(2345)
solDetails <- list()

for(i in 1:100){

  print(i)

  #Ensure that our solvers are both solving the same type of problem
  # We can't just run two bootstrap Stiener forest routines 
  testSteinFor$.__enclos_env__$private$resampleFixedTerminals()

  testSteinFor$.__enclos_env__$private$solver <- "RGLPK"
  rglpkTime <- system.time(RGLPKsol <- testSteinFor$findSingleSteinerSolution())

  testSteinFor$.__enclos_env__$private$solver <- "RCBC"
  rcbcTime <- system.time(RCBCsol <- testSteinFor$findSingleSteinerSolution()) 

  solDetails %<>%
    c(list(data.table(RCBCtime = rcpcTime["elapsed"], 
               RCBCvcount = vcount(RCBCsol),
               RGLPKtime = rglpkTime["elapsed"],
               RGLPKvcount = vcount(RGLPKsol),
               nFixedTerminals = length(V(RCBCsol)[isTerminal]))))
}

rbindlist(solDetails) %>%
  .[,trial := .I] %>%
  melt(id.vars = "trial") %>%
  .[variable %like% "time"] %>%
  ggplot(aes(x = variable, y = value/60)) +
  geom_boxplot(outlier.shape = NULL) +
  geom_jitter(aes(colour = variable)) +
  scale_y_log10() +
  theme_bw() + theme(legend.position = "none") +
  labs(x = "Solver", y = "Time (minues) [log scale]", title = "Comparison of CBC and GLPK solving single iterations ofthe Steiner Forest problem")
adamsardar commented 4 years ago

I notice that the solver is actually taking longer as more solutions are found. I believe that the previous lazy constraints are kept in the system, but maybe they should be flushed each time the fixed terminals change ...

jonnywray commented 4 years ago

do testSteinFor <- nodeCentricSteinerForestProblem$new(fixedTerminalLymphomaGraph, verbose = FALSE, solverChoice = "rcbc") inside the loop just to make sure they are all independent?

adamsardar commented 4 years ago

Yeah, good idea. I actually think that this might have found a bug though - the lazy constraints are adding significant overhead on the search ...

adamsardar commented 4 years ago

Why didn't I get a better CPU when I upgraded this laptop!!?

Will benchmark that case afterwards

jonnywray commented 4 years ago

Bug finding is always a good thing

adamsardar commented 4 years ago

Calling it a bug is too much - the constraint generation is still correct.

adamsardar commented 4 years ago

Running CBC on a single thread - there seems to be a cumulative effect here ...

Maybe GLPK is much better at decreasing the constraints than CBC?

image

adamsardar commented 4 years ago

Ahem,

I've removed the cumulative connectivity constraints in the steinerForest routine (so when the fixed terminals change, the connectivity constraints are flushed because we are dealing with a new instance of the problem).

The decrease in runtime is astonishing:

First of all, the runtime is no longer slowly increasing.

image

But also, this is the same code as above - look at the y-axis! 100x speedup!

image

Also, we don't have that strange behaviour of GLPK being worse than CBC in terms of solution quality.

RCPCtime RCPCvcount RGLPKtime RGLPKvcount nFixedTerminals trial
0.623 55 0.902 55 34 1
12.725 60 95.373 60 33 2
2.370 54 0.062 54 33 3
0.651 58 1.713 58 34 4
18.794 56 60.669 56 34 5
0.345 57 1.554 57 32 6
0.724 60 1.243 60 37 7
0.380 58 2.248 58 34 8
0.123 60 0.047 60 38 9
1.268 57 0.339 57 34 10
0.334 55 3.472 55 32 11
0.591 55 0.309 55 37 12
1.043 54 5.194 54 31 13
0.533 49 0.593 49 31 14
11.007 66 27.296 66 39 15
1.840 57 0.179 57 34 16
0.709 57 0.566 57 35 17
3.090 72 97.028 72 43 18
0.190 64 0.353 64 39 19
2.941 74 3.033 74 43 20
0.928 53 1.889 53 31 21
7.117 54 43.801 54 31 22
6.274 58 9.972 58 36 23
0.640 59 0.052 59 36 24
2.462 54 3.642 54 32 25
0.533 44 0.851 44 26 26
2.343 46 0.199 46 28 27
3.376 59 43.497 59 32 28
3.612 60 3.044 60 40 29
5.123 61 33.925 61 39 30
1.156 55 0.542 55 34 31
9.863 56 20.064 56 30 32
0.343 66 0.331 66 36 33
9.039 48 6.709 48 27 34
0.706 57 1.070 57 36 35
2.962 60 0.053 60 34 36
0.248 61 0.289 61 37 37
0.400 59 5.937 59 35 38
0.340 63 1.290 63 40 39
30.557 60 19.091 60 35 40
1.669 50 5.479 50 30 41
0.462 54 6.302 54 31 42
0.346 59 0.680 59 36 43
3.794 55 2.269 55 34 44
3.071 56 14.034 56 33 45
13.961 53 10.606 53 34 46
5.347 62 14.911 62 36 47
10.022 62 23.975 62 38 48
7.940 65 152.686 65 39 49
1.338 64 1.944 64 41 50
2.016 56 0.339 56 34 51
0.998 68 2.199 68 40 52
0.364 62 3.600 62 37 53
0.281 49 0.259 49 30 54
3.034 59 2.430 59 35 55
0.280 50 1.159 50 32 56
4.560 52 1.954 52 32 57
0.828 57 1.727 57 33 58
0.311 66 0.537 66 42 59
0.225 52 0.195 52 33 60
10.576 63 0.853 63 39 61
0.636 51 1.177 51 30 62
5.472 57 9.441 57 31 63
0.951 62 0.353 62 40 64
4.103 57 10.569 57 32 65
7.875 64 1.283 64 40 66
3.778 56 1.616 56 31 67
2.757 66 0.632 66 40 68
0.912 56 0.269 56 33 69
2.022 58 8.052 58 38 70
1.376 62 1.078 62 37 71
0.288 58 0.057 58 35 72
1.962 52 5.357 52 28 73
1.825 49 4.557 49 30 74
1.245 63 0.728 63 39 75
1.588 43 2.859 43 27 76
3.051 57 0.288 57 36 77
0.845 57 1.590 57 35 78
1.028 56 1.901 56 32 79
0.680 52 0.387 52 30 80
0.512 62 3.018 62 36 81
1.444 62 10.187 62 38 82
3.711 54 1.234 54 34 83
0.508 51 0.478 51 31 84
1.046 60 3.888 60 34 85
1.603 58 0.444 58 34 86
5.340 56 208.878 56 33 87
2.602 62 0.330 62 38 88
0.695 52 0.280 52 34 89
0.586 58 1.339 58 33 90
0.778 49 0.687 49 28 91
4.499 52 2.590 52 33 92
4.531 62 10.925 62 37 93
0.878 57 12.693 57 34 94
1.452 59 53.015 59 33 95
0.323 55 0.394 55 34 96
0.434 49 4.607 49 31 97
2.391 68 0.714 68 40 98
0.996 51 3.490 51 33 99
0.284 54 1.820 54 35 100
jonnywray commented 4 years ago

Interesting - so both speed up, but CBC by more. Curious to see what those changes do to the problems that seems intractable previously

adamsardar commented 4 years ago

It depends on why they were intractable - single Steiner tree problems (or collection of sub-optimal solutions), this will make no difference. But for the Steiner Forest problem the improvement is huge.

adamsardar commented 4 years ago

Some more benchmarks! Whereas previously I was running the exact same minimum Steiner tree problem, I am not trialling 100 iterations of the full Steiner forest process.

Runtime

Making repeated calls of the form:

  rcbcTime <- system.time(RCBCsol <- nodeCentricSteinerForestProblem$new(fixedTerminalLymphomaGraph, verbose = FALSE, solverChoice = "rcbc")$sampleMultipleBootstrapSteinerSolutions())

(i.e. running the Steiner forest routine using the defaults; 5xbootstraps and 0xiterations for sub-optimal solutions)

image

CBC seems a little faster (although there isn't that much in it - the speedups have been algorithmic, not solver based).

However ...

Module size

The resultant modules from CBC are larger than those from GLPK. There are two reasons why this might happen - GLPK might just find better (smaller) solutions or CBC might find more diverse solutions. I'll need to run another benchmark where I collect the module sizes in the solution pool (which are aggregated to give the result).

image

Next

The question is now of whether GLPK or CBC find better (smaller) and/or more diverse solutions. This will be answered by my next benchmark.

adamsardar commented 4 years ago

Last benchmark; the Steiner forest process with 100 bootstraps (i.e. resampling the seed set 100x) and collecting up to five degenerate solutions on each run. I ran the process 15 times (I wanted to finish it overnight and the numbers below seem conclusive). Whilst each of these 15x2=30 runs have different sub-problems (the sampling step is stochastic), comparison of means is still reasonable. This stands in contrast to the first benchmark, where exactly the same Steiner tree problem was solved.

Running something like:

glpktim <- system.time( SteinForGLPK <- nodeCentricSteinerForestProblem$new(fixedTerminalLymphomaGraph, solverChoice = "RGLPK", verbose = FALSE)$sampleMultipleBootstrapSteinerSolutions(nBootstraps = 100, maxItr = 5) )

  cbctime <- system.time( SteinForCBC <- nodeCentricSteinerForestProblem$new(fixedTerminalLymphomaGraph, solverChoice = "RCBC", verbose = FALSE)$sampleMultipleBootstrapSteinerSolutions(nBootstraps = 100, maxItr = 5) )

First of all - CBC wins on time:

image

Over an hour to only a few minutes!!

But as before, the CBC aggregated modules are larger:

image

Inspecting the individual modules collected (i.e. the solutions to the sub-problem of resampled terminals) the solutions seem equivalent (in size).

image

The reason for the larger aggregated modules under CBC is simply that the solver is able to find far more solutions to the subproblems:

image

So there we have it: CBC is much faster and finds many more solutions. This is sufficient in my mind to change the recommended solver.