Include CBC as a supported solver

adamsardar commented 4 years ago

I always understood that SYMPHONY and CBC were essentially the same thing. I was wrong:

https://list.coin-or.org/pipermail/cbc/2012-January/000755.html

Given that @JulieBorgel has noticed some shortcomings of GLPK, it would be interesting to support CBC and compare performance.

adamsardar commented 4 years ago

MIPCL also looks promising,

http://plato.asu.edu/ftp/milp.html

Perhaps I could cook up an Rcpp wrapper at some stage. It's lesser GNU licensed, so why not?

adamsardar commented 4 years ago

Actually, MIPCL looks really promising

http://plato.asu.edu/talks/euro2019.pdf

I'm fairy surprised that there is no R binding yet

adamsardar commented 4 years ago

Unfortunately, MIPCL does not actually release the source code - there are some precompiled .so files and that's about it. I suspect that this why we see so little uptake.

https://github.com/onebitbrain/MIPCL

adamsardar commented 4 years ago

MIPCL comes as a precompiled .so file and on windows it requires an explicit installation from a MSI. I can't recommend installation of packages in such an insecure fashion, so will concentrate on CBC.

adamsardar commented 4 years ago

rcbc looks promising

adamsardar commented 4 years ago

Bad news! It looks the CBC is considerably slower than GLPK on my laptop.

See below for a benchmark of repeatedly solving repeated instances of the Steiner Forest problem (the solver was attacking the exact same problem each time).

This is very frustrating, but it shows why you need to benchmark.

Even more frustrating! CBC often finds better solutions than GLPK (although at colossal expense!)

solDetailsDT[,.N,by = .(RCBCvcount <= RGLPKvcount)] (i.e. number of times that the CBC solution is smaller or equal size to GLPK)

RCBCvcount	N
TRUE	65
FALSE	1

solDetailsDT[,.N,by = .(RCBCvcount >= RGLPKvcount)] (i.e. number of times that the GLPK solution is smaller or equal size to CBC)

RCBCvcount	N
TRUE	59
FALSE	7

There are 7 occassions where CBC has better (smaller) module solutions and 1 where GLPK trumps CBC.

Raw results (only 66 examples ran overnight):

RCBCtime	RCBCvcount	RGLPKtime	RGLPKvcount	nFixedTerminals	trial
11.614	55	2.562	55	33	1
63.622	60	162.976	60	33	2
36.094	54	42.996	54	33	3
18.221	58	12.017	58	34	4
45.737	56	50.730	56	34	5
15.391	57	4.558	57	32	6
16.098	60	3.366	60	37	7
22.082	58	4.040	58	35	8
16.979	60	7.191	60	37	9
56.859	57	3.510	57	34	10
24.934	55	15.572	55	32	11
19.679	55	5.054	55	37	12
290.190	54	19.738	54	33	13
52.809	49	16.664	49	31	14
389.285	66	129.810	66	40	15
46.422	57	11.529	57	33	16
78.082	57	5.076	57	35	17
98.783	72	104.074	72	43	18
98.279	64	5.858	64	39	19
126.270	74	91.341	74	43	20
78.224	53	57.001	53	31	21
507.897	54	115.411	54	31	22
66.557	58	404.906	58	35	23
94.955	59	7.455	59	37	24
142.105	54	60.317	54	32	25
68.722	44	9.241	44	26	26
84.844	46	11.043	46	28	27
95.870	59	380.100	59	31	28
319.403	60	20.774	60	40	29
574.199	61	56.500	61	40	30
505.806	55	87.754	55	34	31
183.985	56	850.606	56	30	32
93.381	66	12.255	66	36	33
307.693	48	164.126	48	27	34
141.956	57	9.900	57	36	35
100.401	60	52.435	60	34	36
100.475	61	12.671	61	37	37
101.982	59	16.258	59	35	38
100.246	63	11.733	63	40	39
579.275	60	513.917	60	35	40
367.616	50	97.376	50	30	41
553.308	54	45.439	54	32	42
145.731	59	16.594	59	36	43
585.326	55	33.491	55	33	44
190.230	56	43.462	56	33	45
576.630	56	24.820	53	34	46
132.229	62	46.506	62	36	47
131.538	62	13.625	62	38	48
408.818	65	108.427	65	39	49
136.814	64	13.863	64	41	50
135.638	56	28.548	56	34	51
137.032	68	10.229	68	40	52
281.978	62	66.125	62	37	53
139.237	49	28.501	49	30	54
429.285	59	10.245	59	35	55
290.578	50	11.296	50	32	56
738.847	52	13.533	52	32	57
5119.158	48	21.417	57	33	58
190.331	66	10.763	66	41	59
189.824	52	12.383	52	33	60
6107.653	54	16.670	63	38	61
4883.079	41	12.458	51	29	62
4995.475	48	86.101	57	32	63
4893.707	56	31.585	62	40	64
8644.265	44	105.102	57	32	65
6660.619	53	15.883	64	40	66

library(stoneTrees)
library(igraph)
library(microbenchmark)

fixedTerminalLymphomaGraph <- lymphomaGraph
V(fixedTerminalLymphomaGraph)$isTerminal <- FALSE
V(fixedTerminalLymphomaGraph)[nodeScore > 0]$isTerminal <- TRUE
V(fixedTerminalLymphomaGraph)$nodeScore <- -1

# I notice that the Steiner forest is actually much harder than MWCS to solve - probably the symmetry of having no node-weights

testSteinFor <- nodeCentricSteinerForestProblem$new(fixedTerminalLymphomaGraph, verbose = FALSE, solverChoice = "rcbc")

set.seed(2345)
solDetails <- list()

for(i in 1:100){

  print(i)

  #Ensure that our solvers are both solving the same type of problem
  # We can't just run two bootstrap Stiener forest routines 
  testSteinFor$.__enclos_env__$private$resampleFixedTerminals()

  testSteinFor$.__enclos_env__$private$solver <- "RGLPK"
  rglpkTime <- system.time(RGLPKsol <- testSteinFor$findSingleSteinerSolution())

  testSteinFor$.__enclos_env__$private$solver <- "RCBC"
  rcbcTime <- system.time(RCBCsol <- testSteinFor$findSingleSteinerSolution()) 

  solDetails %<>%
    c(list(data.table(RCBCtime = rcpcTime["elapsed"], 
               RCBCvcount = vcount(RCBCsol),
               RGLPKtime = rglpkTime["elapsed"],
               RGLPKvcount = vcount(RGLPKsol),
               nFixedTerminals = length(V(RCBCsol)[isTerminal]))))
}

rbindlist(solDetails) %>%
  .[,trial := .I] %>%
  melt(id.vars = "trial") %>%
  .[variable %like% "time"] %>%
  ggplot(aes(x = variable, y = value/60)) +
  geom_boxplot(outlier.shape = NULL) +
  geom_jitter(aes(colour = variable)) +
  scale_y_log10() +
  theme_bw() + theme(legend.position = "none") +
  labs(x = "Solver", y = "Time (minues) [log scale]", title = "Comparison of CBC and GLPK solving single iterations ofthe Steiner Forest problem")

adamsardar commented 4 years ago

I notice that the solver is actually taking longer as more solutions are found. I believe that the previous lazy constraints are kept in the system, but maybe they should be flushed each time the fixed terminals change ...

jonnywray commented 4 years ago

do testSteinFor <- nodeCentricSteinerForestProblem$new(fixedTerminalLymphomaGraph, verbose = FALSE, solverChoice = "rcbc") inside the loop just to make sure they are all independent?

adamsardar commented 4 years ago

Yeah, good idea. I actually think that this might have found a bug though - the lazy constraints are adding significant overhead on the search ...

adamsardar commented 4 years ago

Why didn't I get a better CPU when I upgraded this laptop!!?

Will benchmark that case afterwards

jonnywray commented 4 years ago

Bug finding is always a good thing

adamsardar commented 4 years ago

Calling it a bug is too much - the constraint generation is still correct.

adamsardar commented 4 years ago

Running CBC on a single thread - there seems to be a cumulative effect here ...

Maybe GLPK is much better at decreasing the constraints than CBC?

adamsardar commented 4 years ago

Ahem,

I've removed the cumulative connectivity constraints in the steinerForest routine (so when the fixed terminals change, the connectivity constraints are flushed because we are dealing with a new instance of the problem).

The decrease in runtime is astonishing:

First of all, the runtime is no longer slowly increasing.

But also, this is the same code as above - look at the y-axis! 100x speedup!

Also, we don't have that strange behaviour of GLPK being worse than CBC in terms of solution quality.

RCPCtime	RCPCvcount	RGLPKtime	RGLPKvcount	nFixedTerminals	trial
0.623	55	0.902	55	34	1
12.725	60	95.373	60	33	2
2.370	54	0.062	54	33	3
0.651	58	1.713	58	34	4
18.794	56	60.669	56	34	5
0.345	57	1.554	57	32	6
0.724	60	1.243	60	37	7
0.380	58	2.248	58	34	8
0.123	60	0.047	60	38	9
1.268	57	0.339	57	34	10
0.334	55	3.472	55	32	11
0.591	55	0.309	55	37	12
1.043	54	5.194	54	31	13
0.533	49	0.593	49	31	14
11.007	66	27.296	66	39	15
1.840	57	0.179	57	34	16
0.709	57	0.566	57	35	17
3.090	72	97.028	72	43	18
0.190	64	0.353	64	39	19
2.941	74	3.033	74	43	20
0.928	53	1.889	53	31	21
7.117	54	43.801	54	31	22
6.274	58	9.972	58	36	23
0.640	59	0.052	59	36	24
2.462	54	3.642	54	32	25
0.533	44	0.851	44	26	26
2.343	46	0.199	46	28	27
3.376	59	43.497	59	32	28
3.612	60	3.044	60	40	29
5.123	61	33.925	61	39	30
1.156	55	0.542	55	34	31
9.863	56	20.064	56	30	32
0.343	66	0.331	66	36	33
9.039	48	6.709	48	27	34
0.706	57	1.070	57	36	35
2.962	60	0.053	60	34	36
0.248	61	0.289	61	37	37
0.400	59	5.937	59	35	38
0.340	63	1.290	63	40	39
30.557	60	19.091	60	35	40
1.669	50	5.479	50	30	41
0.462	54	6.302	54	31	42
0.346	59	0.680	59	36	43
3.794	55	2.269	55	34	44
3.071	56	14.034	56	33	45
13.961	53	10.606	53	34	46
5.347	62	14.911	62	36	47
10.022	62	23.975	62	38	48
7.940	65	152.686	65	39	49
1.338	64	1.944	64	41	50
2.016	56	0.339	56	34	51
0.998	68	2.199	68	40	52
0.364	62	3.600	62	37	53
0.281	49	0.259	49	30	54
3.034	59	2.430	59	35	55
0.280	50	1.159	50	32	56
4.560	52	1.954	52	32	57
0.828	57	1.727	57	33	58
0.311	66	0.537	66	42	59
0.225	52	0.195	52	33	60
10.576	63	0.853	63	39	61
0.636	51	1.177	51	30	62
5.472	57	9.441	57	31	63
0.951	62	0.353	62	40	64
4.103	57	10.569	57	32	65
7.875	64	1.283	64	40	66
3.778	56	1.616	56	31	67
2.757	66	0.632	66	40	68
0.912	56	0.269	56	33	69
2.022	58	8.052	58	38	70
1.376	62	1.078	62	37	71
0.288	58	0.057	58	35	72
1.962	52	5.357	52	28	73
1.825	49	4.557	49	30	74
1.245	63	0.728	63	39	75
1.588	43	2.859	43	27	76
3.051	57	0.288	57	36	77
0.845	57	1.590	57	35	78
1.028	56	1.901	56	32	79
0.680	52	0.387	52	30	80
0.512	62	3.018	62	36	81
1.444	62	10.187	62	38	82
3.711	54	1.234	54	34	83
0.508	51	0.478	51	31	84
1.046	60	3.888	60	34	85
1.603	58	0.444	58	34	86
5.340	56	208.878	56	33	87
2.602	62	0.330	62	38	88
0.695	52	0.280	52	34	89
0.586	58	1.339	58	33	90
0.778	49	0.687	49	28	91
4.499	52	2.590	52	33	92
4.531	62	10.925	62	37	93
0.878	57	12.693	57	34	94
1.452	59	53.015	59	33	95
0.323	55	0.394	55	34	96
0.434	49	4.607	49	31	97
2.391	68	0.714	68	40	98
0.996	51	3.490	51	33	99
0.284	54	1.820	54	35	100

jonnywray commented 4 years ago

Interesting - so both speed up, but CBC by more. Curious to see what those changes do to the problems that seems intractable previously

adamsardar commented 4 years ago

It depends on why they were intractable - single Steiner tree problems (or collection of sub-optimal solutions), this will make no difference. But for the Steiner Forest problem the improvement is huge.

adamsardar commented 4 years ago

Some more benchmarks! Whereas previously I was running the exact same minimum Steiner tree problem, I am not trialling 100 iterations of the full Steiner forest process.

Runtime

Making repeated calls of the form:

  rcbcTime <- system.time(RCBCsol <- nodeCentricSteinerForestProblem$new(fixedTerminalLymphomaGraph, verbose = FALSE, solverChoice = "rcbc")$sampleMultipleBootstrapSteinerSolutions())

(i.e. running the Steiner forest routine using the defaults; 5xbootstraps and 0xiterations for sub-optimal solutions)

CBC seems a little faster (although there isn't that much in it - the speedups have been algorithmic, not solver based).

However ...

Module size

The resultant modules from CBC are larger than those from GLPK. There are two reasons why this might happen - GLPK might just find better (smaller) solutions or CBC might find more diverse solutions. I'll need to run another benchmark where I collect the module sizes in the solution pool (which are aggregated to give the result).

Last benchmark; the Steiner forest process with 100 bootstraps (i.e. resampling the seed set 100x) and collecting up to five degenerate solutions on each run. I ran the process 15 times (I wanted to finish it overnight and the numbers below seem conclusive). Whilst each of these 15x2=30 runs have different sub-problems (the sampling step is stochastic), comparison of means is still reasonable. This stands in contrast to the first benchmark, where exactly the same Steiner tree problem was solved.

Running something like:

glpktim <- system.time( SteinForGLPK <- nodeCentricSteinerForestProblem$new(fixedTerminalLymphomaGraph, solverChoice = "RGLPK", verbose = FALSE)$sampleMultipleBootstrapSteinerSolutions(nBootstraps = 100, maxItr = 5) )

  cbctime <- system.time( SteinForCBC <- nodeCentricSteinerForestProblem$new(fixedTerminalLymphomaGraph, solverChoice = "RCBC", verbose = FALSE)$sampleMultipleBootstrapSteinerSolutions(nBootstraps = 100, maxItr = 5) )

First of all - CBC wins on time:

Over an hour to only a few minutes!!

But as before, the CBC aggregated modules are larger:

Inspecting the individual modules collected (i.e. the solutions to the sub-problem of resampled terminals) the solutions seem equivalent (in size).

The reason for the larger aggregated modules under CBC is simply that the solver is able to find far more solutions to the subproblems:

So there we have it: CBC is much faster and finds many more solutions. This is sufficient in my mind to change the recommended solver.

adamsardar / stoneTrees

Include CBC as a supported solver #9

Runtime

Module size

Next