KardinalAI / coin_cbc

Rust bindings to the CoinOR CBC MILP Solver using the C API.
MIT License

Major performance difference between Python Cbc and Rust coin_cbc solutions. #30

Open Thell opened 1 year ago

Thell commented 1 year ago

Hello again! First, let me say I am not sure whether this should be an issue for coin_cbc or for good_lp, and it could simply be that how I'm building the model isn't a 1:1 match to how I build it in Python with pywraplp.

Next, the basic timings:

144 s - Rust, single threaded - uses ~10% of all cores.
 81 s - Rust, thread-safe, 16 parallel threads using the Rust-built model (via the good_lp interface) - uses ~40% of all cores.
 31 s - Python, thread-safe, 16 parallel threads - uses ~100% of all cores.

Each is solving 7,353 instances of a model where only two bounds need changing. The Python implementation is reading the data from a JSON file, parsing, solving, and then post-processing the results, whereas the Rust implementation is using hardcoded static input data and only counting the resulting optimal solutions. All of them are using the Cbc build distributed with Google OR-Tools.

An interesting thing: when I export the model via pywraplp and then read it using coin_cbc, Cbc reports that the import contained two read errors (two bounds with no matching column); removing those bounds resolved the error and didn't alter the result.

I profiled the Rust code, and 93% of the time was being spent in solve(). But why so slow?!? The only thing I could think of was the model being different, and indeed the two models differ (see attached with txt extension). To verify, I needed to read the .mps in, but that is only available when using the raw::Model, which means I am also bypassing everything that happens from there all the way up to good_lp's Solvers. Not ideal, but since the profiler said the time was in the solve and feasibility checks, I'm thinking there shouldn't be anything with that much overhead.

EDIT: I cut out all of the code originally in this post based on the comments from @TeXitoi below and have put up a repo with an MWE. See post below...

TeXitoi commented 1 year ago

If you use good_lp, it's better to ask them. Otherwise, you can try using the coin_cbc interface directly without good_lp, and I can answer. A long time ago, I tried good_lp, and it was too slow for my needs.

When you say everything is in solve(), can't you check within? With a flamegraph, you should be able to spot the exact bottleneck.
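
For example, with the cargo-flamegraph tool, something like this should work (keep symbols in release builds by setting debug = true under [profile.release] in Cargo.toml; on Windows it may need an elevated shell):

> cargo install flamegraph
> cargo flamegraph --bin <your-binary>

It writes a flamegraph.svg you can open in a browser to drill into the hot path.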

TeXitoi commented 1 year ago

Also, it would be great if your code samples were runnable, i.e., we could launch them easily on our computers. Here, there is no main function and, at the least, no items data.

Thell commented 1 year ago

If you use good_lp, it's better to ask them. Otherwise, you can try using the coin_cbc interface directly without good_lp, and I can answer. A long time ago, I tried good_lp, and it was too slow for my needs.

Oh, it is interesting that there was enough overhead with good_lp that it was actually noticeable - that tells me a lot right there.

Perhaps I'll also try importing the Rust-exported model back into the raw::Model and seeing if that is slower... if it is, then that would definitively point at the model being the culprit.
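
Something like this minimal sketch is what I have in mind, assuming raw::Model implements Default and read_mps takes a &CStr as in the crate's raw bindings (the file name is a placeholder for the Rust-exported mps):

use std::ffi::CString;
use std::time::Instant;

use coin_cbc::raw::Model;
use coin_cbc::Sense;

fn main() {
    // Read the previously exported model back in via the raw interface.
    let mut model = Model::default();
    let path = CString::new("exported_model.mps").unwrap();
    model.read_mps(&path);
    model.set_obj_sense(Sense::Minimize);

    // Time a single solve of the re-imported model for comparison.
    let start = Instant::now();
    let _ = model.solve();
    println!("solve took {:?}", start.elapsed());
    println!("proven infeasible: {}", model.is_proven_infeasible());
}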

When you say everything is in solve(), can't you check within? With a flamegraph, you should be able to spot the exact bottleneck.

It'd work like that using debug builds of the Cbc libs but, as mentioned, Google's release libs are being used, and a flamegraph is exactly how solve and the feasibility function (I forget its name) got flagged for interest.

it would be great if your code samples were runnable

True. My bad. I'll get on that.

Thell commented 1 year ago

While cleaning up the code for an executable example, I realized my earlier excitement about the time improvement from using an mps-read file was due to the model being maximized instead of minimized while only counting the number of optimal solutions generated. Once this was resolved, and after whittling the Python and Rust code down to just what was needed for this example, here is where it stands... Rust is taking 2.5x longer than Python.

The example repo: https://github.com/Thell/coin_cbc_example with the Python code to run the process and export an mps, along with the Rust code to run the process (plus the supporting libs for thread-safe Cbc).


To rule out good_lp and the main API of coin_cbc, this example uses the read_mps() method of the 'raw' model interface and does the following in a loop for each worker (a fuller, self-contained sketch of the worker setup follows the snippet):

// Count how many bound combinations yield a feasible (optimal) solve.
let mut good_count = 0;
for (state_1_sum_lb, state_2_sum_lb) in state_value_pairs_chunk.into_iter() {
    // Clone the base problem so each solve starts from a fresh model.
    let mut model = problem.clone();
    // Only the two lower bounds change between instances.
    model.set_row_lower(state_1_lb_row, *state_1_sum_lb as f64);
    model.set_row_lower(state_2_lb_row, *state_2_sum_lb as f64);
    model.set_obj_sense(Sense::Minimize);
    let _ = model.solve();
    // is_proven_infeasible() is false for feasible instances: count those.
    good_count += !model.is_proven_infeasible() as u32;
}
good_count
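
For context, here is a minimal self-contained sketch of the whole worker setup, assuming raw::Model implements Default and read_mps takes a &CStr; the file name, row indices, worker count, and the generated bound pairs are all placeholders for the real data in the repo:

use std::ffi::CString;
use std::thread;

use coin_cbc::raw::Model;
use coin_cbc::Sense;

const NUM_WORKERS: usize = 16;

fn main() {
    // Placeholder bound pairs; the real ones come from the input data.
    let state_value_pairs: Vec<(i32, i32)> = (0..7_353).map(|i| (i, i)).collect();
    let chunk_size = (state_value_pairs.len() + NUM_WORKERS - 1) / NUM_WORKERS;

    // Row indices of the two changing lower-bound constraints (placeholders).
    let (state_1_lb_row, state_2_lb_row) = (0, 1);

    let good_count: u32 = thread::scope(|s| {
        let handles: Vec<_> = state_value_pairs
            .chunks(chunk_size)
            .map(|state_value_pairs_chunk| {
                s.spawn(move || {
                    // Each worker reads its own copy of the exported mps,
                    // so no Cbc model ever crosses a thread boundary.
                    let mut problem = Model::default();
                    let path = CString::new("subset_selection.mps").unwrap();
                    problem.read_mps(&path);

                    let mut good_count = 0u32;
                    for (state_1_sum_lb, state_2_sum_lb) in state_value_pairs_chunk {
                        let mut model = problem.clone();
                        model.set_row_lower(state_1_lb_row, *state_1_sum_lb as f64);
                        model.set_row_lower(state_2_lb_row, *state_2_sum_lb as f64);
                        model.set_obj_sense(Sense::Minimize);
                        let _ = model.solve();
                        good_count += !model.is_proven_infeasible() as u32;
                    }
                    good_count
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    });

    println!("good_count: {}", good_count);
}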

Python using OR-Tools

> python --version
Python 3.11.3
> pip index versions ortools
WARNING: pip index is currently an experimental command. It may be removed/changed in a future release without prior warning.
ortools (9.6.2534)
Available versions: 9.6.2534, 9.5.2237
  INSTALLED: 9.6.2534
  LATEST:    9.6.2534
> measure-command {python .\python\subset-select-sample.py -s | write-host}
good_count: 6977

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 20
Milliseconds      : 419
Ticks             : 204190785
TotalDays         : 0.000236331927083333
TotalHours        : 0.00567196625
TotalMinutes      : 0.340317975
TotalSeconds      : 20.4190785
TotalMilliseconds : 20419.0785

Rust using OR-Tools libs via coin_cbc

The coin_cbc being used is the coin_cbc_win setup, which reads the CBC_ROOT environment variable to link to the Cbc libs. The OR-Tools libs being used are from the or-tools_x64_VisualStudio2022_cpp_v9.6.2534 package and are in its lib directory, so $Env:CBC_ROOT will need to be set for coin_cbc_win to find them.
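
For example, in PowerShell (the path is illustrative; point it wherever coin_cbc_win expects, the package root or its lib directory):

> $Env:CBC_ROOT = "C:\path\to\or-tools_x64_VisualStudio2022_cpp_v9.6.2534\lib"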

> rustup show
Default host: x86_64-pc-windows-msvc
rustup home:  C:\Users\thell\.rustup

installed toolchains
--------------------

stable-x86_64-pc-windows-msvc (default)
nightly-x86_64-pc-windows-msvc

active toolchain
----------------

stable-x86_64-pc-windows-msvc (default)
rustc 1.68.2 (9eb3afe9e 2023-03-27)

The mps is exported from the Python script and used in the Rust code as-is. Each worker reads in its own copy of the mps file.

> measure-command {.\target\release\coin_cbc_example.exe | write-host}
Coin0008I subset_selection read with 0 errors
--->8---
Coin0008I subset_selection read with 0 errors
good_count: 6977

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 50
Milliseconds      : 499
Ticks             : 504992250
TotalDays         : 0.000584481770833333
TotalHours        : 0.0140275625
TotalMinutes      : 0.84165375
TotalSeconds      : 50.499225
TotalMilliseconds : 50499.225

At this point there isn't much left to point a finger at for the cause of the performance difference, and I am left thinking it will likely be the Cbc libs. But since I have been unsuccessful in building Cbc libs that are both parallel and export the needed functions, these are what I have, and they should be the same as what is being used for the OR-Tools wheel and Cython.