draffensperger / golp

Go bindings for LPSolve, a Mixed Integer Linear Programming (MILP) solver
MIT License
78 stars 21 forks source link

Erratic behaviour with larger number of rows/cols #22

Closed alexykot closed 1 month ago

alexykot commented 1 month ago

I'm trying to implement a column generation algorithm for a 1-dimension cutting stock problem, and with larger datasets I'm observing some very puzzling and erratic behaviour that I can't understand.

So I have a column generation solution for the SCP, and on small datasets it works really well.

This is a solution for calculating pieces of timber from stocks of different sizes, for a timber frame dwelling construction, and "small" in my context is 1 stock length option and 10 different kinds of items to be cut out of it. This usually means 15-20 columns (cutting patterns), and it generates a sensible high quality answer within a few seconds.

However, when I increase the size of the problem to more meaningful size for my purposes, which is 5 stock length options and 30-40 different part sizes (200-300 individual items) - the behaviour of the implementation becomes erratic and unstable. No other code changes ofc, just different inputs.

These kind of oddities start appearing:

160. {0 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]}
161. {1 [0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]}
162. {2 [0 4598175219545276420 4594899874361734237 4594899874361734241 4594899874361734237 4591215111030249294 4591215111030249295 0 0 4591215111030249294 4591215111030249294 0 0 4594081038065848694 0 0 0 4589168020290535427 0 0 4582207911775508285 0 4594121979880642965 4596373779694328210 0 4589577438438478202 0 0 0 0 0 0 0 1 0 0 0]}
163. {3 [0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0]}
164. {0 [2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]}

goroutine 1 gp=0x140000021c0 m=4 mp=0x14000080008 [running]: runtime.throw({0x102a1fc5b?, 0x102a08fc8?}) /opt/homebrew/Cellar/go/1.22.3/libexec/src/runtime/panic.go:1023 +0x40 fp=0x140000c93a0 sp=0x140000c9370 pc=0x1029a5120 runtime.sigpanic() /opt/homebrew/Cellar/go/1.22.3/libexec/src/runtime/signal_unix.go:895 +0x22c fp=0x140000c9400 sp=0x140000c93a0 pc=0x1029bd14c runtime.(pollDesc).info(...) /opt/homebrew/Cellar/go/1.22.3/libexec/src/runtime/netpoll.go:141 runtime.netpollcheckerr(...) /opt/homebrew/Cellar/go/1.22.3/libexec/src/runtime/netpoll.go:511 internal/poll.runtime_pollReset(0x80000000000?, 0x7ffff80000000000?) /opt/homebrew/Cellar/go/1.22.3/libexec/src/runtime/netpoll.go:318 +0x4 fp=0x140000c9410 sp=0x140000c9410 pc=0x1029d3744 internal/poll.(pollDesc).prepare(0x140000940c0?, 0x102a039f4?, 0xac) /opt/homebrew/Cellar/go/1.22.3/libexec/src/internal/poll/fd_poll_runtime.go:68 +0x28 fp=0x140000c9440 sp=0x140000c9410 pc=0x1029fefb8 internal/poll.(pollDesc).prepareWrite(...) /opt/homebrew/Cellar/go/1.22.3/libexec/src/internal/poll/fd_poll_runtime.go:77 internal/poll.(FD).Write(0x140000940c0, {0x14000178000, 0x54, 0x80}) /opt/homebrew/Cellar/go/1.22.3/libexec/src/internal/poll/fd_unix.go:371 +0xc8 fp=0x140000c94f0 sp=0x140000c9440 pc=0x1029ff528 os.(File).write(...) /opt/homebrew/Cellar/go/1.22.3/libexec/src/os/file_posix.go:46 os.(File).Write(0x140000aa018, {0x14000178000?, 0x54, 0x140000c9670?}) /opt/homebrew/Cellar/go/1.22.3/libexec/src/os/file.go:189 +0x5c fp=0x140000c9550 sp=0x140000c94f0 pc=0x1029ffdec fmt.Fprintf({0x102a7f8a8, 0x140000aa018}, {0x102a1fdf5, 0x7}, {0x140000c9670, 0x2, 0x2}) /opt/homebrew/Cellar/go/1.22.3/libexec/src/fmt/print.go:225 +0x84 fp=0x140000c95b0 sp=0x140000c9550 pc=0x102a03a24 fmt.Printf(...) /opt/homebrew/Cellar/go/1.22.3/libexec/src/fmt/print.go:233 github.com/alexykot/brick/lib/scp.(columnGenSolver1D).solveRMPInteger(0x140000c9b10, {0x14000216008, 0xea, 0x102b04150?}) /Users/alexykot/Work/Projects/Brick/Lang/go/lib/scp/column_gen_solver_1d.go:358 +0x1e8 fp=0x140000c96d0 sp=0x140000c95b0 pc=0x102a12f98 github.com/alexykot/brick/lib/scp.(columnGenSolver1D).columnGeneration(0x140000c9b10) /Users/alexykot/Work/Projects/Brick/Lang/go/lib/scp/column_gen_solver_1d.go:150 +0x5b0 fp=0x140000c9880 sp=0x140000c96d0 pc=0x102a11d30 github.com/alexykot/brick/lib/scp.(*columnGenSolver1D).Solve(0x140000c9b10) /Users/alexykot/Work/Projects/Brick/Lang/go/lib/scp/column_gen_solver_1d.go:67 +0x264 fp=0x140000c9970 sp=0x140000c9880 pc=0x102a11604 github.com/alexykot/brick/lib/scp.solveWithApproach({0x102b09000, 0x5, 0x5}, {0x102b09a00, 0x25, 0x25}, 0x10?, {0x102a20490, 0xa}) /Users/alexykot/Work/Projects/Brick/Lang/go/lib/scp/scp1d.go:83 +0x1e8 fp=0x140000c9bd0 sp=0x140000c9970 pc=0x102a183f8 github.com/alexykot/brick/lib/scp.SolveStockCutting1D({0x102b09000?, 0x5?, 0x5?}, {0x102b09a00, 0x25, 0x25}, 0x8?) /Users/alexykot/Work/Projects/Brick/Lang/go/lib/scp/scp1d.go:42 +0x208 fp=0x140000c9e30 sp=0x140000c9bd0 pc=0x102a17ce8 main.main() /Users/alexykot/Work/Projects/Brick/Lang/go/lib/scp/explorer1d/explorer.go:90 +0x64 fp=0x140000c9f40 sp=0x140000c9e30 pc=0x102a1d6a4 runtime.main() /opt/homebrew/Cellar/go/1.22.3/libexec/src/runtime/proc.go:271 +0x28c fp=0x140000c9fd0 sp=0x140000c9f40 pc=0x1029a7b1c runtime.goexit({}) /opt/homebrew/Cellar/go/1.22.3/libexec/src/runtime/asm_arm64.s:1222 +0x4 fp=0x140000c9fd0 sp=0x140000c9fd0 pc=0x1029d8414



- not "liking" one specific stock length. 4 other options calculated perfectly well in any combinations, while the last one, largest (probably just a coincindence) - does not get calculated under any conditions, the RMP integer just hangs.

- even `makeslice len out of range`, with suggested slice size looking like stack overflows again

This all appears to me like a memory corruption of sorts, like something is fiddling with the bits of my memory structures, but I can't see any source for it in my code.

--- 
The algorithm itself is a textbook implementation of Column Generation. I couldn't find anything ready-made in Go, so I had to implement one myself:
1. create initial suboptimal columns (patterns)
2. define Restricted Master Problem as float LP problem
3. take duals from the RMP solution, formulate a subproblem that would suggest next column (pattern) to add
4. rerun the RMP with expanded column set
5. try creating a new column again
6. keep trying until no more new columns are found
7. redefine RMP as integer problem, get the result

The suspicious thing is that this erratic behavior appeared after I added new column selection subproblem (step 3) as LP subproblem formulation. Before that I was using backtrack algorithm to generate new patterns, which was much slower and produced worse results, but it was stable (though took multiple minutes to run). 

Now with the subproblem as LP problem (and doubled use of golp and LP-solve) - it runs perfectly and in seconds on small problems, but is very unstable on larger ones.

I don't want to "blame the library", the problem is probably in my code, but I've been over it so many times in last few days that I can't see it anymore, and some input will be very appreciated. I can share my SCP implementation if that would be helpful.

I wasn't familiar with Linear Programming before, and had to figure out the basics for this task. I'm probably missing something obvious here.
draffensperger commented 1 month ago

I'm not sure I have enough expertise in LPSolve or linear programming to answer effectively here. I wrote this library as a wrapper a while back for a project but have not actively been using it recently.

alexykot commented 1 month ago

Actually, I think I've found the issue, and it is indeed memory corruption. It's not an LPSolve problem, it's a CGO problem.

After some desperate googling I came across a suggestion that passing a pointer to a Go type into C code may lead to issues as C may keep using the memory being pointed at, with no regard that the same memory addresses may be also used by Go code. And it looks like this is what's happening here.

I've swapped out using Duals() func that I've added earlier to iterating over DualResult() - and the issue has gone away.

My implementation of Duals() is passing a pointer to target Go slice to be filled in into the get_dual_solution() function. This is how it's intended to be used, and in simple cases it works fine. But it seems that the pointer to a slice that I'm passing in continues to be used somewhere inside the LPSolve beyond the mere filling in the dual values. Go frees up that memory after the Duals() function returns, and probably tries to reuse it for some other purposes, without knowledge that it is being also used by the C side of the program. It is not immediately visible in a simple single run in a test case, but under intense use this comes out as random corruption of other data structures on the Go side. There may be issues on the C side as well, but I can't observe those as easily.

Anyway, this is my best explanation for the issue, and for the fact that not using Duals() and retrieving same values using individual calls to DualResult(i) which returns a simple single float value for given index - solves the problem completely.


I'll see if I can modify the Duals() implementation to avoid passing on pointers from the Go side. Worst case - will have to make use of get_var_dualresult() internally, which is probably marginally less efficient, but is safe.

Btw, similar approach is used in Variables() implementation, where I copied it from. That one doesn't seem to cause any issues, at least I'm not observing anything with my usage pattern. I guess it's just implementation nuances on the LPSolve side, where pointer passed into get_variables() doesn't get reused, but pointer passed into get_dual_solution() - does. Since that one works fine - I'm not going to touch that, also there seems to be no other options in the LPSolve API for retrieving variables, while there are several options for duals. Not inclined to try and fix LPSolve itself, that's far beyond my current ability 🙂


Yeah, there is no blame on the library unless the library has your own code in it, and then - all bets are off 😃

draffensperger commented 1 month ago

Nice find here and thanks for posting all this! If you can identify a way to solve this for others as well that would be great.

alexykot commented 1 month ago

OK, I've added a PR #23 to change Duals() and fix this.

This is a stopgap solution that just iterates over whole length and retrieves every value individually. A better solution would have been probably to make use of get_ptr_dual_solution(), but I'm not sure right now how exactly that should be done, and seems that it will require some work in C directly. It's possible, but beyond what I can currently contribute, so I'll leave it for another time.

draffensperger commented 1 month ago

Thanks for the contribution! Closing as PR is merged now.