ERGO-Code / HiGHS


.lp file reader has unusably large RAM requirements on large .lp files #886

Open kkofler opened 2 years ago

kkofler commented 2 years ago

I have tried running the HiGHS CLI on a 1.9 MiB .lp file. The .lp file reader fails spectacularly on it, thrashing swap despite this machine having 16 GiB of RAM! That file can be read just fine by other solvers.

Valgrind Massif points to a single function as the culprit for the memory allocations: Reader::readNextToken from extern/filereaderlp/reader.cpp. 77% of the allocations come directly from Reader::readNextToken and 23% from the std::vector::_M_realloc_insert method that it calls; all other allocations are negligible (at least at the point where I interrupted the measurement due to RAM exhaustion, which was still during input file parsing).

As far as I can tell, this method heap-allocates a C++ object for every single read token (which by itself is already a no-no), wraps it in a std::unique_ptr, and then appends it to the buffer vector this->rawtokens. This buffer is only appended to and never removed from. It is impossible to parse large files using this algorithm.
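
For illustration, the pattern being described boils down to something like the following paraphrase (the actual class and function names in reader.cpp differ):

```cpp
#include <memory>
#include <vector>

// Paraphrase of the pattern described above, not the actual reader.cpp code.
struct RawToken {                 // base of a small polymorphic token hierarchy
    virtual ~RawToken() = default;
};
struct RawNumberToken : RawToken {
    double value;
    explicit RawNumberToken(double v) : value(v) {}
};

struct Reader {
    std::vector<std::unique_ptr<RawToken>> rawtokens;  // grows, never shrinks

    void readNextToken(double parsed_value) {
        // One heap allocation plus one unique_ptr per token, kept for the
        // whole parse, so memory usage scales with the file's token count.
        rawtokens.push_back(std::make_unique<RawNumberToken>(parsed_value));
    }
};
```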

For memory efficiency, the reader needs three changes:

- The parser should stream tokens from the lexer instead of the lexer appending them to a buffer: the lexer returns to the parser after each token, with the token as the return value, and the parser processes that return value directly.
- Tokens that have already been parsed should be discarded, not buffered.
- The token representation should be as compact as possible and free of indirections that require heap allocations. The best option is a POD struct RawToken { RawTokenType type; union { double d; const char *s; } data; } without inheritance, passed around by value rather than by unique_ptr; then only the string data needs one allocation, and all other token types need none (see the sketch below).
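
A minimal sketch of what that could look like, with hypothetical names (Lexer, next, the RawTokenType values) that are not the actual identifiers in extern/filereaderlp/reader.cpp:

```cpp
// Hypothetical token kinds; the real reader.cpp defines its own RawTokenType.
enum class RawTokenType { None, Identifier, Number, Sign, End };

// POD token passed by value: no inheritance, no unique_ptr, no per-token heap
// allocation. Only identifier tokens need the string payload, which can point
// into the file buffer that is already held in memory.
struct RawToken {
    RawTokenType type;
    union {
        double d;       // numeric payload for Number tokens
        const char* s;  // string payload for Identifier tokens
    } data;
};

// Streaming interface: the lexer returns one token per call and keeps no
// buffer; the parser consumes each return value immediately and discards it.
struct Lexer {
    const char* pos;
    const char* end;

    RawToken next() {
        RawToken tok{RawTokenType::End, {0.0}};
        if (pos == end) return tok;
        // ... hypothetical scanning of characters at pos, filling tok ...
        return tok;
    }
};
```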

jajhall commented 2 years ago

Sorry about this. The LP file reader hasn't been tested on files of this size - not that it's particularly large.

Thanks for your detailed observations: I'm sure that they will mean something to the author of the LP file reader, and hope that they can be addressed relatively easily.

feldmeier commented 2 years ago

Thanks for reporting this issue.

Yes, the .lp filereader could be more efficient, but that would require a complete rewrite, which would take more time than I have at the moment. I would be happy to review any pull requests implementing these changes.

Right now, the .lp filereader should use an amount of memory that is linear in the model size - as you noted, it stores each token in a buffer. This is not horrible, though, since it should be in the same order of magnitude as storing the HighsModel afterwards (of which multiple copies exist at runtime anyway).

However, a 1.9 MB file should never end up using 16 GB of RAM, even when storing all tokens without removing them from the buffer. I suspect there is another bug - maybe an infinite loop continuously adding tokens - in the filereader. Would you mind sharing the .lp file to track this down?

svigerske commented 2 years ago

We also have some performance problems with the LP reader (a 2.9GB file takes over 50GB RAM and over 15 minutes to read, while CPLEX reads it in 60 seconds), but I agree that for 1.9MB the current implementation shouldn't take over 16GB. It's not clear which version of HiGHS was used here, unfortunately. #841 fixed an issue about running into an infinite loop.

I'm going to try some of the suggested improvements in branch 886-improve-lp-reader.

jajhall commented 2 years ago

Thanks. @feldmeier found the reported behaviour to be odd, but if you've seen it in GAMS, then anything you can do to address it would be much appreciated.

jajhall commented 2 years ago

Is this improved if you use master now that @svigerske has improved the performance of the .lp reader?