Crash with searches over 5000 nodes when running on small memory GPUs

dje-dev / Ceres

Ceres - an MCTS chess engine for research and recreation

GNU General Public License v3.0


Crash with searches over 5000 nodes when running on small memory GPUs #8

Closed: dje-dev closed this issue 3 years ago

dje-dev commented 3 years ago

Ceres switches into a different mode at the 5000-node threshold and creates a second backend session, which doubles its memory requirements (to beyond those of LC0). On GPUs with little memory running large networks, this may result in a crash. At a minimum, error reporting should be improved.

dje-dev commented 3 years ago

As suggested by borg, VRAM requirements can be reduced by lowering the backend's max batch size, probably with something like: --backend-opts=(backend=cuda,gpu=0,max_batch=512). The engine then also needs to know not to build batches larger than this size. A user-facing boolean setting such as "small-memory" could be introduced, which would halve various parameters, including this batch size and the sizes of other internal data structures. It could possibly be set automatically if the GPU has little memory.
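To illustrate the batching constraint described above, here is a minimal sketch in Python (not Ceres's actual C# code). The names effective_max_batch, split_into_batches, and DEFAULT_MAX_BATCH are hypothetical, the default of 1024 is an assumed value used only for illustration, and the halving rule simply follows the "small-memory" idea from the comment.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Hypothetical default cap; not taken from Ceres or LC0.
DEFAULT_MAX_BATCH = 1024


def effective_max_batch(max_batch: int = DEFAULT_MAX_BATCH,
                        small_memory: bool = False) -> int:
    """Return the batch-size cap, halved when the assumed small-memory flag is set."""
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    # Never let the halved cap drop below one position per batch.
    return max(1, max_batch // 2) if small_memory else max_batch


def split_into_batches(positions: Sequence[T], max_batch: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of positions, none longer than max_batch."""
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    for start in range(0, len(positions), max_batch):
        yield list(positions[start:start + max_batch])
```

With small_memory=True and a cap of 1024, effective_max_batch returns 512, and 1300 pending positions would be split into batches of 512, 512, and 276, so the backend never receives more than the configured max_batch.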

dje-dev commented 3 years ago

Ceres v0.90-rc1 (just released) significantly reduces GPU memory consumption, which is now nearly identical to that of LC0.