FilipeMaia closed this issue 8 years ago.
In general, we use a memory jail on the workers to avoid runaway scenarios that could bring down the whole machine. Some data transfers (especially if you experiment wildly with buffer sizes) will use "a lot" of memory without using "excessively much", in which case this is not really a bug.
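A minimal sketch of such a jail, assuming it is enforced with a POSIX address-space limit (the mechanism and the 4 GiB cap below are assumptions for illustration, not the project's actual code):

```python
import resource

# Cap the worker's virtual address space so a runaway allocation fails with
# MemoryError inside the worker instead of exhausting the whole machine.
LIMIT_BYTES = 4 * 1024**3  # hypothetical 4 GiB budget per worker

soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, hard))
```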
Another, more interesting aspect is whether we have buffers piling up inside zmq or MPI, causing the memory strain.
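One way to rule out zmq-side buffering is to bound the socket queues with a high-water mark; this is pyzmq's standard knob, though the socket type and endpoint in this sketch are made up:

```python
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PUB)

# Bound how many messages libzmq will queue per peer. Once the high-water
# mark is reached, a PUB socket drops new messages rather than letting its
# internal buffers grow without limit.
sock.set_hwm(1000)             # sets both SNDHWM and RCVHWM
sock.bind("tcp://*:5555")      # hypothetical endpoint
```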
This looks very much like a bug that was fixed in pyzmq 14.2.0. Exploring whether we might be running an older version.
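A quick way to confirm which pyzmq is actually loaded at runtime, using pyzmq's standard version-introspection calls:

```python
import zmq

# Report the pyzmq and libzmq versions actually in use, and fail loudly if
# we are still on a release older than the 14.2.0 fix.
print("pyzmq: ", zmq.pyzmq_version())
print("libzmq:", zmq.zmq_version())
assert zmq.pyzmq_version_info() >= (14, 2, 0), "pyzmq predates the leak fix"
```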
We were using 14.1.1 and have now switched to 15.3.0. Very preliminary results suggest there is no remaining leak.
Not looking so good anymore: the leak has reappeared.
The bug has been isolated to mpi4py (or OpenMPI?), in the master's message-receipt loop. If messages are not sent/received, the leak is absent; if messages are discarded immediately after receipt, the leak is still present. It is enough that the messages sent correspond to reduce calls, and it persists even when the numpy arrays are replaced with empty strings...
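A stripped-down sketch of the kind of receive loop used to isolate this (the message count and empty-string payload are illustrative; the real master does more with each message):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
N = 100000  # hypothetical number of messages per worker

if comm.Get_rank() == 0:
    # Master: receive and immediately discard every message. Under
    # OpenMPI 1.8.6 the master's memory still grew in this form, even
    # with empty-string payloads instead of numpy arrays.
    for _ in range(N * (comm.Get_size() - 1)):
        msg = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG)
        del msg
else:
    # Workers: send bare messages of the sort that accompany reduce calls.
    for _ in range(N):
        comm.send("", dest=0)
```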
OpenMPI 1.8.6 was the default version at LCLS. With assistance from pcds-help we now have a local release that includes OpenMPI 1.8.8 instead. The memory leak is now absent or much reduced.
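To verify that mpi4py is really linked against the upgraded OpenMPI rather than the system default, the MPI-3 library-version query can be used:

```python
from mpi4py import MPI

# Should report the local 1.8.8 build rather than the LCLS default 1.8.6.
print(MPI.Get_library_version())  # e.g. "Open MPI v1.8.8, ..."
```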