FXIhub / hummingbird

Monitoring and Analysing flash X-ray imaging experiments
http://fxihub.github.io/hummingbird
BSD 2-Clause "Simplified" License

Backend sometimes crashes with MemoryError #65

Closed FilipeMaia closed 8 years ago

FilipeMaia commented 8 years ago
[1,0]<stderr>:  File "./../hummingbird/hummingbird.py", line 67, in <module>
[1,0]<stderr>:    main()
[1,0]<stderr>:  File "./../hummingbird/hummingbird.py", line 51, in main
[1,0]<stderr>:    worker.start()
[1,0]<stderr>:  File "/reg/neh/operator/amoopr/amo87215/hummingbird/src/backend/worker.py", line 79, in start
[1,0]<stderr>:    self.event_loop()
[1,0]<stderr>:  File "/reg/neh/operator/amoopr/amo87215/hummingbird/src/backend/worker.py", line 96, in event_loop
[1,0]<stderr>:    ipc.mpi.master_loop()
[1,0]<stderr>:  File "/reg/neh/operator/amoopr/amo87215/hummingbird/src/ipc/mpi.py", line 97, in master_loop
[1,0]<stderr>:    msg = comm.recv(None, MPI.ANY_SOURCE, status = status)
[1,0]<stderr>:  File "Comm.pyx", line 816, in mpi4py.MPI.Comm.recv (src/mpi4py.MPI.c:72032)
[1,0]<stderr>:  File "pickled.pxi", line 250, in mpi4py.MPI.PyMPI_recv (src/mpi4py.MPI.c:29545)
[1,0]<stderr>:  File "pickled.pxi", line 111, in mpi4py.MPI._p_Pickle.load (src/mpi4py.MPI.c:28058)
[1,0]<stderr>:MemoryError
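
For context, the failing call is the blocking receive in the master's event loop. A minimal sketch of that pattern, inferred from the traceback (the real `master_loop` in `src/ipc/mpi.py` does more bookkeeping, and `handle` is a hypothetical placeholder):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD

def master_loop():
    """Simplified sketch of the master receive loop from the traceback."""
    status = MPI.Status()
    while True:
        # Blocking receive of a pickled message from any worker rank.
        # mpi4py allocates the buffer for the incoming pickle here, which
        # is where the MemoryError in the traceback is raised.
        msg = comm.recv(source=MPI.ANY_SOURCE, status=status)
        handle(msg)  # placeholder for the actual message dispatch
```
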
cnettel commented 8 years ago

In general, we are using a memory jail on the workers to avoid runaway scenarios that could bring down the whole machine. Some data transfers (especially if you experiment wildly with buffer sizes) legitimately use "a lot" of memory without using "excessively much", and in that case this is not really a bug.
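
One way such a per-worker memory jail could be set up (a sketch only; hummingbird's actual jail may work differently, e.g. via the batch system) is a hard address-space limit through the resource module:

```python
import resource

def limit_worker_memory(max_bytes):
    """Hypothetical helper: cap the worker's address space so a runaway
    allocation raises MemoryError instead of exhausting the whole machine."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))

# Example: allow each worker roughly 4 GiB of virtual memory.
limit_worker_memory(4 * 1024**3)
```
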

Another, more interesting, question is whether we have buffers piling up inside zmq or MPI and causing the memory strain.
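
One way to test the zmq side of that hypothesis (a sketch, not what hummingbird currently does) is to cap the socket high-water marks so queues cannot grow without bound:

```python
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PUB)

# Limit how many outgoing messages ZeroMQ will queue per peer before it
# starts dropping (PUB) or blocking (other socket types). If memory use
# stops growing with a small HWM, buffers were piling up inside zmq.
sock.setsockopt(zmq.SNDHWM, 100)
sock.bind("tcp://*:5555")
```
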

cnettel commented 8 years ago

This looks very much like a bug that was fixed in pyzmq 14.2.0 and later. I'm checking whether we might be using an older version.
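
A quick way to confirm which pyzmq and libzmq versions are actually loaded at runtime:

```python
import zmq

# pyzmq (Python binding) version vs. the underlying libzmq version;
# the fix referenced above concerns the binding, i.e. pyzmq >= 14.2.0.
print("pyzmq:", zmq.pyzmq_version())
print("libzmq:", zmq.zmq_version())
```
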

cnettel commented 8 years ago

We were using 14.1.1 and have now switched to 15.3.0. Very preliminary results suggest no remaining leak.

cnettel commented 8 years ago

Not looking so good anymore; the leak appears to still be there.

cnettel commented 8 years ago

The bug has been isolated to mpi4py (or OpenMPI?), in the message-receive loop in the master. If messages are not sent/received, the leak is absent. If messages are discarded immediately after receipt, the leak is still present. It is enough that the messages sent correspond to reduce calls, and it persists even if the numpy arrays are replaced with empty strings...
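
A minimal reproducer for that observation might look like the following (a sketch under the stated assumptions: workers keep sending tiny pickled messages, the master receives and immediately discards them, yet resident memory on the master still grows):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
N = 100000  # messages per worker

if rank == 0:
    status = MPI.Status()
    for _ in range(N * (size - 1)):
        # Receive and immediately discard; with the affected
        # OpenMPI/mpi4py combination, the master's memory still grows.
        comm.recv(source=MPI.ANY_SOURCE, status=status)
else:
    for _ in range(N):
        # Even an empty string payload is enough to show the growth.
        comm.send("", dest=0)
```
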

cnettel commented 8 years ago

OpenMPI 1.8.6 was the default version used at LCLS. With assistance from pcds-help we now have a local release with OpenMPI 1.8.8 instead. The memory-leak behaviour is now absent or much reduced.

See https://github.com/open-mpi/ompi-release/pull/357
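
To verify which MPI implementation and version mpi4py is actually linked against (useful when a local release overrides the site default), something like this can be used; on the command line, `ompi_info --version` reports the same for OpenMPI:

```python
from mpi4py import MPI

# Reports the MPI vendor and its version as seen by mpi4py,
# e.g. ('Open MPI', (1, 8, 8)).
print(MPI.get_vendor())
```
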