European-XFEL / karabo-bridge-py

Tools to allow data exchange with Karabo, in particular streaming of data
BSD 3-Clause "New" or "Revised" License

Use zero-copy send in simulator and client #44

Closed · takluyver closed this 6 years ago

takluyver commented 6 years ago

This avoids copying the big detector array when the simulator sends the message. There's a size threshold of 64 KB, so smaller messages will still be sent by copying memory as normal.
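
The mechanism in pyzmq is the `copy` flag on the send calls. A minimal sketch of the idea (not the PR's actual code; the port, header, and array shape here are made up for illustration):

```python
import zmq
import numpy as np

COPY_THRESHOLD = 64 * 1024  # below this, a plain memory copy is cheaper

ctx = zmq.Context()
sock = ctx.socket(zmq.REP)
sock.bind('tcp://*:4545')

data = np.zeros((64, 512, 128), dtype=np.float32)  # big dummy detector array
header = b'{"source": "SPB_DET_AGIPD1M-1/DET/detector"}'  # illustrative only

sock.recv()  # REQ/REP pattern: wait for the client's request
# copy=False hands ZMQ a reference to the buffer instead of copying it.
# Recent pyzmq versions apply a similar size cutoff themselves
# (zmq.COPY_THRESHOLD, 64 KB by default).
sock.send_multipart([header, memoryview(data)],
                    copy=data.nbytes < COPY_THRESHOLD)
```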

This reduces the time spent in send_multipart by around three orders of magnitude (~200 ms to ~200 µs). I suspect there's some platform-level optimisation when we use np.zeros() that means the memory is never really accessed at all.

(This came from investigating #43)

takluyver commented 6 years ago

(My guess about np.zeros() appears to be correct - the kernel will pretend it's giving you a big block of zeroed memory, but won't actually allocate it until you try to write to it. In this case, we never write to it, so it's all mapped to the kernel's shared zero page)
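
This is easy to see on Linux; a small demonstration (not part of the PR, and assuming Linux, where ru_maxrss is reported in kilobytes):

```python
import resource
import numpy as np

def peak_rss_mb():
    # Peak resident set size of this process, in MB (Linux reports KB)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

print(peak_rss_mb())                  # baseline
a = np.zeros(2**30, dtype=np.uint8)   # "allocate" 1 GB of zeros
print(peak_rss_mb())                  # barely moves: nothing is resident yet
a[::4096] = 1                         # write one byte per 4 KB page
print(peak_rss_mb())                  # now roughly 1 GB higher
```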

takluyver commented 6 years ago

I think we're being killed for using too much memory (> 4 GB) on Travis. Something about the zero-copy receive might be resulting in some memory not getting freed. Still trying to figure this out.

codecov-io commented 6 years ago

Codecov Report

Merging #44 into master will decrease coverage by 0.24%. The diff coverage is 92.85%.

```diff
@@            Coverage Diff             @@
##           master      #44      +/-   ##
==========================================
- Coverage   77.97%   77.72%   -0.25%
==========================================
  Files           6        6
  Lines         454      458       +4
==========================================
+ Hits          354      356       +2
- Misses        100      102       +2
```
| Impacted Files | Coverage Δ |
|---|---|
| karabo_bridge/cli/simulation.py | 0% <ø> (ø) |
| karabo_bridge/client.py | 88.05% <100%> (-0.18%) |
| karabo_bridge/simulation.py | 77.68% <88.88%> (-0.39%) |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 941b785...f23b603.

takluyver commented 6 years ago

On further investigation, I think the problem wasn't a 'real' memory leak, but rather objects not being garbage-collected quickly enough to keep memory usage low.

Python's cyclic garbage collector runs when a count of allocations minus deallocations of tracked objects crosses a threshold (700 by default for the youngest generation). So creating a relatively small number of big objects can quickly use up memory before the garbage collector kicks in to release them. The zero-copy machinery must have affected this somehow - perhaps by introducing reference cycles, so arrays were no longer freed by simple refcounting.
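
For reference, the thresholds and counters are visible through the standard-library gc module (a quick illustration of the mechanism, not code from this PR):

```python
import gc

# The youngest generation is collected when (allocations - deallocations)
# of gc-tracked objects since the last collection exceeds the first threshold.
print(gc.get_threshold())  # defaults to (700, 10, 10)
print(gc.get_count())      # current per-generation counters

# A full collection can also be forced, which immediately frees objects
# that are only kept alive by reference cycles:
gc.collect()
```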

So I've introduced an option to simulate one AGIPD module, and used it in the tests. This makes the arrays 16 MB instead of 256 MB. If I'm right about the cause, smaller allocations will build up less memory before the garbage collector gets a chance to reclaim them.

(Of course, if I'm wrong and there is a real memory leak, smaller allocations just put the problem off to later)

takluyver commented 6 years ago

I've now run a client which received and discarded >7000 pulses while I monitored its memory use, and I've satisfied myself that it appears to be just an issue with garbage collection, not an actual memory leak.

The memory usage of my client script repeatedly grows over ~10 seconds to >20 GB, then drops sharply (presumably when GC occurs), and starts growing again. The maximum gradually increases to just over 30 GB, but then a larger drop (GC generation 1?) brings it back down. I couldn't see any long-term increase in memory consumption.
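
For the record, the check was roughly like the following (a sketch, not my exact script; psutil and the endpoint are assumptions, and whatever Client.next() returns is simply discarded):

```python
import psutil
from karabo_bridge import Client

proc = psutil.Process()                   # this process
client = Client('tcp://localhost:4545')   # endpoint is illustrative

for i in range(7000):
    client.next()   # fetch one message and drop it immediately
    if i % 100 == 0:
        print(i, 'RSS: %.1f GB' % (proc.memory_info().rss / 2**30))
```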

I'd expect this pattern to be less pronounced in a real-life scenario, as the code analysing the data will probably make more allocations, so the garbage collector will be triggered more often.