Closed: takluyver closed this 6 years ago
(My guess about np.zeros() appears to be correct - the kernel pretends it's giving you a big block of zeroed memory, but doesn't actually allocate it until you try to write to it. In this case, we never write to it, so it's all mapped to the same page.)
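For reference, a minimal way to check this kind of behaviour on Linux, assuming `psutil` is installed to read the process RSS (the array size here is illustrative):

```python
import numpy as np
import psutil

proc = psutil.Process()
rss_mb = lambda: proc.memory_info().rss / 2**20

print("baseline:        %6.0f MB" % rss_mb())

# Ask for ~1 GB of zeros. On Linux the pages typically aren't backed by
# physical memory yet, so resident memory barely changes.
a = np.zeros((128, 1024, 1024), dtype=np.float64)
print("after np.zeros:  %6.0f MB" % rss_mb())

# Writing forces the kernel to actually allocate the pages.
a[:] = 1.0
print("after writing:   %6.0f MB" % rss_mb())
```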
I think we're being killed for using too much memory (>4 GB) on Travis. Something about the zero-copy receive might be preventing some memory from being freed. Still trying to figure this out.
Merging #44 into master will decrease coverage by 0.24%. The diff coverage is 92.85%.
| Coverage Diff | master | #44 | +/- |
|---|---|---|---|
| Coverage | 77.97% | 77.72% | -0.25% |
| Files | 6 | 6 | |
| Lines | 454 | 458 | +4 |
| Hits | 354 | 356 | +2 |
| Misses | 100 | 102 | +2 |
| Impacted Files | Coverage Δ | |
|---|---|---|
| karabo_bridge/cli/simulation.py | 0% <ø> (ø) | :arrow_up: |
| karabo_bridge/client.py | 88.05% <100%> (-0.18%) | :arrow_down: |
| karabo_bridge/simulation.py | 77.68% <88.88%> (-0.39%) | :arrow_down: |
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 941b785...f23b603.
On further investigation, I think the problem wasn't a 'real' memory leak, but rather objects not being garbage-collected quickly enough to keep memory usage low.
Python's garbage collector runs when a count of allocations minus deallocations reaches a threshold. So creating a relatively small number of big objects can quickly use up memory before the garbage collector kicks in to release them. The zero-copy machinery must have affected this somehow - perhaps by introducing reference cycles, so arrays were no longer freed by simple refcounting.
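For context, the thresholds involved can be inspected through the standard `gc` module; this just illustrates the mechanism, it isn't code from this PR:

```python
import gc

# (threshold0, threshold1, threshold2): a generation-0 collection runs once
# allocations minus deallocations of tracked objects exceed threshold0.
print(gc.get_threshold())   # typically (700, 10, 10)

# A few hundred big arrays can therefore accumulate before any automatic
# collection; an explicit gc.collect() releases cyclic garbage immediately.
gc.collect()
```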
So I've introduced an option to simulate one AGIPD module, and used it in the tests. This makes the arrays 16 MB instead of 256 MB. If I'm right about the cause, the smaller allocations will build up less memory before the garbage collector gets a chance to release them.
(Of course, if I'm wrong and there is a real memory leak, smaller allocations just put the problem off until later.)
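Rough arithmetic behind those sizes, assuming 512 × 128 pixels per AGIPD module, 16 modules in the full detector, 64 pulses per train and float32 data (assumptions, but they reproduce the numbers quoted above):

```python
import numpy as np

one_module    = np.zeros((64, 512, 128), dtype=np.float32)
full_detector = np.zeros((64, 16, 512, 128), dtype=np.float32)

print(one_module.nbytes / 2**20)     # 16.0  -> 16 MB per train
print(full_detector.nbytes / 2**20)  # 256.0 -> 256 MB per train
```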
I've now run a client which received and discarded >7000 pulses while I monitored its memory use, and I've satisfied myself that it appears to be just an issue with garbage collection, not an actual memory leak.
The memory usage of my client script repeatedly grows over ~10 seconds to >20 GB, then drops sharply (presumably when GC occurs), and starts growing again. The maximum gradually increases to just over 30 GB, but then a larger drop (GC generation 1?) brings it back down. I couldn't see any long-term increase in memory consumption.
I'd expect this pattern to be less pronounced in a real-life scenario, as the code analysing the data will probably make more allocations, so the garbage collector will be triggered more often.
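A sketch of the sort of client used for this check, assuming the simulator is serving on tcp://localhost:4545 and using `psutil` for the RSS readings (the exact return value of `Client.next()` differs between karabo_bridge versions):

```python
import psutil
from karabo_bridge import Client

proc = psutil.Process()
client = Client('tcp://localhost:4545')   # assumed simulator endpoint

for i in range(7000):
    data = client.next()                  # receive one message and discard it
    if i % 100 == 0:
        print(i, '%.1f GB' % (proc.memory_info().rss / 2**30))
```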
This avoids copying the big detector array when the simulator sends the message. There's a size threshold of 64 KB, so smaller messages will still be sent by copying memory as normal.
This reduces the time spent in `send_multipart` by around three orders of magnitude (~200 ms to ~200 µs). I suspect there's some platform-level optimisation when we use `np.zeros()` that means the memory is never really accessed at all. (This came from investigating #43)
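For illustration, the general pyzmq technique looks something like this; the socket type, endpoint and message layout are assumptions rather than the exact code in this PR, but the `copy=False` flag and the size threshold are the key parts:

```python
import numpy as np
import zmq

COPY_THRESHOLD = 64 * 1024   # 64 KB: below this, copying is cheaper than
                             # the extra bookkeeping of a zero-copy send

def send_array(socket, header, arr):
    # For big arrays, hand pyzmq the numpy buffer directly (copy=False) so
    # the message is sent from the array's own memory. The array must not
    # be modified until the send has completed.
    do_copy = arr.nbytes < COPY_THRESHOLD
    socket.send_multipart([header, arr], copy=do_copy)

ctx = zmq.Context()
sock = ctx.socket(zmq.REP)        # request/reply setup assumed here
sock.bind('tcp://*:4545')

request = sock.recv()             # REP sockets must receive before replying
send_array(sock, b'detector-data',
           np.zeros((64, 16, 512, 128), dtype=np.float32))
```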