pspacek opened this issue 6 months ago (status: Open)
I think this is probably the conjecture data cache in internal/conjecture/engine.py. It is bounded but large, at CACHE_SIZE = 10000 examples. There is a memory/performance trade-off here, and the benefits of this cache are mainly in the shrinking phase.
FWIW: changing to CACHE_SIZE = 1000 looks like it caps the stateful_minimal test to under 50 MB. But there are other large-ish caches, so to identify leaks as opposed to plus-size caches we probably need to run 5-10k examples just to get to steady state.
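For local experiments, a quick way to try a smaller cache is to patch the constant before any tests run. This is only a sketch: it assumes the module-level CACHE_SIZE quoted above is still read when the engine builds its cache, which is an internal detail and may change between versions.

```python
# Experiment only: shrink the conjecture example cache before tests run.
# Assumes hypothesis.internal.conjecture.engine still exposes CACHE_SIZE
# as a module-level constant (internal API, may change without notice).
import hypothesis.internal.conjecture.engine as engine

engine.CACHE_SIZE = 1000  # default is 10000
```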
I wanted to see when memory consumption flattens out, so I've modified stateful_minimal.py.gz with max_examples=5000000 and ran it with:
$ pytest --hypothesis-show-statistics --hypothesis-explain stateful_minimal.py
Occasionally this fails with:
E hypothesis.errors.Flaky: Inconsistent results from replaying a test case!
E last: VALID from None
E this: INTERESTING from DeadlineExceeded at /usr/lib/python3.12/site-packages/hypothesis/core.py:846
Sometimes it does finish, after about 5 minutes:
stateful_minimal.py::stateful_minimal::runTest:
- during reuse phase (0.00 seconds):
- Typical runtimes: ~ 1ms, of which ~ 1ms in data generation
- 1 passing examples, 0 failing examples, 0 invalid examples
- during generate phase (313.17 seconds):
- Typical runtimes: ~ 0-3 ms, of which ~ 0-3 ms in data generation
- 69508 passing examples, 0 failing examples, 0 invalid examples
- Stopped because nothing left to do
At this point it ate 440 MB of memory.
Here's memory consumption over time, with 10 second sampling interval:
RssAnon: 24320 kB
RssAnon: 93056 kB
RssAnon: 162176 kB
RssAnon: 229632 kB
RssAnon: 291840 kB
RssAnon: 337616 kB
RssAnon: 354128 kB
RssAnon: 355280 kB
RssAnon: 355280 kB
RssAnon: 355360 kB
RssAnon: 355360 kB
RssAnon: 355360 kB
RssAnon: 355360 kB
RssAnon: 354792 kB
RssAnon: 354792 kB
RssAnon: 355816 kB
RssAnon: 361192 kB
RssAnon: 366056 kB
RssAnon: 371176 kB
RssAnon: 375656 kB
RssAnon: 382952 kB
RssAnon: 389480 kB
RssAnon: 393064 kB
RssAnon: 403048 kB
RssAnon: 411624 kB
RssAnon: 421224 kB
RssAnon: 429544 kB
RssAnon: 435688 kB
RssAnon: 439912 kB
RssAnon: 450792 kB
(Generated with a trivial while true; do grep RssAnon /proc/`pgrep -f unittest`/status; sleep 10; done.)
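Roughly the same sampling can be done from Python; the sketch below assumes Linux /proc and a known PID (the PID used here is a placeholder).

```python
# Sketch: print RssAnon of a given process every 10 seconds (Linux only).
import time

def rss_anon_kb(pid: int) -> int | None:
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("RssAnon:"):
                return int(line.split()[1])  # value is reported in kB
    return None

if __name__ == "__main__":
    pid = 12345  # placeholder: PID of the unittest/pytest process
    while True:
        print(f"RssAnon: {rss_anon_kb(pid)} kB")
        time.sleep(10)
```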
Is it possible that it caches all 69508 examples? Or why else would it still be consuming more memory?
Please note I'm not saying this is some huge problem - at least until we get to hypofuzzing state machines (which is the original motivation for my experiment).
Hm. I wonder if that DeadlineExceeded might be from a GC run. Also wonder if that post-plateau growth would be influenced by forcing GC occasionally?
So, I ran the test in memray to get better measurements, and in summary I don't see strong evidence of leaks. What I think we're seeing is that memory consumption has two main drivers:
1. CACHE_SIZE in engine.py.
2. DataTree, which tracks previously tried data to avoid repeats, at around 1kB per example. Unbounded.
As evidence (running the stateful test above, about 70k examples):
Default run (memray graph)
Run with CACHE_SIZE=1000 (memray graph)
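For anyone wanting to cross-check this kind of attribution without memray, a tracemalloc snapshot diff is a lighter-weight alternative; the sketch below is illustrative and is not what produced the graphs above.

```python
# Sketch: diff two tracemalloc snapshots to see which call sites grow.
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# ... run the test suite / a large batch of examples here ...

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)
```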
Possible actions:
- DeadlineExceeded, which I guess is caused by GC, is an annoying flake. Should we perhaps force periodic gc.collect() outside of the tests, or add gc callbacks to account for the time taken?
- Reduce CACHE_SIZE, or somehow be more selective in what we cache or retain.
@Zac-HD, @tybug, does this sound reasonable?
I think either of you has a better intuition for memory size than I do. I would say it is very possible memory usage has increased recently due to ir-related churn, and we should do a detailed pass once things settle. In particular, 1kB/example for DataTree sounds a bit high to me. I wonder what these graphs look like pre-ir; say v6.98.0.
That's larger than I was expecting, but makes sense once I actually think about the data structures and objects involved. I agree that it's plausible the IR migration is causing temporarily higher memory usage, both because we're carrying around two representations now, and because the IR hasn't really been memory-optimized yet. Not worth trying to fix that before we finish migrating though.
Adding gc.collect() before each (or each 1000th) invocation of the user code isn't a terrible idea, but I'm not sure it'd avoid occasional gc pauses anyway - for that you'd have to call gc.freeze(), and that's a semantics change I don't want.
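For illustration only, the "collect every Nth invocation" idea could look something like the sketch below; this is not Hypothesis code, and the counter and threshold are arbitrary.

```python
# Sketch of "force a collection every Nth test-function invocation".
import gc
import itertools

_invocations = itertools.count()

def maybe_collect(every: int = 1000) -> None:
    # Run a full collection once per `every` invocations of the user's test.
    if next(_invocations) % every == 0:
        gc.collect()
```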
Good idea to compare with pre-ir! v6.98.0 below; at first I left it running over a late breakfast and it didn't finish... I guess the exhaustion tracking has improved! Anyway, with the number of examples set equal:
You'll notice immediately that we use a lot less memory in the post-ir world: the (probable) DataTree contribution is halved or more. Yay!
Less obviously, we produce more reference cycles, so the GC works harder. It's worth keeping an eye on this during development/migration, since data structures are harder than code to "fix" afterwards.
[...] it's plausible the IR migration is causing temporarily higher memory usage, both because we're carrying around two representations now, and because the IR hasn't really been memory-optimized yet. Not worth trying to fix that before we finish migrating though.
Yep, agree on this (with both of you). We can revisit afterwards - for the record, if nothing else.
Adding gc.collect() before each (or each 1000th) invocation of the user code isn't a terrible idea, but I'm not sure it'd avoid occasional gc pauses anyway - for that you'd have to call gc.freeze(), and that's a semantics change I don't want.
Or alternatively, record and account for the pauses using gc.callbacks. Worth it? I don't know, but it is likely to hit any test with a high example count and the default deadline setting.
Or alternatively, record and account for the pauses using gc.callbacks. Worth it? I don't know, but it is likely to hit any test with a high example count and the default deadline setting.
Oh, that's a better idea, we can attribute gc time correctly in statistics and observability output too.
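A rough sketch of how gc.callbacks could be used for that (standard-library API; the bookkeeping names here are illustrative, not anything that exists in Hypothesis):

```python
# Sketch: accumulate total time spent in GC so it could later be subtracted
# from per-example runtimes or surfaced in statistics/observability output.
import gc
import time

_gc_started_at = None
gc_seconds_total = 0.0

def _on_gc(phase, info):
    global _gc_started_at, gc_seconds_total
    if phase == "start":
        _gc_started_at = time.perf_counter()
    elif phase == "stop" and _gc_started_at is not None:
        gc_seconds_total += time.perf_counter() - _gc_started_at
        _gc_started_at = None

gc.callbacks.append(_on_gc)
```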
Versions tested
Summary
For some reason even trivial stateless and stateful test cases leak memory at a rate of several megabytes per second.
Reproducers
Here are two test-nothing test cases:
test_given.py.gz - leaks on average ~ 2 MB/s, in the second half just 1 MB/s. On my machine the memory consumption goes from 25 MB to 145 MB at the end.
stateful_minimal.py.gz - leaks on average ~ 7 MB/s, and still 7 MB/s in the second half of the runtime. On my machine the memory consumption goes from 25 MB to 170 MB at the end (a rough sketch of such a test-nothing stateful case follows below).
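The attachments themselves aren't reproduced here; the sketch below is an illustrative reconstruction of what a "test-nothing" stateful case looks like, not the exact contents of stateful_minimal.py.gz.

```python
# Illustrative reconstruction of a do-nothing stateful test; the real
# stateful_minimal.py.gz attachment may differ in detail.
from hypothesis import settings
from hypothesis.stateful import RuleBasedStateMachine, rule
import hypothesis.strategies as st


class StatefulMinimal(RuleBasedStateMachine):
    @rule(value=st.integers())
    def do_nothing(self, value):
        pass  # intentionally keeps no state and asserts nothing


# Runs under `python -m unittest` via the generated TestCase.
stateful_minimal = StatefulMinimal.TestCase
stateful_minimal.settings = settings(max_examples=5000)
```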
Both files can be executed with python -m unittest, so pytest is out of the equation.
I've attempted memory profiling using the Scalene profiler version 1.5.41, but I did not get anything conclusive. At first it pointed to random.getstate(), so I butchered internal/entropy.py:deterministic_PRNG() and control.py:deprecate_random_in_strategy() by commenting out all calls to random.getstate(), but the overall memory consumption did not move. Scalene just started to claim that the memory went to different places.
I don't trust Scalene's memory attribution very much, but there certainly is some sort of leak, because running the tests in a memory-constrained environment invokes the OOM killer:
The simplest way to verify that more memory is really being consumed, and not just "virtually", is adjusting settings(max_examples=100) vs. settings(max_examples=5000). In bash, limiting memory with ulimit -d 50000 allows the small test cases to finish, but the larger ones explode.
I don't understand the subtleties of ulimit, but a similar effect can be achieved with the systemd-run --user --shell -p MemoryMax=50M -p MemoryHigh=45M -p MemorySwapMax=0B command if you are on a system with systemd. In that case the OOM killer does not allow the larger test cases to finish.
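On Linux, a similar data-segment cap can also be set from inside Python before the tests start; the sketch below mirrors the ulimit -d 50000 value above (ulimit -d counts kilobytes, while setrlimit takes bytes).

```python
# Sketch: cap the data segment from within the process, roughly equivalent
# to `ulimit -d 50000`. Exceeding the cap typically surfaces as MemoryError
# rather than the cgroup OOM kill used by the systemd-run variant.
import resource

limit_bytes = 50_000 * 1024
resource.setrlimit(resource.RLIMIT_DATA, (limit_bytes, limit_bytes))
```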