canonical / operator

Pure Python framework for writing Juju charms
Apache License 2.0

Optimise Scenario #1434

Open · PietroPasotti opened this issue 2 weeks ago

PietroPasotti commented 2 weeks ago

The unit test suite for traefik is probably the largest Scenario test battery around. Today I ran it and realized it took over a minute and a half to complete, so I decided to profile it.

This is the result: [image: profiling results]

To produce the profile, run this tox env (`tox -e profile-scenario-tests`):

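```ini
[testenv:profile-scenario-tests]
description = Profile scenario unit test battery
deps =
    pytest
    pytest-profiling
    ops[testing]>=2.17
    -r{toxinidir}/requirements.txt
# --profile collects cProfile data for the run; --profile-svg also renders it as
# an SVG call graph, which needs Graphviz's `dot` installed.
# {[vars]tst_path} refers to a [vars] section defined elsewhere in the same tox.ini.
commands =
    pytest -v --tb native {[vars]tst_path}/scenario --profile --profile-svg --log-cli-level=INFO -s {posargs}
```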
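To look at a single slow test without tox, the standard library's `cProfile` and `pstats` produce a comparable text report. This is a minimal sketch, not part of the tox setup above; `profile_call` and `slow_test` are hypothetical names, and `slow_test` merely stands in for a real Scenario test.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func(*args, **kwargs) under cProfile (hypothetical helper).

    Returns (result, report), where report lists the `top` entries sorted by
    cumulative time, usually the quickest way to spot a hot constructor.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # strip_dirs() shortens file paths; sort_stats() orders by cumulative time.
    stats.strip_dirs().sort_stats(pstats.SortKey.CUMULATIVE).print_stats(top)
    return result, stream.getvalue()


def slow_test():
    # Hypothetical stand-in for a single Scenario test.
    return sum(len(str(i)) for i in range(200_000))


if __name__ == "__main__":
    _, report = profile_call(slow_test)
    print(report)
```

Sorting by cumulative time surfaces entry points like a class's `__new__` even when the real work happens in helpers it calls.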

There are some obvious candidates for optimization:

Profiling Scenario's own test suite yields a very similar picture: [image: profiling results]

tonyandrewmeyer commented 1 week ago
> • using an in-memory mock for the sqlite db instead of using the real thing could shave off a good chunk of time spent in pointless IO

This indeed seems like an obvious win. Presumably we can just pass in `:memory:` instead of the unit state filename: trivial to do, and we're not testing SQLite, so no downsides.
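As a rough illustration of why this helps, here is a stdlib-only sketch; it is not Scenario's or ops' storage code, and `open_state_db`, `save_snapshot`, `load_snapshot` and the `snapshot` table are hypothetical. With a filename, each autocommitted write goes through SQLite's on-disk journal; with `:memory:`, the same SQL runs against a private in-RAM database that disappears when the connection closes.

```python
import os
import sqlite3
import tempfile


def open_state_db(filename: str) -> sqlite3.Connection:
    """Hypothetical stand-in for how a unit-state store might open its database.

    isolation_level=None puts the connection in autocommit mode, so each
    statement is its own transaction; for a file-backed database that
    typically means journal writes to disk on every save.
    """
    conn = sqlite3.connect(filename, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshot (handle TEXT PRIMARY KEY, data BLOB)"
    )
    return conn


def save_snapshot(conn: sqlite3.Connection, handle: str, data: bytes) -> None:
    conn.execute("REPLACE INTO snapshot (handle, data) VALUES (?, ?)", (handle, data))


def load_snapshot(conn: sqlite3.Connection, handle: str) -> bytes:
    row = conn.execute(
        "SELECT data FROM snapshot WHERE handle = ?", (handle,)
    ).fetchone()
    if row is None:
        raise KeyError(handle)
    return row[0]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        on_disk = open_state_db(os.path.join(tmp, "unit-state.db"))
        in_memory = open_state_db(":memory:")  # same API, no file on disk
        for conn in (on_disk, in_memory):
            save_snapshot(conn, "example-handle", b"payload")
            assert load_snapshot(conn, "example-handle") == b"payload"
            conn.close()
        print(os.listdir(tmp))  # only the file-backed database created a file here
```

Because the calling code is identical in both cases, switching a test harness to `:memory:` is a one-argument change.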

> • A ridiculous amount of time is spent in `State.__new__`. Can we do something about that?
> • A single list comprehension in `state.py:143` takes 2 seconds of our time: can we lazify some of the code perhaps?

Both of these come from the positional/keyword argument processing. I wonder if we can find a tidy way to have newer Python versions use the built-in support (the dataclass `kw_only` option, added in Python 3.10) and only fall back to this processing on older Python.
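A minimal sketch of that idea, assuming the per-instance cost comes from a Python-level argument check: on Python 3.10+ hand the work to `dataclasses.dataclass(kw_only=True)`, and only wrap `__init__` on older versions. `frozen_kw_dataclass` and `ExampleState` are hypothetical names, not Scenario's API. Scenario's own classes may allow a few positional arguments, whereas this sketch makes every field keyword-only; on 3.10+ the `dataclasses.KW_ONLY` sentinel could keep leading fields positional.

```python
import dataclasses
import functools
import sys

# Hypothetical helper, not part of ops/Scenario. Evaluated once at import time.
_HAS_NATIVE_KW_ONLY = sys.version_info >= (3, 10)


def frozen_kw_dataclass(cls, *, native: bool = _HAS_NATIVE_KW_ONLY):
    """Make `cls` a frozen dataclass whose fields must be passed by keyword."""
    if native:
        # Python 3.10+: the generated __init__ enforces keyword-only arguments
        # itself, so no extra wrapper runs per instance.
        return dataclasses.dataclass(frozen=True, kw_only=True)(cls)

    # Older Pythons: plain frozen dataclass plus a thin __init__ wrapper that
    # rejects positional arguments.
    cls = dataclasses.dataclass(frozen=True)(cls)
    generated_init = cls.__init__

    @functools.wraps(generated_init)
    def __init__(self, *args, **kwargs):
        if args:
            raise TypeError(f"{cls.__name__}() accepts keyword arguments only")
        generated_init(self, **kwargs)

    cls.__init__ = __init__
    return cls


@frozen_kw_dataclass
class ExampleState:  # hypothetical stand-in for a Scenario state class
    leader: bool = False
    config: dict = dataclasses.field(default_factory=dict)
```

The `native` parameter exists mainly so both code paths can be exercised on a single interpreter.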

> • how come mocking juju-log takes so long?

I wonder if this is the same issue: there are lots of log lines, and each one involves creating a `JujuLogLine` object.
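One way to check whether per-object construction cost dominates is to time building many small records. Everything below is hypothetical: `PlainLogLine`, `CheckedLogLine` and `_KwCheckingBase` are illustrative stand-ins (the base class re-inspects dataclass fields in `__new__` on every call), not Scenario's `JujuLogLine` or its actual argument processing.

```python
import dataclasses
import timeit


@dataclasses.dataclass(frozen=True)
class PlainLogLine:
    """Hypothetical log-line record with no extra construction logic."""
    level: str
    message: str


class _KwCheckingBase:
    """Hypothetical base that inspects dataclass fields on every instantiation."""

    def __new__(cls, *args, **kwargs):
        # Recomputed on each call: this per-instance work is what gets
        # multiplied by the number of log lines.
        required = [
            f.name
            for f in dataclasses.fields(cls)
            if f.default is dataclasses.MISSING
            and f.default_factory is dataclasses.MISSING
        ]
        missing = [name for name in required[len(args):] if name not in kwargs]
        if missing:
            raise TypeError(f"missing arguments: {missing}")
        return super().__new__(cls)


@dataclasses.dataclass(frozen=True)
class CheckedLogLine(_KwCheckingBase):
    level: str
    message: str


def per_call_seconds(factory, number=100_000):
    """Average seconds per call of factory() over `number` calls."""
    return timeit.timeit(factory, number=number) / number


if __name__ == "__main__":
    plain = per_call_seconds(lambda: PlainLogLine(level="INFO", message="hi"))
    checked = per_call_seconds(lambda: CheckedLogLine(level="INFO", message="hi"))
    print(f"plain:   {plain * 1e6:.2f} µs/object")
    print(f"checked: {checked * 1e6:.2f} µs/object")
```

Multiplying the per-object difference by the number of log lines a test run emits gives a rough sense of whether construction overhead explains the juju-log cost.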

benhoyt commented 1 week ago

> Presumably we can just pass in `:memory:` instead of the unit state filename: trivial to do, and we're not testing SQLite, so no downsides.

Yeah, this is probably the single biggest win. I made this change locally and ran `tox -e scenario` on the traefik-k8s-operator tests. My absolute times are different, but I get about 35s without the change and 20s with it, so a nice improvement! FWIW, it looks like Harness already uses `:memory:`.

Happy for us to spend a few hours of rainy-day time looking at the other bottlenecks too. (Let's just be careful not to get too carried away on the long tail...)