The developer onboarding says that we currently use pyinstrument to benchmark the scale_run script, so I thought I'd make a few quick comparisons against ASV:

- asv integrates with git: it has a find function, similar to git bisect, to identify where the "biggest" slowdown in a period occurred. This could be useful for resolving situations where the cron job flags a slowdown at the end of a day, but the day includes multiple PR merges.
- asv is not actively maintained nor up-to-date.

The maintainability issue jumps out as something of a red flag to me, but asv otherwise looks to have slightly better features, at the cost of needing a dedicated machine. pyinstrument seems more flexible, however; it's fairly easy to write a pseudocode GH Actions workflow using it right away:
- Checkout repository
- Setup conda
- Setup conda environment from developer/user docs
- Install pyinstrument into the environment
- Run pyinstrument, producing an HTML output (and maybe a session output so we can reload results later); a sketch of this step follows the list
- Push the HTML file somewhere? Maybe to a separate branch so that we can manually view the files with htmlpreview?
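For the "run pyinstrument" step, something along these lines should work. This is only a sketch: the output file names are placeholders, it assumes scale_run.py can be executed via runpy, and it assumes the profiler session object exposes a save method for the raw output.

```python
# Sketch of the "run pyinstrument" step above. File names are placeholders,
# and scale_run.py may need command-line arguments injected via sys.argv.
import runpy
import sys

from pyinstrument import Profiler

SCRIPT = "src/scripts/profiling/scale_run.py"

profiler = Profiler(interval=0.001)  # default sampling interval, i.e. ~1 frame/ms
profiler.start()
try:
    # Execute the profiling script as if it were invoked from the command line.
    sys.argv = [SCRIPT]  # append any arguments scale_run expects here
    runpy.run_path(SCRIPT, run_name="__main__")
finally:
    profiler.stop()

# Human-readable output we could publish (htmlpreview, GitHub Pages, ...).
with open("scale_run_profile.html", "w") as html_file:
    html_file.write(profiler.output_html())

# Raw session output so the results can be reloaded and re-rendered later.
profiler.last_session.save("scale_run.pyisession")
```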
A couple of options (more details in this file):

- Use asv run --profile so that we collect both benchmarking and profiling outputs, putting them somewhere. The profiling results won't be rendered in HTML or another human-readable format, so we'll need another tool for this.
- Use pyinstrument, in a similar vein to this example. We can manually extract something like the cpu_time to use as a rough benchmarking estimate (relying on the Azure machines being of reasonably similar spec; see the sketch at the end of this comment), and retain the profiling HTML files, publishing them somewhere ourselves. Benchmarking won't be as accurate, but this provides the profiling information in a much more usable way and doesn't require a dedicated machine.
- Use both asv and pyinstrument. This gives us the best of both, but at a heavy compute cost, and it still requires a dedicated machine. We'd also have to investigate how the two HTML deployments play together.

The wgraham/asv-benchmark and wgraham/pyinstrument-profiling-ci branches have (locally working, still need to fix the broken tests!) implementations of both ASV and pyinstrument for the tasks above (on a 1-month-long simulation so the results get produced in ~2 minutes).
Opinions welcome: the github-pages branch of this repository is unused, so we can initially send the HTML outputs there for viewing.
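As a rough illustration of the benchmarking-by-proxy idea in the second option, the saved session can be reloaded and its headline timings read off. The file name is a placeholder, and Session.load / cpu_time are the pyinstrument attributes I believe expose this, so worth double-checking.

```python
# Sketch: extract rough benchmark numbers from a saved pyinstrument session.
from pyinstrument.session import Session

session = Session.load("scale_run.pyisession")  # placeholder file name

print(f"wall-clock duration: {session.duration:.1f} s")
print(f"CPU time:            {session.cpu_time:.1f} s")
```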
Some notes from a meeting of @tamuri, @willGraham01 and myself today to discuss this issue:

- Store results either in a dedicated branch of the TLOmodel repository or in a separate repository, using a simple nested directory structure for organizing results, similar to that created by the Julia BenchmarkCI.jl package (see example output for the ParticleDA.jl repository).
- Preference is for a separate repository, named TLOmodel-outputs / TLOmodel-profiling or similar, as this would avoid the downside of the dedicated-branch approach of possibly adding to issues around the already large size of the repository, and would be in line with longer-term aims of creating a TLOmodel organization and splitting up the existing repository.
- Keep the raw profiler output as well (the pyisession file for pyinstrument).
- On the kinds of things to monitor: for disk I/O, psutil.disk_io_counters might be the way to go (see the sketch at the end of these notes).

NOTE: Even a 1-month simulation produces a pyisession file that is ~300MB, which is well above GitHub's 100MB standard limit. We can either … (1 frame/ms anyway).

At some point, we can move the profiling repo into the TLOmodel org (https://github.com/TLOmodel).
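A minimal sketch of the disk I/O idea, assuming we bracket the profiled run with two psutil snapshots; note these counters are system-wide, so any other activity on the runner will leak into the numbers.

```python
# Sketch: approximate disk I/O attributable to the simulation by differencing
# system-wide counters taken before and after the profiled run.
import psutil

before = psutil.disk_io_counters()
# ... run scale_run under the profiler here ...
after = psutil.disk_io_counters()

print(f"read:    {(after.read_bytes - before.read_bytes) / 1e6:.1f} MB")
print(f"written: {(after.write_bytes - before.write_bytes) / 1e6:.1f} MB")
```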
Closing this as the profiling workflow is now capturing statistics and working reliably.
We would like to be able to track how the timings measured in profiling runs of the src/scripts/profiling/scale_run.py script change as new pull requests are merged in. This would help identify when PRs lead to performance regressions and allow us to be more proactive in fixing performance bottlenecks.

Ideally this should be automated using GitHub Actions workflows. Triggering the workflow on pushes to master would give the most detail, in terms of giving a direct measurement of the performance differences arising from a particular PR, but when lots of PRs are going in it could potentially create a large backlog of profiling runs, so an alternative would be to run on a schedule (for example nightly) using the cron event. It would probably also be worth allowing triggering either using the workflow_dispatch event or using the comment-triggered workflow functionality, to allow manually triggering in PRs that it is thought might have a significant effect on performance before merging.

Key questions to be resolved are what profiling outputs we want to track (for example at what level of granularity, using which profiling tool) and how we want to visualize the outputs. One option would be to save the profiler output as a workflow artifact. While this would be useful in allowing access to the raw profiling data, the only option for accessing workflow artifacts appears to be downloading the artifact as a compressed zip file, so this is not necessarily itself that useful for visualizing the output. One option for visualizing the profiling results would be to use the GitHub Actions job summary, which allows using Markdown to produce customized output shown on the job summary page (sketched below). Another option would be to output the profiling results to HTML files and then deploy these to either a GitHub Pages site or potentially to a static site on Azure storage.
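As a sketch of the job summary option: GitHub Actions exposes a GITHUB_STEP_SUMMARY file that a job can append Markdown to. Which statistics to surface, and the session file name, are assumptions here.

```python
# Sketch: append headline profiling statistics to the GitHub Actions job summary.
import os

from pyinstrument.session import Session

session = Session.load("scale_run.pyisession")  # placeholder file name

# GITHUB_STEP_SUMMARY is set automatically inside Actions jobs and points to a
# file whose Markdown contents are rendered on the job summary page.
with open(os.environ["GITHUB_STEP_SUMMARY"], "a") as summary:
    summary.write("## scale_run profiling\n\n")
    summary.write("| statistic | value |\n|---|---|\n")
    summary.write(f"| wall-clock duration (s) | {session.duration:.1f} |\n")
    summary.write(f"| CPU time (s) | {session.cpu_time:.1f} |\n")
    summary.write(f"| samples | {session.sample_count} |\n")
```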
Potentially useful links
The airspeed velocity (asv) package allows tracking the results of benchmarks of Python packages over time and visualizing the results as plots in a web interface. While focused on suites of benchmarks, it also has support for running single benchmarks with profiling (a minimal benchmark sketch is shown below).
htmlpreview allows directly previewing HTML files in a GitHub repository; GitHub serves such files with the "text/plain" content type, so browsers will not otherwise render them.
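For reference, an asv benchmark is just a function or method whose name starts with time_, discovered from a benchmarks directory. A minimal, hypothetical suite wrapping scale_run might look like the sketch below; the file location, timeout and invocation are all assumptions, and an asv.conf.json would also be needed.

```python
# benchmarks/scale_run_bench.py -- hypothetical asv benchmark suite;
# asv discovers and times any method whose name starts with "time_".
import runpy
import sys


class ScaleRunSuite:
    timeout = 3600  # seconds; generous, since even short simulations are slow

    def time_scale_run(self):
        # Execute the profiling script as if run from the command line;
        # arguments to shorten the simulation would be injected via sys.argv.
        script = "src/scripts/profiling/scale_run.py"
        sys.argv = [script]
        runpy.run_path(script, run_name="__main__")
```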