ImperialCollegeLondon / pyrealm

Development of the pyrealm package, providing an integrated toolbox for modelling plant productivity, growth and demography using Python.
https://pyrealm.readthedocs.io/
MIT License

Fixing profiling issues #208

Closed davidorme closed 3 months ago

davidorme commented 3 months ago

Description

This PR resolves the profiling issues described in #207. It got a bit larger than expected, as there were a few interlinked issues to untangle in order to get clean profiling and benchmarking workflows.

I have a broader concern that this benchmarking is going to be continually throwing up issues that arise from different runner specs, but we'll just have to see.

But, that aside: does this all look sane? Does the new graph make sense? Actually, something is wrong there: the plot from a previous failed run hasn't been replaced by the one from the most recent passing run.

@tztsai - it would be great if you could have a look at this, but I realise you're on another project.

Type of change

Key checklist

Further checks

codecov-commenter commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 95.17%. Comparing base (971d1a3) to head (1a99181).

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##           develop     #208   +/-   ##
========================================
  Coverage    95.17%   95.17%
========================================
  Files           28       28
  Lines         1701     1701
========================================
  Hits          1619     1619
  Misses          82       82
```

:umbrella: View full report in Codecov by Sentry.

davidorme commented 3 months ago

The actions are all now passing: https://github.com/ImperialCollegeLondon/pyrealm/actions/runs/8633260571

I think the auto-commit is triggering a new push that is then stalling somehow. Those profiling auto-commits probably don't need to run any CI: we could add `[skip ci]` to the commit message?
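A minimal sketch of that idea, assuming the results are committed with a plain `git commit` run step (the step name, paths and commit message here are placeholders, not the workflow's actual ones):

```yaml
# Hypothetical auto-commit step: GitHub Actions skips new workflow runs for
# pushes whose head commit message contains "[skip ci]".
- name: Commit profiling results
  run: |
    git config user.name "github-actions[bot]"
    git config user.email "github-actions[bot]@users.noreply.github.com"
    git add profiling/
    git commit -m "Update profiling results [skip ci]" || echo "Nothing to commit"
    git push
```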

davidorme commented 3 months ago

I think we've got issues with the benchmarking. It is now working as we intended, but three runs of essentially the same code (only the profiling CI workflow is changing) are giving wildly different relative run times. The two profiling graphs updated in this commit (https://github.com/ImperialCollegeLondon/pyrealm/pull/208/commits/d6dff3497a1b3400e09b7225ac565be012e6c9b3) show the issue.

It could be that the profiling tests have too small a load to give consistent behaviour (they run really fast), or it could be that I've somehow randomised the sort order. I don't think that's the case, though: I manually triggered failures in testing by altering the database, and the correct processes failed the benchmarking. My guess is that differences in runner architecture are going to make this process hard to use within CI. My intuition is that we need a single dedicated benchmarking machine to run these tests?
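If we do end up needing a single machine, one option would be to pin the benchmarking job to a dedicated self-hosted runner. A rough sketch only, with the job name, runner label and script path as placeholders:

```yaml
# Hypothetical job pinned to one self-hosted runner, so that benchmark
# timings are always collected on the same hardware.
benchmark:
  runs-on: [self-hosted, benchmarking]
  steps:
    - uses: actions/checkout@v4
    - name: Run benchmarking
      run: python profiling/run_benchmarking.py
```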

davidorme commented 3 months ago

Also note that, with the call graph copy in the benchmarking job, the failing run_benchmarking.py clobbers that line in the run section, so the call graph is not copied when benchmarking fails. It needs its own step.
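Something like the following sketch would do it, assuming the current single run step is split in two (the step names and file paths are illustrative); `if: always()` makes the copy step run even when the benchmarking step fails:

```yaml
# Hypothetical split into two steps: the call graph copy runs even if the
# benchmarking comparison fails, so the graph can still be inspected.
- name: Run benchmarking
  run: python profiling/run_benchmarking.py

- name: Copy call graph
  if: always()
  run: cp profiling/call-graph.svg profiling/report/
```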

MarionBWeinzierl commented 3 months ago

I would suggest we try going back to a bigger problem size, to make sure we rule out random noise as a major factor in the runtimes.

davidorme commented 3 months ago

I agree - I think we can simply tile the current inputs to increase the load. A couple of other things:

  1. Those plots clearly show increasing variance from the top (longest running) to the bottom (shortest running), which is basically just more noise in the quick-running processes. The previous code only looked at calls that took longer than a threshold value, but I wanted to try to get a more holistic view of execution time. We could put in some kind of tuneable runtime filter, such as only including the processes that account for 95% of the total runtime.
  2. I think I've also switched the benchmarking target for a process from the maximum previous value to the minimum. I don't like using the maximum, but we could use another statistic, maybe something based on the range.

MarionBWeinzierl commented 3 months ago

As discussed on Slack, we might also want to reduce how often this runs, i.e. only run it on merges into develop or main.
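A rough sketch of that trigger change (the workflow name is a placeholder): restricting the workflow to pushes on those two branches means it only runs when a PR is merged, rather than on every PR update.

```yaml
# Hypothetical trigger: only run profiling/benchmarking when commits land on
# develop or main, not on every pull request push.
name: profiling
on:
  push:
    branches:
      - develop
      - main
```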