alessandrofelder opened this issue 1 year ago
@sfmig would the suggested structure allow you to easily benchmark the sub-steps (`read`, `process`, `run_main_tool`, `save`) as well as the entire `example_workflow`?
Thanks for putting this together @alessandrofelder! Some comments below.
> would the suggested structure allow you to easily benchmark the sub-steps (`read`, `process`, `run_main_tool`, `save`) as well as the entire `example_workflow`?
Yes, I think that sounds like a good structure! We can have a suite of benchmarks that time each of these steps and also the full workflow. Just to clarify, `read`, `process`, etc. would directly be API functions, right?
> Keep the scripts in a `benchmark-workflows` folder in this repo.
I'd suggest maybe just calling this folder `workflows`? Even if these scripts would be a guide for benchmarks, maybe we could keep this `brainglobe-scripts` repo more independent for now?
> Maybe we should have an entirely different brainglobe-benchmarks or brainglobe-scripts repo instead?
I think a `brainglobe-benchmarks` repo would be a good idea actually (similar to what they do in astropy), even if that means re-doing some of the asv bits we already have... Right now the benchmarks we have follow the modules (we have a benchmark for imports and a benchmark for the `tools` module), whereas I think we would prefer to define benchmarks around workflows instead.
> `read`, `process`, etc. would directly be API functions, right?
Hopefully, yes - but I don't know. They will at least be conceptual groupings of several API functions, and maybe also some generic functions like image filters from `skimage` or so. The conceptual `process`, for example, would actually be applying some transformation, then thresholding, then converting the type of the image (or something like that) - this will crystallise when we write the actual scripts, I guess?
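For illustration, such a conceptual `process` step might group generic operations like this (a minimal sketch; the specific transforms and parameters are placeholder assumptions until the actual scripts are written):

```python
import numpy as np
from skimage import filters, transform, util


def process(image: np.ndarray) -> np.ndarray:
    """Hypothetical 'process' step: a conceptual grouping of generic
    image operations rather than a single API function."""
    # apply some transformation (downscaling is an arbitrary example)
    transformed = transform.rescale(image, 0.5, anti_aliasing=True)
    # threshold (Otsu is an arbitrary choice)
    binary = transformed > filters.threshold_otsu(transformed)
    # convert the image type (boolean mask -> 8-bit)
    return util.img_as_ubyte(binary)
```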
> I'd suggest maybe just calling this folder `workflows`? Even if these scripts would be a guide for benchmarks, maybe we could keep this `brainglobe-scripts` repo more independent for now?
Yea, this naming suggestion was tied to `brainglobe-workflows`. What is the advantage of having a separate `brainglobe-benchmarks` repo compared to having a `benchmarks` subfolder here? (Happy to be convinced of this - just already worried about the number of repos!)
> I think we would prefer to define benchmarks around workflows instead.
I think so too. Let's do that!
> `read`, `process`, etc. would directly be API functions, right?
Actually, maybe, if they are not yet directly API functions, they should become API functions?
> Actually, maybe, if they are not yet directly API functions, they should become API functions?
Yes, ideally the process would be:

- Discussion of workflows' general structure with @alessandrofelder
> Why an environment variable to point to the config file?
Small thought experiment, trying to predict roughly what the GH Actions YAML would look like depending on whether we make the config file an optional CLI argument or an environment variable - having an optional CLI argument makes the job slightly shorter and more explicit, I think. This is one of our main use cases, so I am classing this as a pro for the optional CLI argument.
With an environment variable:

```yaml
jobs:
  if: is_local # !!!! not actual GH actions syntax !!!
  example_matrix:
    strategy:
      matrix:
        brainreg-config:
          [
            /media/hard-drive/large-data-1/config.json,
            ~/large-data-2/config.json
          ]
  brainreg_large_data_job:
    runs-on: ubuntu-latest
    env:
      BRAINREG_CONFIG_PATH: {{ $brainreg-config }}
    steps:
      - name: "Run brainreg workflow"
        run: python brainreg/brainreg-workflow.py
```
With an optional CLI argument:

```yaml
jobs:
  if: is_local
  example_matrix:
    strategy:
      matrix:
        brainreg-config:
          [
            /media/hard-drive/large-data-1/config.json,
            ~/large-data-2/config.json
          ]
  brainreg_large_data_job:
    runs-on: ubuntu-latest
    steps:
      - name: "Run brainreg workflow"
        run: python brainreg/brainreg-workflow.py --config {{ $brainreg-config }}
```
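For comparison of the script side: under the env-var variant, `brainreg-workflow.py` could resolve the config path like this (a sketch; `BRAINREG_CONFIG_PATH` matches the YAML above, while the packaged-default fallback is an assumption):

```python
import os
from pathlib import Path

# Assumed layout: a default config shipped next to the workflow script
DEFAULT_CONFIG_PATH = Path(__file__).parent / "default_config.json"


def config_path_from_env() -> Path:
    """Use the config the CI job points at, else fall back to the default."""
    return Path(os.environ.get("BRAINREG_CONFIG_PATH", DEFAULT_CONFIG_PATH))
```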
Pros of the optional CLI argument vs env var:

- the GH Actions job is slightly shorter and more explicit (see above)
- running locally is explicit, e.g. `$ python brainreg-workflow.py --config /media/local_brainreg_data/config.json` (see the sketch after this list)

Cons of the optional CLI argument vs env var:

- confusing to have several different CLI entry points to the same tool

Other:

- `default_config.json` and `custom_config.json` will have slightly different entries: the former will have a pooch URL, the latter will have path(s) to the file system
- both env vars and `argparse` are scary to new users
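A minimal sketch of how the optional `--config` argument could be wired up with `argparse` (the default-config location and help text are assumptions):

```python
import argparse
from pathlib import Path

# Assumed layout: a default config shipped next to the workflow script
DEFAULT_CONFIG_PATH = Path(__file__).parent / "default_config.json"


def parse_config_path() -> Path:
    """Parse the optional --config argument, defaulting to the packaged config."""
    parser = argparse.ArgumentParser(description="Run the brainreg workflow")
    parser.add_argument(
        "--config",
        type=Path,
        default=DEFAULT_CONFIG_PATH,
        help="path to a custom config JSON file",
    )
    return parser.parse_args().config
```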
> Both env vars and argparse are scary to new users

Are these that relevant here? This is code that 99% of users won't ever see.
Also worth bearing in mind that the large data workflows won't run on `ubuntu-latest` - they will run on our internal CI runner. This will be triggered slightly differently to the cloud-based workflows.
**Is your feature request related to a problem? Please describe.**

We'd like to have Python scripts that execute some key workflows discussed in the developer meeting on 14/09/2023. These are:

- bg-atlasapi
- brainglobe/brainglobe-workflows#13

We can farm out the writing of each of these scripts to a separate issue, once we've agreed on what requirements we have for them.
**Describe the solution you'd like**

I propose the requirements are (based on dev meeting discussion):

- The main purpose of these scripts is benchmarking and smoke-testing (on large data). It would be nice if they were re-usable in teaching materials/docs/tutorials.

Naively suggested structure:
- Keep the scripts in a `benchmark-workflows` folder in this repo.

```python
import os
from pathlib import Path


def example_workflow():
    # BENCHMARK_DATA points at the input data for the benchmark run
    input_data_path = Path(os.environ["BENCHMARK_DATA"])
    assert input_data_path.exists()

    input_data = read(input_data_path)
    preprocessed_data = process(input_data)  # if required
    results = run_main_tool(preprocessed_data)
    save(results)


if __name__ == "__main__":
    example_workflow()
```
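A benchmark run would then be invoked as something like `BENCHMARK_DATA=/path/to/data python example_workflow.py` (path and filename are placeholders).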