brainglobe / brainglobe-workflows

Workflows that utilise BrainGlobe tools to perform data analysis and visualisation.
BSD 3-Clause "New" or "Revised" License

[Feature] benchmark / smoke-test scripts: planning/structure #9

Open alessandrofelder opened 1 year ago

alessandrofelder commented 1 year ago

**Is your feature request related to a problem? Please describe.**
We'd like to have Python scripts that execute some key workflows discussed in the developer meeting on 14/09/2023. These are:

We can farm out the writing of each of these scripts to a separate issue, once we've agreed on what requirements we have for them.

**Describe the solution you'd like**
I propose the requirements are (based on dev meeting discussion):

The main purpose of these scripts is benchmarking and smoke-testing (on large data). It would be nice if they were also re-usable in teaching materials/docs/tutorials.

Naively suggested structure:

```python
import os
from pathlib import Path

# read/process/run_main_tool/save are placeholders for (groups of) API calls


def example_workflow():
    input_data_path = Path(os.environ["BENCHMARK_DATA"])
    assert input_data_path.exists()

    input_data = read(input_data_path)
    preprocessed_data = process(input_data)  # if required
    results = run_main_tool(preprocessed_data)
    save(results)


if __name__ == "__main__":
    example_workflow()
```


* maybe [`%load file.py` can be used to have `example_workflow` in a jupyter notebook](https://stackoverflow.com/questions/21034373/how-to-load-edit-run-save-text-files-py-into-an-ipython-notebook-cell)
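  For instance, the first cell of a tutorial notebook could be something like this (the file path below is hypothetical):

  ```python
  # In a Jupyter/IPython cell: %load replaces the cell contents with the
  # script's source, so the workflow can be read, edited, and run interactively
  %load workflows/example_workflow.py
  ```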

**Describe alternatives you've considered**
* Maybe we should have an entirely different `brainglobe-benchmarks` or `brainglobe-scripts` repo instead? (As much as I'd like to reduce the number of repos!) - we can still do this at a later point, I guess (with a bit more effort...)
alessandrofelder commented 1 year ago

@sfmig would the suggested structure allow you to easily benchmark the sub-steps (read, process, run_main_tool, save) as well as the entire example_workflow?

sfmig commented 1 year ago

thanks for putting this together @alessandrofelder! Some comments below.

would the suggested structure allow you to easily benchmark the sub-steps (read, process, run_main_tool, save) as well as the entire example_workflow?

Yes, I think that sounds like a good structure! Just to clarify: `read`, `process`, etc. would directly be API functions, right? We can have a suite of benchmarks that time each of these steps and also the full workflow.
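Very roughly, such a suite could look something like this with asv (all names here are illustrative, and it assumes the sub-step functions are importable):

```python
# Hypothetical asv benchmark suite: one time_* method per workflow sub-step,
# plus one for the full workflow. Module and function names are placeholders.
from example_workflow import (
    example_workflow,
    process,
    read,
    run_main_tool,
)

INPUT_PATH = "/path/to/benchmark/data"  # would come from BENCHMARK_DATA or a config


class TimeExampleWorkflow:
    def setup(self):
        # asv runs setup before each benchmark, so the timed methods
        # below measure only the step of interest
        self.input_data = read(INPUT_PATH)
        self.preprocessed_data = process(self.input_data)

    def time_read(self):
        read(INPUT_PATH)

    def time_process(self):
        process(self.input_data)

    def time_run_main_tool(self):
        run_main_tool(self.preprocessed_data)

    def time_full_workflow(self):
        example_workflow()
```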

Keep the scripts in a benchmark-workflows folder in this repo.

I'd suggest maybe just calling this folder workflows? Even if these scripts would be a guide for benchmarks, maybe we could keep this brainglobe-scripts repo more independent for now?

Maybe we should have an entirely different brainglobe-benchmarks or brainglobe-scripts repo instead?

I think a brainglobe-benchmarks repo would be a good idea actually (similarly to what they do in astropy). Even if that means re-doing some of the asv bits we already have... Right now the benchmarks we have follow the modules (we have a benchmark for imports and a benchmark for the tools module), whereas I think we would prefer to define benchmarks around workflows instead.

alessandrofelder commented 1 year ago

`read`, `process`, etc. would directly be API functions, right?

Hopefully, yes - but I don't know. They will at least be conceptual groupings of several API functions, and maybe also some generic functions like image filters from skimage. For example, the conceptual `process` might actually be applying some transformation, then thresholding, then converting the type of the image (or something like that); this will crystallise when we write the actual scripts, I guess?
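For illustration only, such a conceptual `process` might wrap something like the following (the specific skimage functions are placeholders, not a decided API):

```python
# Sketch of a conceptual `process` step: a transformation, a threshold,
# then an image type conversion. Function choices are illustrative only.
from skimage import filters, img_as_ubyte


def process(image):
    smoothed = filters.gaussian(image, sigma=2)  # some transformation
    binary = smoothed > filters.threshold_otsu(smoothed)  # thresholding
    return img_as_ubyte(binary)  # convert the image type
```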

I'd suggest maybe just calling this folder workflows? Even if these scripts would be a guide for benchmarks, maybe we could keep this brainglobe-scripts repo more independent for now?

Yeah, this naming suggestion was tied to the brainglobe-workflows repo name. What is the advantage of having a separate brainglobe-benchmarks repo compared to having a benchmarks subfolder here? (Happy to be convinced of this - just already worried about the number of repos!)

I think we would prefer to define benchmarks around workflows instead.

I think so too. Let's do that!

alessandrofelder commented 1 year ago

`read`, `process`, etc. would directly be API functions, right?

Actually, maybe, if they are not yet directly API functions, they should become API functions?

adamltyson commented 1 year ago

Actually, maybe, if they are not yet directly API functions, they should become API functions?

yes, ideally the process would be:

sfmig commented 12 months ago

Discussion of workflows general structure with @alessandrofelder

sfmig commented 12 months ago

Why an environment variable to point to the config file?

alessandrofelder commented 12 months ago

A small thought experiment, trying to predict roughly what the GH Actions YAML would look like depending on whether we make the config file an optional CLI argument or an environment variable - having an optional CLI argument makes the job slightly shorter and more explicit, I think.

This is one of our main use cases, so I am classing this as a pro for the optional CLI argument.

Env var version:

```yaml
# config file path passed via an environment variable
jobs:
  brainreg_large_data_job:
    if: is_local # !!!! not actual GH actions syntax !!!
    runs-on: ubuntu-latest
    strategy:
      matrix:
        brainreg-config:
          [
            /media/hard-drive/large-data-1/config.json,
            ~/large-data-2/config.json,
          ]
    env:
      BRAINREG_CONFIG_PATH: ${{ matrix.brainreg-config }}
    steps:
      - name: "Run brainreg workflow"
        run: python brainreg/brainreg-workflow.py
```

Optional CLI argument version:

```yaml
# config file path passed as an optional CLI argument
jobs:
  brainreg_large_data_job:
    if: is_local # !!!! not actual GH actions syntax !!!
    runs-on: ubuntu-latest
    strategy:
      matrix:
        brainreg-config:
          [
            /media/hard-drive/large-data-1/config.json,
            ~/large-data-2/config.json,
          ]
    steps:
      - name: "Run brainreg workflow"
        run: python brainreg/brainreg-workflow.py --config ${{ matrix.brainreg-config }}
```
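On the script side, the optional CLI argument could be a thin argparse wrapper along these lines (a sketch only; the argument name and the default-config fallback are assumptions, not a decided API):

```python
import argparse
from pathlib import Path


def parse_config_path():
    # Optional --config flag; when omitted, the workflow would fall back
    # to some packaged default config (that fallback is an assumption)
    parser = argparse.ArgumentParser(description="Run the brainreg workflow")
    parser.add_argument(
        "--config",
        type=Path,
        default=None,
        help="path to a config JSON file (default: packaged default config)",
    )
    args = parser.parse_args()
    return args.config
```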
alessandrofelder commented 12 months ago

Pros of an optional CLI argument vs. an env var:

Cons of an optional CLI argument vs. an env var:

Other:

adamltyson commented 12 months ago

confusing to have several different CLI entry points to same tool

Both env vars and argparse are scary to new users

Are these that relevant here? This is code that 99% of users won't ever see.

Also worth bearing in mind that the large data workflows won't run on ubuntu-latest they will run on our internal CI runner. This will be triggered slightly differently to the cloud-based workflows.