alessandrofelder opened this issue 1 year ago
@sfmig would the suggested structure allow you to easily benchmark the sub-steps (`read`, `process`, `run_main_tool`, `save`) as well as the entire `example_workflow`?
Thanks for putting this together @alessandrofelder! Some comments below.
> would the suggested structure allow you to easily benchmark the sub-steps (`read`, `process`, `run_main_tool`, `save`) as well as the entire `example_workflow`?
Yes, I think that sounds like a good structure! We can have a suite of benchmarks that time each of these steps and also the full workflow. Just to clarify, `read`, `process`, etc. would directly be API functions, right?
> Keep the scripts in a `benchmark-workflows` folder in this repo.
I'd suggest maybe just calling this folder `workflows`? Even if these scripts would be a guide for benchmarks, maybe we could keep this `brainglobe-scripts` repo more independent for now?
> Maybe we should have an entirely different brainglobe-benchmarks or brainglobe-scripts repo instead?
I think a `brainglobe-benchmarks` repo would be a good idea actually (similar to what they do in astropy), even if that means re-doing some of the asv bits we already have... Right now the benchmarks we have follow the modules (we have a benchmark for imports and a benchmark for the `tools` module), whereas I think we would prefer to define benchmarks around workflows instead.
> `read`, `process`, etc. would directly be API functions, right?
Hopefully, yes - but I don't know. They will at least be conceptual groupings of several API functions, and maybe also some generic functions like image filters from `skimage` or so. The conceptual `process`, for example, would actually be applying some transformation, then thresholding, then converting the type of the image (or something like that) - this will crystallise when we write the actual scripts, I guess?
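For illustration, such a conceptual `process` step might group generic operations like this (a minimal sketch; the specific transforms and parameters are placeholder assumptions until the actual scripts are written):

```python
import numpy as np
from skimage import filters, transform, util


def process(image: np.ndarray) -> np.ndarray:
    """Hypothetical 'process' step: a conceptual grouping of generic
    image operations rather than a single API function."""
    # apply some transformation (downscaling is an arbitrary example)
    transformed = transform.rescale(image, 0.5, anti_aliasing=True)
    # threshold (Otsu is an arbitrary choice)
    binary = transformed > filters.threshold_otsu(transformed)
    # convert the image type (boolean mask -> 8-bit)
    return util.img_as_ubyte(binary)
```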
> I'd suggest maybe just calling this folder `workflows`? Even if these scripts would be a guide for benchmarks, maybe we could keep this `brainglobe-scripts` repo more independent for now?
Yea, this naming suggestion was tied to `brainglobe-workflows`. What is the advantage of having a separate `brainglobe-benchmarks` repo compared to having a `benchmarks` subfolder here? (Happy to be convinced of this - just already worried about the number of repos!)
> I think we would prefer to define benchmarks around workflows instead.
I think so too. Let's do that!
> `read`, `process`, etc. would directly be API functions, right?
Actually, maybe, if they are not yet directly API functions, they should become API functions?
> Actually, maybe, if they are not yet directly API functions, they should become API functions?
Yes, ideally the process would be:

- Discussion of workflows' general structure with @alessandrofelder
> Why an environment variable to point to the config file?
Small thought experiment, trying to predict roughly what the GH Actions YAML would look like depending on whether we make the config file an optional CLI argument or an environment variable - having an optional CLI argument makes the job slightly shorter and more explicit, I think. This is one of our main use cases, so I am classing this as a pro for the optional CLI argument.
With an environment variable:

```yaml
jobs:
  if: is_local # !!!! not actual GH actions syntax !!!
  example_matrix:
    strategy:
      matrix:
        brainreg-config:
          [
            /media/hard-drive/large-data-1/config.json,
            ~/large-data-2/config.json
          ]
  brainreg_large_data_job:
    runs-on: ubuntu-latest
    env:
      BRAINREG_CONFIG_PATH: {{ $brainreg-config }}
    steps:
      - name: "Run brainreg workflow"
        run: python brainreg/brainreg-workflow.py
```
With an optional CLI argument:

```yaml
jobs:
  if: is_local
  example_matrix:
    strategy:
      matrix:
        brainreg-config:
          [
            /media/hard-drive/large-data-1/config.json,
            ~/large-data-2/config.json
          ]
  brainreg_large_data_job:
    runs-on: ubuntu-latest
    steps:
      - name: "Run brainreg workflow"
        run: python brainreg/brainreg-workflow.py --config {{ $brainreg-config }}
```
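For comparison of the script side: under the env-var variant, `brainreg-workflow.py` could resolve the config path like this (a sketch; `BRAINREG_CONFIG_PATH` matches the YAML above, while the packaged-default fallback is an assumption):

```python
import os
from pathlib import Path

# Assumed layout: a default config shipped next to the workflow script
DEFAULT_CONFIG_PATH = Path(__file__).parent / "default_config.json"


def config_path_from_env() -> Path:
    """Use the config the CI job points at, else fall back to the default."""
    return Path(os.environ.get("BRAINREG_CONFIG_PATH", DEFAULT_CONFIG_PATH))
```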
Pros of the optional CLI argument vs env var:

- the GH Actions job is slightly shorter and more explicit (see above)
- running locally is explicit, e.g. `$ python brainreg-workflow.py --config /media/local_brainreg_data/config.json` (see the sketch after this list)

Cons of the optional CLI argument vs env var:

- confusing to have several different CLI entry points to the same tool

Other:

- `default_config.json` and `custom_config.json` will have slightly different entries: the former will have a pooch URL, the latter will have path(s) to the file system
- both env vars and `argparse` are scary to new users
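A minimal sketch of how the optional `--config` argument could be wired up with `argparse` (the default-config location and help text are assumptions):

```python
import argparse
from pathlib import Path

# Assumed layout: a default config shipped next to the workflow script
DEFAULT_CONFIG_PATH = Path(__file__).parent / "default_config.json"


def parse_config_path() -> Path:
    """Parse the optional --config argument, defaulting to the packaged config."""
    parser = argparse.ArgumentParser(description="Run the brainreg workflow")
    parser.add_argument(
        "--config",
        type=Path,
        default=DEFAULT_CONFIG_PATH,
        help="path to a custom config JSON file",
    )
    return parser.parse_args().config
```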
> Both env vars and argparse are scary to new users

Are these that relevant here? This is code that 99% of users won't ever see.
Also worth bearing in mind that the large data workflows won't run on `ubuntu-latest` - they will run on our internal CI runner. This will be triggered slightly differently to the cloud-based workflows.
**Is your feature request related to a problem? Please describe.**

We'd like to have Python scripts that execute some key workflows discussed in the developer meeting on 14/09/2023. These are:

- bg-atlasapi
- brainglobe/brainglobe-workflows#13

We can farm out the writing of each of these scripts to a separate issue, once we've agreed on what requirements we have for them.
**Describe the solution you'd like**

I propose the requirements are (based on dev meeting discussion):

- The main purpose of these scripts is benchmarking and smoke-testing (on large data). It would be nice if they were re-usable in teaching materials/docs/tutorials.

Naively suggested structure:
- Keep the scripts in a `benchmark-workflows` folder in this repo.

```python
import os
from pathlib import Path


def example_workflow():
    # BENCHMARK_DATA points at the input data for the benchmark run
    input_data_path = Path(os.environ["BENCHMARK_DATA"])
    assert input_data_path.exists()

    input_data = read(input_data_path)
    preprocessed_data = process(input_data)  # if required
    results = run_main_tool(preprocessed_data)
    save(results)


if __name__ == "__main__":
    example_workflow()
```
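A benchmark run would then be invoked as something like `BENCHMARK_DATA=/path/to/data python example_workflow.py` (path and filename are placeholders).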