(SE) `task_question.yaml`: a design issue

GEOS-ESM / swell

Workflow system for coupled data assimilation applications

https://geos-esm.github.io/swell/

Apache License 2.0

15 stars 4 forks source link

(SE) `task_question.yaml`: a design issue #314

Open rtodling opened 6 months ago

rtodling commented 6 months ago

I am finding more and more that a single task_question file to handle all question that can ever be asked of swell for all possible tasks is a true bottle neck.

1) many questions for a particular task have nothing to do w/ another. Keep them in the same file causes confusion

2) the way things are also prevents generality; for example: the LETKF uses a cubed ensemble; the Variational analysis uses a lat-lon ensemble; the GetEnsemble task refers to a variable path_to_ensemble which means only one default can be set right - preventing us from being able to set defaults for two completely independent suites LETKF and Var. Sure there are ways around it be they become entwined w/ more variables needing to be defined w/o need.

3) default npc and npy can only be given as general settings - equal to all applications - but that will not be the case.

There're other examples of the inconvenience of this single all-purpose file. It would be nice to rethink this.

Dooruk commented 2 months ago

I will add some thoughts/feedback here somewhat regarding this issue.

In #377, I'm adding a new configuration key cycling_varbc, which would impact the 3dvar_atmos and 3dfgat_atmos suites' flow.cylc file. So my first thought was to add this into suite_questions.yaml, but then I wouldn't be able to access cycling_varbc inside a task, get_observations.py in this case. Hence, I added it to task_question.yaml instead and not in suite_questions.yaml. So I wasn't sure if the best approach is to have them in both places or just keep it in the task_questions.yaml. Moving forward with the geos_marine, which handles multiple models, I will include a turn_on_icemodel switch and will encounter the same issue.

To be able to use the templated version{{cycling_varbc}} in flow.cylc, I had to add the exception part in create_experiment.py where it takes a model level experiment.yaml dictionary setting, and copies it to the first level, along with cycle_times experiment_id etc. The same approach was taken for the ensemble_* configuration keys.

I overall like the task_questions.yaml structure, makes it possible to track what config keys are used for which tasks and simplifies questionary creation, so this is more of a feedback.

ashiklom commented 2 months ago

A major caveat to any significant changes here is that task_questions.yaml is tightly wired into the existing swell machinery at a low level, and changing this design would affect swell in some pretty fundamental ways. I think anything we do here would need broad consensus, significant testing, and likely a bit of parallel/forked development (e.g., try to implement a new thing twice -- once using the existing approach, and once using a revised architecture, and then comparing the solutions).

That said, I agree with @rtodling that I'm not a huge fan of the big task_questions.yaml design, and if we decide to change it (which I think we should), the earlier we make the switch, the better to minimize accumulation of further technical debt.

I think a better approach is to stop storing internal data in YAML and instead store data in Python classes (especially, taking advantage of Python's dataclass or maybe even an external library like Pydantic). The key advantages here are that:

(1) With a bit of effort, we can use Python's static type checking or the libraries themselves to enforce data validity, without having to write our own assertions.

(2) We can implement more sophisticated logic into default values (e.g., default values can be set dynamically based on other values);

(3) We can rely on inheritance to reduce the amount of information that each class has to store (and, more importantly, the number of places where a given piece of information would have to be updated).

Right now, these are just musings. I'm going to try to do a personal deep dive into how swell uses this information and do some experimentation with alternative design patterns. As I have more concrete ideas and design sketches, I'll post them here (and I would encourage others to post their ideas as well!).

ashiklom commented 2 months ago

Another quick thought (while it's fresh): The current design of the task_questions.yaml and suite_questions.yaml is fundamentally around questions, with each question having a list of model and suite assignments.

E.g., from suite_questions.yaml:

cycle_times:
  ask_question: True
  default_value: defer_to_model
  options: defer_to_model
  models:
  - all
  suites:
  - 3dfgat_atmos
  - 3dvar_atmos
  - 3dvar_cycle
  - 3dvar
  - convert_ncdiags
  - forecast_geos
  - hofx
  - localensembleda
  - ufo_testing
  prompt: Enter the cycle times for this model.
  type: string-check-list

I wonder if a better general design pattern is to somehow organize this around suites or tasks, such that for a given suite, you had a complete list of the corresponding questions. Pseudocode:

from dataclasses import dataclass

# General class for a SwellQuestion
@dataclass
class SwellQuestion():
    dtype: str
    default_value: str
    options: str
    prompt: str

    # Add some common question methods here.

# Example of declaring a full SWELL question
cycle_times = SwellQuestion(
    dtype = "string-check-list",
    default_value = "defer_to_model",
    options = "defer_to_model",
    prompt = "Enter the cycle times for this model"
)

# Keep declaring other questions
start_cycle_point = SwellQuestion(...)
final_cycle_point = SwellQuestion(...)
runahead_limit = SwellQuestion(...)
model_components = SwellQuestion(...)

##############################

# Now, we move to defining suites, also as a class that defines, among other 
# things, a list of questions associated with that suite.

@dataclass
class SwellSuite():
    questions: list
    # ...some other configs?

s3dfgat_atmos = SwellSuite(questions = [
    cycle_times(),           # Default behavior
    start_cycle_point(...),  # Modify the behavior somewhat
    final_cycle_point(...),  # Modify the behavior somewhat
    runahead_limit,          # Default behavior
    model_components         # Default behavior
])

This needs a lot more work, but the more I look at it and think about it, the more I like this overall design. I'll try to post a more complete example here as I come up with it; that will hopefully be easier to comment on, but maybe even this initial example could provide some inspiration.

Dooruk commented 2 months ago

That said, I agree with @rtodling that I'm not a huge fan of the big task_questions.yaml design, and if we decide to change it (which I think we should), the earlier we make the switch, the better to minimize accumulation of further technical debt.

I agree with the notion that if we are to make a major design philosophy change it's better to do it sooner. This would impact our CI workflow strategy too. We should definitely have a broader discussion with more GMAO developers/group leads. That way, people would feel involved and invest from the (second) beginning which would inherently help during the transition/adoption phase.

(1) With a bit of effort, we can use Python's static type checking or the libraries themselves to enforce data validity, without having to write our own assertions. (2) We can implement more sophisticated logic into default values (e.g., default values can be set dynamically based on other values);

These two sound very appealing to me. As @rtodling mentioned, static default has been a hassle. However, I would like to stick with a built-in functionality before we commit to another external library unless it benefit us quite a bit.

(3) We can rely on inheritance to reduce the amount of information that each class has to store (and, more importantly, the number of places where a given piece of information would have to be updated).

We were using inheritence earlier during the development stage but this kind of got out of control in terms of tracking what class is coming from what part of the code. Perhaps with a different design strategy that could be prevented?

With the current design, flow.cylc dictates which tasks/scripts are used and hence which questions are asked depending on the task dependecies in the task_questions.yaml (on top of the suite questions).