Open rtodling opened 6 months ago
I will add some thoughts/feedback here somewhat regarding this issue.
In #377, I'm adding a new configuration key cycling_varbc
, which would impact the 3dvar_atmos
and 3dfgat_atmos
suites' flow.cylc
file. So my first thought was to add this into suite_questions.yaml
, but then I wouldn't be able to access cycling_varbc
inside a task, get_observations.py
in this case. Hence, I added it to task_question.yaml
instead and not in suite_questions.yaml
. So I wasn't sure if the best approach is to have them in both places or just keep it in the task_questions.yaml
. Moving forward with the geos_marine
, which handles multiple models, I will include a turn_on_icemodel
switch and will encounter the same issue.
To be able to use the templated version{{cycling_varbc}}
in flow.cylc
, I had to add the exception part in create_experiment.py
where it takes a model level experiment.yaml
dictionary setting, and copies it to the first level, along with cycle_times
experiment_id
etc. The same approach was taken for the ensemble_*
configuration keys.
I overall like the task_questions.yaml
structure, makes it possible to track what config keys are used for which tasks and simplifies questionary creation, so this is more of a feedback.
A major caveat to any significant changes here is that task_questions.yaml
is tightly wired into the existing swell machinery at a low level, and changing this design would affect swell in some pretty fundamental ways. I think anything we do here would need broad consensus, significant testing, and likely a bit of parallel/forked development (e.g., try to implement a new thing twice -- once using the existing approach, and once using a revised architecture, and then comparing the solutions).
That said, I agree with @rtodling that I'm not a huge fan of the big task_questions.yaml
design, and if we decide to change it (which I think we should), the earlier we make the switch, the better to minimize accumulation of further technical debt.
I think a better approach is to stop storing internal data in YAML and instead store data in Python classes (especially, taking advantage of Python's dataclass
or maybe even an external library like Pydantic). The key advantages here are that:
(1) With a bit of effort, we can use Python's static type checking or the libraries themselves to enforce data validity, without having to write our own assertions.
(2) We can implement more sophisticated logic into default values (e.g., default values can be set dynamically based on other values);
(3) We can rely on inheritance to reduce the amount of information that each class has to store (and, more importantly, the number of places where a given piece of information would have to be updated).
Right now, these are just musings. I'm going to try to do a personal deep dive into how swell uses this information and do some experimentation with alternative design patterns. As I have more concrete ideas and design sketches, I'll post them here (and I would encourage others to post their ideas as well!).
Another quick thought (while it's fresh): The current design of the task_questions.yaml
and suite_questions.yaml
is fundamentally around questions, with each question having a list of model and suite assignments.
E.g., from suite_questions.yaml
:
cycle_times:
ask_question: True
default_value: defer_to_model
options: defer_to_model
models:
- all
suites:
- 3dfgat_atmos
- 3dvar_atmos
- 3dvar_cycle
- 3dvar
- convert_ncdiags
- forecast_geos
- hofx
- localensembleda
- ufo_testing
prompt: Enter the cycle times for this model.
type: string-check-list
I wonder if a better general design pattern is to somehow organize this around suites or tasks, such that for a given suite, you had a complete list of the corresponding questions. Pseudocode:
from dataclasses import dataclass
# General class for a SwellQuestion
@dataclass
class SwellQuestion():
dtype: str
default_value: str
options: str
prompt: str
# Add some common question methods here.
# Example of declaring a full SWELL question
cycle_times = SwellQuestion(
dtype = "string-check-list",
default_value = "defer_to_model",
options = "defer_to_model",
prompt = "Enter the cycle times for this model"
)
# Keep declaring other questions
start_cycle_point = SwellQuestion(...)
final_cycle_point = SwellQuestion(...)
runahead_limit = SwellQuestion(...)
model_components = SwellQuestion(...)
##############################
# Now, we move to defining suites, also as a class that defines, among other
# things, a list of questions associated with that suite.
@dataclass
class SwellSuite():
questions: list
# ...some other configs?
s3dfgat_atmos = SwellSuite(questions = [
cycle_times(), # Default behavior
start_cycle_point(...), # Modify the behavior somewhat
final_cycle_point(...), # Modify the behavior somewhat
runahead_limit, # Default behavior
model_components # Default behavior
])
This needs a lot more work, but the more I look at it and think about it, the more I like this overall design. I'll try to post a more complete example here as I come up with it; that will hopefully be easier to comment on, but maybe even this initial example could provide some inspiration.
That said, I agree with @rtodling that I'm not a huge fan of the big
task_questions.yaml
design, and if we decide to change it (which I think we should), the earlier we make the switch, the better to minimize accumulation of further technical debt.
I agree with the notion that if we are to make a major design philosophy change it's better to do it sooner. This would impact our CI workflow strategy too. We should definitely have a broader discussion with more GMAO developers/group leads. That way, people would feel involved and invest from the (second) beginning which would inherently help during the transition/adoption phase.
(1) With a bit of effort, we can use Python's static type checking or the libraries themselves to enforce data validity, without having to write our own assertions. (2) We can implement more sophisticated logic into default values (e.g., default values can be set dynamically based on other values);
These two sound very appealing to me. As @rtodling mentioned, static default has been a hassle. However, I would like to stick with a built-in functionality before we commit to another external library unless it benefit us quite a bit.
(3) We can rely on inheritance to reduce the amount of information that each class has to store (and, more importantly, the number of places where a given piece of information would have to be updated).
We were using inheritence earlier during the development stage but this kind of got out of control in terms of tracking what class is coming from what part of the code. Perhaps with a different design strategy that could be prevented?
With the current design, flow.cylc
dictates which tasks/scripts are used and hence which questions are asked depending on the task dependecies in the task_questions.yaml
(on top of the suite questions).
I am finding more and more that a single task_question file to handle all question that can ever be asked of swell for all possible tasks is a true bottle neck.
1) many questions for a particular task have nothing to do w/ another. Keep them in the same file causes confusion
2) the way things are also prevents generality; for example: the LETKF uses a cubed ensemble; the Variational analysis uses a lat-lon ensemble; the GetEnsemble task refers to a variable path_to_ensemble which means only one default can be set right - preventing us from being able to set defaults for two completely independent suites LETKF and Var. Sure there are ways around it be they become entwined w/ more variables needing to be defined w/o need.
3) default npc and npy can only be given as general settings - equal to all applications - but that will not be the case.
There're other examples of the inconvenience of this single all-purpose file. It would be nice to rethink this.