Based on some initial thoughts I've compiled the following with the time available:
I want to be able to create an evaluator like this:
evaluator = EnsembleEvaluator(environment, ensemble, event_hooks)
after which I would like to be able to do:
evaluator.evaluate(ensid=None) -> evalid
evaluator.evaluations -> [evalid]
stop(evalid) -> status_code
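As a minimal sketch, assuming purely hypothetical internals, such an interface could look like this in Python:

import uuid


class EnsembleEvaluator:
    def __init__(self, environment, ensemble, event_hooks):
        self._environment = environment
        self._ensemble = ensemble
        self._event_hooks = event_hooks
        self._evaluations = {}  # evalid -> status

    def evaluate(self, ensid=None):
        """Start an evaluation (of the whole ensemble, or of ensid) and return its evalid."""
        evalid = str(uuid.uuid4())
        self._evaluations[evalid] = "running"
        return evalid

    @property
    def evaluations(self):
        """All evalids started by this evaluator."""
        return list(self._evaluations)

    def stop(self, evalid):
        """Request that a running evaluation stops; return a status code."""
        self._evaluations[evalid] = "stopped"
        return 0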
In addition, I would like the creation data to be something of the form:
# environment
script_jobs:
-
name: sim_prep
exec: sim_prep.py
config:
sim_prep_mode: 42
active_good_stuff: True
scalars: [1, 2, 3]
-
name: simulator
exec: /path/to/my_sim.py
config:
fast: True
versions:
- 2019
- 2056
-
name: sim2csv
exec: sim2csv.py
steps:
-
name: simulate
type: script
content:
- sim_prep --do-fancy-stuff
- simulator stuff_to_simulate
- sim2csv summary-data.csv
queues:
-
name: lsf
scaling: infinite
-
name: local
scaling: 3
resources:
-
name: datafile
location: /path/to/datafile
(opaque data, loaded from files in first iteration)
# ensemble
forward model:
-
step: simulate
environment:
queue: lsf
container: res-komodo-stable
resources:
- datafile
responses:
- summary-data
[realization]:
variables (N-dim matrices with indices)
# event_hook
data_hook -> realization_response (N-dim matrices with indices)
status_hook -> waiting, running (progress), success, failure
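A minimal sketch, with signatures I am assuming rather than taking from the proposal, of what these two hooks could look like:

from enum import Enum


class Status(Enum):
    WAITING = "waiting"
    RUNNING = "running"  # carries progress information
    SUCCESS = "success"
    FAILURE = "failure"


def data_hook(realization, response):
    """Receives a realization response (an N-dim matrix with indices)."""
    print(f"realization {realization} produced data with shape {getattr(response, 'shape', None)}")


def status_hook(realization, status, progress=None):
    """Receives status transitions; progress is only meaningful while RUNNING."""
    print(f"realization {realization}: {status.value} ({progress})")


event_hooks = {"data": data_hook, "status": status_hook}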
I will argue that the Ensemble Evaluator API (EEAPI) should be a mini domain-specific language (mDSL) in Python. This is a fancy way of saying that the EEAPI will consist of a nice and elegant class library, as you'd expect, but also offer a higher-level scripting wrapper around that class library. In my examples, this mDSL is a glorified builder API, but the point is that the consumer does not consume the class library directly, which means the underlying implementation could expand or change dramatically without impacting the mDSL too much. An additional benefit is that we can capture intent, which may be lost with a class library, and which in turn enables more sophisticated validation. More on intent later.
An example:
# Pre-made constituents of the evaluator domain
onprem_hpc_executor = create_executor()
.prefer_hpc() # maps to a list of hpc hosts, defined by us
.only_onprem() # filters hpc list to only onprem (no cloud for you)
local_executor = create_executor()
local_storage = create_storage()
.connection_string("sqlite://{{project_dir}}/db.sqlite")
fmu_azure_blob_storage = create_storage()
.base_url("azure-blob://equinor/fmu/{{project_dir}}/")
.credentials(…)
We dogfood the mDSL and provide the user with ready-made constituents of the domain. This, of course, also enables the user/library author to create their own constituents. For example, if I only want Azure HPC:
azure_hpc_executor = create_executor()
.hosts(["azure10.equinor.com", "azure11.equinor.com"])
These executors are basically processes that run on some machine or other. They can and must be defined either on the evaluation as a whole, or on individual forward models (FMs). This reflects how the cloud works: we should not assume anything about the underlying metal.
Without domain functionality like .prefer_hpc() and .only_onprem(), some validation may go away, but we allow the user a lot of flexibility within the domain (i.e., we can't say that azure_hpc_executor is actually HPC without .prefer_hpc()). However, we could make a guess, and then do validation. So after some time, if we find that those user-defined hosts are strictly a subset of azure/hpc, we could deduce it. In the end, the implementation is insulated and allowed to vary, because the consumer isn't accessing the class library directly.
Add some more constituents:
# User then defines project specific resources, normally via config/GUI
spe1_resources = create_resource()
.path("opm-data/spe1")
.source(fmu_azure_blob_storage)
.credentials(…)
eclipse_forward_model = create_forward_model()
.is_eclipse_simulation() # -> EclipseForwardModel
.with_eclipse_version(ECLIPSE.v2013) # -> Eclipse2013ForwardModel
.with_resource(spe1_resources)
.with_executor(onprem_hpc_executor)
.with_retry_on_failure(max_retries=3)
    .on_success(lambda r: logging.info(r.status))
.sink(local_storage)
npv_forward_model = create_forward_model()
.depends(eclipse_forward_model)
    .executable("npv.py")
.sink(fmu_azure_blob_storage)
The npv_forward_model will only run after eclipse_forward_model has run. Implicitly, the NPV FM will run source(eclipse_forward_model) and the eclipse simulation will run sink(npv_forward_model). In other words, the eclipse data will end up both in local storage and in Azure, but in Azure its sole purpose is to provide the NPV FM with data.
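A toy sketch, with invented names, of how depends() could imply that source/sink wiring:

class ForwardModelBuilder:
    def __init__(self, name):
        self.name = name
        self._sources = []  # forward models/storages this FM reads from
        self._sinks = []    # storages/forward models this FM's output is delivered to

    def depends(self, other):
        # Depending on `other` implicitly makes it a source for this FM
        # and registers this FM as one of `other`'s sinks.
        self._sources.append(other)
        other._sinks.append(self)
        return self

    def sink(self, storage):
        self._sinks.append(storage)
        return self


eclipse = ForwardModelBuilder("eclipse").sink("local_storage")
npv = ForwardModelBuilder("npv").depends(eclipse).sink("fmu_azure_blob_storage")
print([getattr(s, "name", s) for s in eclipse._sinks])  # ['local_storage', 'npv']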
evaluator = create_evaluator() # EmptyEvaluator
.with_realizations(10)
.set_variables(…)
.set_parameters(…)
.add_forward_model(npv_forward_model) # other jobs are derived from the dep tree of the npv job
evaluator.run()
The most compelling argument for an mDSL is that there already is a domain language for the EE, built carefully by the FMU gang/ERT developers, but its constituents and functions are seldom defined explicitly in the code (e.g. the EE is itself not a class, but a set of C and Python codes), and are often ill-defined (e.g. the FORWARD_MODEL configuration keyword spans janitorial tasks such as creating folders, but also three weeks of Eclipse simulations). Explicitly defining all of these constituents and functions in a class library is a good start, but, as I mentioned, intent may not be captured.
Imagine that I run $ ert run_test_experiment. My ensemble is evaluated. I would like to add a modelling step or analysis step to my ensemble, e.g. massage and dump some data for creation of a tornado plot at the end of the evaluation.
$ ert evaluator add_forward_model tornado.yml --as-last-thing
I'd really like this to mean that only the tornado job is run if I re-run ert run_test_experiment. I'd like ERT to know what it can and cannot do, by capturing user intent. But this is based on extrapolating from my incomplete data on how users use ERT. I just know from my own experiences that this kind of analytical/highly speculative business has to be done iteratively. Changes to a large config do not really enable that iterative style. An mDSL might cater for tooling that does.
(As you see, I've not focused on config, because to me, a config is a contract between ERT and the user, and shouldn't really impact the API design of the evaluator.)
Contrary to @jondequinor, I have mainly focused on the configuration, but I agree on the point that this should not impact the API design. Anyway, I will post my idea for the configuration, so the ideas are not lost. (I have rewritten it to use the same "examples" as @markusdregi to make it easier to compare.)
queue_system:
default_driver: local
local:
max_submit: 50
lsf:
queue: mr
max_submit: 1
jobs:
-
name: sim_prep
task:
type: executable
executable: sim_prep.py
settings:
sim_prep_mode: 42
active_good_stuff: True
scalars: [1, 2, 3]
-
name: simulator
task:
type: executable
executable: /path/to/my_sim.py
settings:
fast: True
versions:
- 2019
- 2056
-
name: sim2csv
task:
type: executable
executable: sim2csv.py
steps:
-
name: simulate
type: script
content:
- sim_prep --do-fancy-stuff
- simulator stuff_to_simulate
- sim2csv summary-data.csv
input:
- name: data_file
type: file/binary
- name: magic_csv_file
type: file/csv
- name: poro_csv_file
type: file/csv
output:
- name: summary
type: csv
-
name: mock_evaluation
type: python_function
module: some.python.module
function: polynomial_mock
input: # Will be given as open streams to the function
- name: data_file
type: file/binary
- name: magic_csv_file
type: file/csv
output: # return open stream
- name: summary
type: csv
resources:
-
name: datafile
location: /path/to/datafile
-
name: my_csv_file
location: /path/to/some_csv_file
variables:
-
name: param_a
dimensions: [5, 6]
distribution:
type: triangular
params:
min: 4
max: 10
mode: 5
ensemble:
realizations: 150
forward model:
-
step: simulate
environment:
queue: lsf
container: res-komodo-stable
inputs: # Link expected inputs with resources/variables
-
input_name: data_file
resource: datafile
-
input_name: magic_csv_file
resource: my_csv_file
-
input_name: poro_csv_file
variable: param_a
type: csv
responses: # Link expected outputs with storage
-
name: summary-data
type: csv
source: summary-data.csv
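As a small, hypothetical illustration of how the linking above could be validated, a function could check every forward-model input against the declared resources and variables (I am assuming the forward model list is nested under the ensemble key):

import yaml


def validate_input_links(config):
    """Return error messages for input links that refer to undeclared resources or variables."""
    resources = {r["name"] for r in config.get("resources", [])}
    variables = {v["name"] for v in config.get("variables", [])}
    errors = []
    for fm in config["ensemble"]["forward model"]:
        for link in fm.get("inputs", []):
            if "resource" in link and link["resource"] not in resources:
                errors.append(f"{fm['step']}: unknown resource {link['resource']}")
            if "variable" in link and link["variable"] not in variables:
                errors.append(f"{fm['step']}: unknown variable {link['variable']}")
    return errors


# usage, assuming the configuration above is stored as evaluator.yml:
# with open("evaluator.yml") as f:
#     print(validate_input_links(yaml.safe_load(f)))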
I also made an attempt at an API, but it's far from a finished solution. (And after reading @jondequinor 's proposal I'm far from convinced that this is the correct approach)
from enum import Enum
class RealizationStatus(Enum):
UNINITIALIZED = 1
INITIALIZED = 2
EVALUATING = 3
EVALUATION_FAILURE = 4
EVALUATION_SUCCESS = 5
class EventType(Enum):
DATA = 1
STATUS = 2
class Step:
# Very unfinished, struggling with how to define input/output
def __init__(self):
pass
def run(self):
pass
class Ensemble:
def __init__(self, variables, resources, forward_model, num_realizations):
pass
# Point illustrated by this function is that an ensemble might
# have some realizations which are not executed yet,
# but will be if the ensemble is evaluated again
def realization_status(self):
return [] # Return list of RealizationStatus'
class Evaluator:
def __init__(self, runtime_environments):
pass
def evaluate(self, ensemble, event_hook, realizations=None):
pass
def my_hook(event):
if event.type == EventType.DATA:
# some data stuff
pass
elif event.type == EventType.STATUS:
# some status stuff
pass
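To make the intended flow concrete, a hypothetical usage of this sketch (the construction arguments are made up) could be:

ensemble = Ensemble(
    variables={"param_a": [1.0, 2.0]},
    resources=["/path/to/datafile"],
    forward_model=[Step()],
    num_realizations=150,
)
evaluator = Evaluator(runtime_environments=["lsf", "local"])
evaluator.evaluate(ensemble, my_hook)

# re-evaluate only realizations that have not yet succeeded
pending = [
    i for i, status in enumerate(ensemble.realization_status())
    if status is not RealizationStatus.EVALUATION_SUCCESS
]
evaluator.evaluate(ensemble, my_hook, realizations=pending)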
On the topic of input/output:
One thing that I struggle to address, both in the configuration and the API, is how to link expected input(s) of a step (we know that an Eclipse job needs a data file) with resources/variables/output from other jobs (we want to give some_model/HELLO.DATA as input to an Eclipse simulation), and how to link expected output(s) with outputs that should be used in e.g. the update step. This gets especially complicated if we are running more than one instance of a job, or a job has multiple inputs of the same type (a-b is not the same as b-a, so we need to link outputs from one job to the correct input "port" of the next job). It's difficult to make these relations explicit without growing the configuration file significantly.
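One possible direction, sketched here with invented names, is to make each link between an output port and an input port an explicit object instead of relying on matching file names:

from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    step: str  # the step that owns the port
    name: str  # the input or output name on that step


# explicit edges: output port -> input port, so a-b and b-a cannot be confused
links = [
    (Port("eclipse", "summary"), Port("npv", "simulation_summary")),
    (Port("eclipse", "summary"), Port("sim2csv", "summary")),
]


def inputs_for(step_name, links):
    """All (input name, producing port) pairs for a given step."""
    return [(dst.name, src) for src, dst in links if dst.step == step_name]


print(inputs_for("npv", links))
# [('simulation_summary', Port(step='eclipse', name='summary'))]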
@jondequinor, I like the flexibility and expressiveness an mDSL introduces!
Some questions:
I just know from my own experiences that this kind of analytical/highly speculative business has to be done iteratively. Changes to a large config do not really enable that iterative style. An mDSL might cater for tooling that does.
In this scenario, are the users still expected to interact with the config file, or directly with the mDSL? I think the latter will be hard to sell to the average user, but I might be wrong.
eclipse_forward_model = create_forward_model() .is_eclipse_simulation() # -> EclipseForwardModel .with_eclipse_version(ECLIPSE.v2013) # -> Eclipse2013ForwardModel
It's not entirely clear to me how this will integrate with our current (and future) plugin system. The creation of an Eclipse forward model would need to be encapsulated within the plugin, which I think (?) implicitly gives the same constraints as the class approach. (The assumption here is that the user interacts with the configuration file and not the mDSL.)
.with_resource(spe1_resources) .with_executor(onprem_hpc_executor) .with_retry_on_failure(max_retries=3) .on_success(lambda r: logging.INFO(r.status)) .sink(local_storage)
I think we need to refine how we do input/output to jobs, ref the "On the topic of input/output"-section of my suggestions. We need to be more explicit about what input we are linking to what output from the previous job, especially in the case of multiple inputs and outputs.
Imagine that I run $ ert run_test_experiment. My ensemble is evaluated. I would like to add a modelling step or analysis step to my ensemble, e.g. massage and dump some data for creation of a tornado plot at the end of the evaluation.
$ ert evaluator add_forward_model tornado.yml --as-last-thing
I'd really like this to mean that only the tornado job is run if I re-run ert run_test_experiment. I'd like ERT to know what it can and cannot do, by capturing user intent.
I think this touches a very difficult subject: What do we allow to change before an ensemble is considered too different to another ensemble, such that it's no longer an extension/addition of the original ensemble, but instead represents an edit? I like the idea of an immutable ensemble, and I'm not sure the best way to balance this with interactive development of your ensemble.
My thought here was that for everything that is created, a handle to that resource is returned. If we want to rerun, I am thinking we need to make a new ensemble, but this can probably be hidden from the user. I am in favor of an append-only solution, if that can work. So we would create a new ensemble based on the failed one, and be smart about not repeating steps that had already completed successfully.
def test_ensemble_api(environment, storage):
project = environment.create_project("my_project")
res1 = project.add_resource("file:///home/ole/eclipse/project1/file1")
res2 = project.add_resource("https://blobby-things/project1/file2")
job1 = project.create_job(
name="run_eclipse",
executable="/usr/bin/eclipse",
provided="true",
parameters=["param1", "param2"],
std_out_capture="eclipse_out3")
job2 = project.create_job(
name="sim2csv",
executable="sim2csv.py"
)
# each part of a step is defined separately
# inputs and outputs are identified, optionally with a separate name if file-name is long
part1 = project.create_step_part(
name="simulate",
cmd=["run_eclipse --in=input1 --params=param1,param2 --out eclipse/output/SOME_FILE1"],
inputs=["input1",
"input2",
"param1",
"param2"],
outputs=[("eclipse_out1", "eclipse/output/SOME_FILE1"),
("eclipse_out2", "eclipse/output/SOME_FILE2")],
)
    part2 = project.create_step_part(
        name="sim2csv1",
        cmd=["sim2csv eclipse_out1 out1"],
        inputs=["eclipse_out1"],
        outputs=["out1"],
    )
    part3 = project.create_step_part(
        name="sim2csv2",
        cmd=["sim2csv eclipse_out2 out2"],
        inputs=["eclipse_out2"],
        outputs=["out2"],
    )
# a step is made from several parts
step = project.create_step([part1, part2, part3])
# when making an ensemble, the resources that are the same across all realizations are given
ensemble = project.create_ensemble(
realizations=2,
jobs=[job1, job2],
steps=[step],
common_resources=[(res1, "input1"), (res2, "input2")])
param1_1 = storage.get_parameter_ref("myproject","param1", 0)
param1_2 = storage.get_parameter_ref("myproject", "param2", 0)
param2_1 = storage.get_parameter_ref("myproject","param1", 1)
param2_2 = storage.get_parameter_ref("myproject", "param2", 1)
# parameters are specific to each realization, so they are added separately
ensemble.add_node_resource(realization=0, ref=param1_1, name="param1")
ensemble.add_node_resource(realization=0, ref=param1_2, name="param2")
ensemble.add_node_resource(realization=1, ref=param2_1, name="param1")
ensemble.add_node_resource(realization=1, ref=param2_2, name="param2")
count_start = 0
count_finished = 0
    def callback(event):
        nonlocal count_start, count_finished
        if event.job_started:
            count_start += 1
        if event.job_finished:
            count_finished += 1
ensid = ensemble.evaluate(callback)
ensemble.wait()
assert count_start == 2
assert count_finished == 2
# getting a resource out as a handle, that can be sent to other parts of the application, like storage, assuming it can consume it
    out_ref1 = ensemble.get_node_resource_ref(realization=0, name="out1")
    out_ref2 = ensemble.get_node_resource_ref(realization=0, name="out2")
    storage.store_result("myproject", "result1", out_ref1, realization=0)
    storage.store_result("myproject", "result2", out_ref2, realization=0)
An experiment in defining a step with a special-purpose DSL to make it more concise. The idea is that each part of a step declares which inputs it consumes and which outputs it produces. Inputs are in [] and outputs in {}. If an input or output is used directly in the command, it appears in the position of its use. If it is not explicit in the command, it is declared after a %.
steps:
-
type: script
definition:
name: simulate
contents:
- run_eclipse --in=[input1] --params=[param1],[param2] --out {eclipse_out1} % [input2] {eclipse_out2}
- sim2csv [eclipse_out1] {out1:csv}
- sim2csv [eclipse_out2] {out2:csv}
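A small sketch of how such a command line could be parsed under my reading of the syntax (inputs in [], outputs in {}, extra declarations after %):

import re


def parse_part(command):
    """Split a part into its command, its declared inputs and its declared outputs."""
    cmd, _, _extra = command.partition("%")
    inputs = re.findall(r"\[([^\]]+)\]", command)
    outputs = [out.split(":")[0] for out in re.findall(r"\{([^}]+)\}", command)]
    return cmd.strip(), inputs, outputs


line = "run_eclipse --in=[input1] --params=[param1],[param2] --out {eclipse_out1} % [input2] {eclipse_out2}"
print(parse_part(line))
# ('run_eclipse --in=[input1] --params=[param1],[param2] --out {eclipse_out1}',
#  ['input1', 'param1', 'param2', 'input2'],
#  ['eclipse_out1', 'eclipse_out2'])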
In this scenario, are the users still expected to interact with the config file, or directly with the mDSL? I think the latter will be hard to sell to the average user, but I might be wrong.
No, the users will probably not write these. However, whether the business could write them is one thing; whether they can read them is quite another. To quote Martin Fowler:
I do think that the greatest potential benefit of DSLs comes when business people participate directly in the writing of the DSL code. The sweet spot, however is in making DSLs business-readable rather than business-writeable. If business people are able to look at the DSL code and understand it, then we can build a deep and rich communication channel between software development and the underlying domain.
[code instantiating things that come from a plugin]
It's not entirely clear to me how this will integrate with our current (and future) plugin system. The creation of a eclipse forward model would need to be encapsulated within the plugin, which I think (?) implicitly gives the same constraints as the Class approach. (The assumption here is that the user interacts with the configuration file and not the mDSL)
Good point. The EclipseForwardModel would not be the basis for this part of the DSL; it would have to be more abstract.
[code that hand wavy defines a sink]
I think we need to refine how we do input/output to jobs, ref the "On the topic of input/output"-section of my suggestions. We need to be more explicit about what input we are linking to what output from the previous job, especially in the case of multiple inputs and outputs.
I totally agree. This isn't a new problem. I've done some research, and I haven't yet found good ways to organize this. The only way forward that I can think of is to at least accept that it is truly complex and try to find a way. The current situation is, of course, intolerable, as it is implicit coordination (job a produces x, job b expects x to exist; other than that, both jobs are completely disjoint in every way).
So; yes, let's continue talking about this. There is no good solution to this problem in any of my comments thus far.
[imagine mutable, "smart" ensembles]
I think this touches a very difficult subject: What do we allow to change before an ensemble is considered too different to another ensemble, such that it's no longer an extension/addition of the original ensemble, but instead represents an edit? I like the idea of an immutable ensemble, and I'm not sure the best way to balance this with interactive development of your ensemble.
I concede that this is a bad idea.
However, I still think we should adopt the mindset that changes over time (within a domain like this) are best modelled as transactional logs.
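As a rough illustration of that mindset, with invented names, changes to an ensemble could be recorded as an append-only log and the remaining work derived by replaying it:

from dataclasses import dataclass, field


@dataclass(frozen=True)
class AddForwardModel:
    name: str


@dataclass(frozen=True)
class EvaluationSucceeded:
    forward_model: str


@dataclass
class EnsembleLog:
    events: list = field(default_factory=list)

    def append(self, event):
        self.events.append(event)

    def pending_forward_models(self):
        """Replay the log: forward models added but not yet successfully evaluated."""
        added = [e.name for e in self.events if isinstance(e, AddForwardModel)]
        done = {e.forward_model for e in self.events if isinstance(e, EvaluationSucceeded)}
        return [name for name in added if name not in done]


log = EnsembleLog()
log.append(AddForwardModel("eclipse"))
log.append(AddForwardModel("tornado"))
log.append(EvaluationSucceeded("eclipse"))
print(log.pending_forward_models())  # ['tornado'] -- only the newly added work remains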
I think this discussion is very good! I'll try to persist some reflections I've had:
Separation of construction and consumption
I think it is a good idea to clearly separate the responsibility of constructing and consuming an EnsembleEvaluator, and furthermore to put the responsibility of defining the runtime environment, the forward model and the input data entirely on the one constructing it. Hence, the responsibility of the consumer is only to initiate evaluations (for parts of or the entire ensemble), monitor evaluations, and respond to the status as well as the results of the evaluations. This clearly separates the responsibility of defining what is to be evaluated from that of running evaluations and gathering the results.
Note that the above approach has the benefit of making the consumption API the one that is widely used, while the construction DSL will be rather contained. This is quite beneficial in my opinion, as we then isolate the difficult part that will need many iterations, while widely spreading something that we might get mostly right from the start 🤷
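A minimal sketch, with invented names, of that split: one class that owns construction, and one that the consumer uses only to run, monitor and collect evaluations:

class EvaluatorBuilder:
    """Construction side: defines the runtime environment, forward model and input data."""

    def __init__(self):
        self._environment = None
        self._forward_models = []
        self._input_data = {}

    def with_environment(self, environment):
        self._environment = environment
        return self

    def add_forward_model(self, forward_model):
        self._forward_models.append(forward_model)
        return self

    def with_input_data(self, **data):
        self._input_data.update(data)
        return self

    def build(self):
        return EvaluatorConsumer(self._environment, self._forward_models, self._input_data)


class EvaluatorConsumer:
    """Consumption side: can only initiate, monitor and gather results of evaluations."""

    def __init__(self, environment, forward_models, input_data):
        self._environment = environment
        self._forward_models = forward_models
        self._input_data = input_data

    def evaluate(self, realizations=None, on_event=None):
        ...  # evaluate parts of or the entire ensemble

    def status(self, evalid):
        ...  # monitor a running evaluation

    def results(self, evalid):
        ...  # gather the results of a finished evaluation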
Input and output
I think the responsibility of defining input and output should reside with the various steps. Furthermore, output can, depending on the step type, be defined either within the step (by utilising the reporter) or by the configuration of the step itself (please pick up this file after I'm done). The motivation for defining input is to allow for parallel execution and up-front validation of your data pipeline. If input is not defined, it will be assumed that all output generated so far is input, and hence only sequential execution is possible.
On the note of input and output, I think that part of the workflow manager's responsibility is to facilitate the data flow. Hence, although a step might produce a, the next step might expect that data as b. I think we need a data manipulation step type that allows us to describe thin layers of glue that make the pipeline run. It should not be so that a result name from one step leaks into another step as input.
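A tiny sketch, with hypothetical names, of such a data manipulation step; here it only renames a result from one step before the next step consumes it:

def rename_step(mapping):
    """A 'data manipulation' glue step: renames results before they become inputs."""
    def run(results):
        # results: dict of result name -> data produced by the previous step
        return {mapping.get(name, name): data for name, data in results.items()}
    return run


# the previous step produced "a", but the next step expects it as "b"
glue = rename_step({"a": "b"})
print(glue({"a": [1, 2, 3]}))  # {'b': [1, 2, 3]}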
Workshop has ended, and the discussion is summarized here: https://github.com/equinor/ert/issues/1032
Create pseudo code/playground for how we want to use the Ensemble Evaluator API