UCL-CCS / EasyVVUQ

Python 3 framework to facilitate verification, validation and uncertainty quantification (VVUQ) for a wide variety of simulations.
https://easyvvuq.readthedocs.io/
GNU Lesser General Public License v3.0

Should the campaign object be reimagined as a (sort of) finite state machine? #63

Closed · raar1 closed this 5 years ago

raar1 commented 5 years ago

Currently, a sampling element, for example, is created in the user's Python script and then dumps its generated runs into the campaign object. In this case the campaign is just a repository for runs, with extra logging capabilities.

This approach is problematic for cases with a very large number of samples (such as Jalal's case, with > 10^6 runs). In such cases we only want to generate runs, say 100 at a time, so they can be encoded, executed and decoded, and the output added to the final dataframe.

Adding 100 jobs at a time to the campaign, and then running those, seems clumsy, especially since, if we want to stop the script part-way through, we lose the state that the sampling element had reached and so cannot restart.

However, if the campaign object were more like a (sort of) finite state machine, then this would be possible. It might look more like:

my_campaign.set_encoder(AppEncoder())
my_campaign.set_decoder(AppDecoder())
my_campaign.set_execution_fn(execute)
my_campaign.set_sampler(PCESampler(order=2, ...))
my_campaign.set_analyser(PCEAnalyser())

while my_campaign.more_runs():
    my_campaign.run(100)
my_campaign.analyse()

my_campaign.set_sampler(OtherSampler(blah))
my_campaign.set_analyser(OtherAnalyser())
my_campaign.run_all()
my_campaign.analyse()

So the Campaign object is now always in a particular state (either sampling or analysing), and it runs the elements within itself, rather than those elements being external objects which act on the Campaign.

If we enforce that every EasyVVUQ element must have a "serialize" function implemented, this makes it easier to store the whole campaign object's state at once. Note that now it would also be storing the states of every element working within it.

It would still also perform logging duties etc.

This shape fits much more closely with something the PJM might work with, and could make Vytas' database simpler. I think it also makes the Python script written by the user look a lot simpler.

dww100 commented 5 years ago

In principle I like this idea a lot.

What does the new database look like?

raar1 commented 5 years ago

My idea is that the database will have a new, single slot for storing the element that is currently being applied.

So when you call my_campaign.set_element(PCESampler()), Campaign.current_element is set to the newly created PCESampler instance. And since all elements will be forced to implement a .serialize() function (or equivalent), whenever the campaign saves its state to the database, the exact current state of the element is saved as well.

This means Campaign databases can act as effective "restart" files for the VVUQ workflow. If the sampler only got through 53 runs out of a total of 2700 runs, then it will be reinitialised in whatever manner allows it to continue from that point. I expect Analysis elements to have their states similarly stored and loaded from the database.

Campaign objects only have one element slot because only one element can be active at any one time.
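
To make that concrete, here is a minimal sketch of the single element slot and the save/restore behaviour (all names are placeholders, not a settled API, and a JSON file stands in for the real database):

import json

# Minimal sketch only: a JSON file stands in for the campaign database,
# and all method names are placeholders rather than agreed API.
class Campaign:
    def __init__(self):
        self.current_element = None   # the single "active element" slot

    def set_element(self, element):
        self.current_element = element

    def save_state(self, path):
        # Store the serialized state of the active element, so this file
        # can act as a restart point for the VVUQ workflow.
        with open(path, "w") as f:
            json.dump({"element_state": self.current_element.serialize()}, f)

    def load_state(self, path, element_class):
        with open(path) as f:
            state = json.load(f)
        self.current_element = element_class.deserialize(state["element_state"])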

raar1 commented 5 years ago

Based on discussions with @dww100 (and on the work of @orbitfold with the DB backend) I think the base element definition will now have three extra functions:

serialize(), deserialize() and is_restartable()

The first two are to be used to store the (current) state of the element in the database, and to restart from that stored state if needed. is_restartable() will return a bool to indicate whether it is possible to restart such an element.

It is my expectation that completely stochastic sampling elements should be able to save their state easily (for example, simply how many more draws from the distribution they have left). Similarly, Stochastic Collocation and PCE both seem to generate their nodes/weights in a pre-determined sequence, so such a sampling element should be able to store what iteration it reached, and restart from there.
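
As an illustration of that contract (a sketch only, not existing EasyVVUQ code), a sampler that generates its runs in a pre-determined sequence just needs to remember which iteration it reached:

import json

class SequenceSamplerSketch:
    """Illustrative element that draws from a pre-computed list of nodes."""

    def __init__(self, nodes):
        self.nodes = nodes        # pre-determined parameter sets (nodes/weights)
        self.iteration = 0        # how far through the sequence we have got

    def is_restartable(self):
        return True

    def generate_runs(self, n):
        batch = self.nodes[self.iteration:self.iteration + n]
        self.iteration += len(batch)
        return batch

    def serialize(self):
        return json.dumps({"nodes": self.nodes, "iteration": self.iteration})

    @classmethod
    def deserialize(cls, blob):
        state = json.loads(blob)
        sampler = cls(state["nodes"])
        sampler.iteration = state["iteration"]
        return sampler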

Most analysis elements, by their very nature, will likely not be restartable. If the HPC job fails half way through an analysis element, the campaign will simply have to start over for the analysis step.

dww100 commented 5 years ago

Should is_restartable be a function or an attribute?

raar1 commented 5 years ago

I was hoping to put it in the element base.py as e.g.

def is_restartable(self):
    raise NotImplementedError

so anyone making a new element is forced to make it return something. I'm not sure how to enforce that if it's just a variable with (presumably) a default value.

dww100 commented 5 years ago

My feeling is that by default it should just be is_restartable = False. The assumption being that if you are implementing a one-shot sampler, you would know what you are doing.
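
For comparison, the attribute version would just be a class-level default that restartable elements override (again, only a sketch):

class BaseElement:
    # Default: an element is assumed not to be restartable unless it says otherwise.
    is_restartable = False

class RestartableSampler(BaseElement):
    # Elements that can resume simply override the default;
    # a one-shot sampler leaves it as False.
    is_restartable = True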

raar1 commented 5 years ago

OK, that's fine by me. But either way we should probably make it consistent with the way versions are handled for elements currently.

dww100 commented 5 years ago

@djgroen You should have a quick review of this to see that it makes sense to you.

dww100 commented 5 years ago

@raar1 I think we should write up something that looks like a design document based on this. The one issue I think I would like clarified is the execute step. My understanding is that this would not be necessary - i.e. I could:

  1. Sample and Encode a load of runs, save Campaign state, finish script.
  2. Run however I want (i.e. hand rolled batch script, FabSim, small Dask script, QCG, whatever).
  3. Recreate Campaign from saved state in a new script, run analysis (sketched below).
  4. Celebrate great victory.
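
Something like the following, assuming hypothetical save_state()/state_file calls on the campaign (the names here are placeholders, not agreed API):

# Script 1: sample and encode a batch of runs, then save the campaign state.
my_campaign.set_element(uq.elements.sampling.RandomSampler(num=1000))
my_campaign.encode_runs()                      # placeholder for "sample + encode"
my_campaign.save_state("campaign_state.json")  # hypothetical restart file
# ... now run the encoded directories with a batch script, FabSim, Dask, QCG, ...

# Script 2: recreate the campaign from the saved state and analyse.
my_campaign = uq.Campaign(state_file="campaign_state.json")  # hypothetical
my_campaign.set_element(PCEAnalyser())
my_campaign.analyse()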
raar1 commented 5 years ago

I agree with drafting the design document. Here's a first stab (very preliminary) at some pseudocode that kind of fits the general idea:

import os

# User-defined execution function. Has to accept run_dir (the campaign passes it in)
# but doesn't have to use it.
def user_def_exec_fn(run_dir):
    os.system("cd " + run_dir + " && simulation_code\n")

# Set encoder, decoder, aggregator and user-defined execution function for this campaign
my_campaign.set_encoder(EncoderGeneric(delimiter="#"))
my_campaign.set_decoder(DecoderCSV(output_filename=output_filename, output_columns=output_columns))
my_campaign.set_aggregator(Collate())
my_campaign.set_execution(ExecuteLocal(user_def_exec_fn))

# Set the campaign to use a sampling element
sampler = uq.elements.sampling.RandomSampler(num=number_of_samples)
my_campaign.set_element(sampler)

# Run batches of 100 jobs.
while my_campaign.has_runs_remaining():
    my_campaign.run(100)
    my_campaign.aggregate()

# Analysis
...

Note that the execution function is completely user-defined, and can contain anything. The campaign would merely pass relevant info to that function (info the user wouldn't otherwise know), such as where that specific run directory has been placed, but the user doesn't have to use any of it. Then if you ask the campaign to run 100 jobs, it will automatically sandwich that call in between the encoding and decoding steps, and clean up files afterwards.

If you prefer, we could do this in a more fine-grained way, in which the user explicitly calls the encoding and decoding steps. But note that if they call the decoder themselves, we can't have it (optionally) remove files automatically upon completion, because we don't know at what point they will choose to call the decoder.

I suppose I really want to avoid the "encode everything -> execute everything -> decode everything -> aggregate everything" approach we currently have, since it just isn't scalable. I was hoping to enforce (somehow) that a "run" would always encode-execute-decode in one inseparable triplet. Running the decoder as part of the run means we immediately get a distillation of what we want as soon as the execution step terminates, so we can optionally delete the rest of the output files right away (assuming no errors occurred) in cases where space is tight.
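
In the same pseudocode spirit, run(n) could keep the triplet inseparable roughly like this (every name below is a placeholder):

import shutil

# Sketch of a "run" that always performs encode -> execute -> decode as one unit.
def run_batch(campaign, n, cleanup=False):
    for run_info in campaign.current_element.generate_runs(n):
        run_dir = campaign.encoder.encode(run_info)   # 1. encode a single run
        campaign.execution_fn(run_dir)                # 2. execute it (user-defined)
        result = campaign.decoder.decode(run_dir)     # 3. decode immediately
        campaign.store_result(run_info, result)
        if cleanup and result is not None:            # 4. optionally free disk space
            shutil.rmtree(run_dir)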

We also need to think about how/when the aggregator is called, as it will need to work in a gradual fashion.

Any thoughts? Can we stick to pseudocode for now? I just find it easier to evaluate what design choices will mean when I can see what the final script might look like.

dww100 commented 5 years ago

So the encode and decode happen as part of campaign.run()?

Either way I'd rather not have a user create an essentially dummy function - maybe have one as the default execution?

raar1 commented 5 years ago

I was hoping to have encode and decode essentially happen as part of run(), yes. But I'm not sure how best to do this, especially in the PJM case where we want to make the encoding/decoding happen "in parallel". This would require encode, for example, to be some sort of standalone script? How do we pass run info to it, without effectively doing a ton of file coupling?

I'm not too sure what you mean by a dummy function - the user needs to specify what should happen for execution somehow, no? So I would always expect that function to contain some kind of code? Or have I misunderstood your point?

raar1 commented 5 years ago

Perhaps for the standalone encode problem, we could go back to having an if __name__ == "__main__": block that runs using command-line arguments when it detects the encoder is being called as a separate script. This would work for the PJM case. But we'd need to think about how we tell it what it should "encode" - do we pass the path to a dict/JSON file? Or do we pass the name of the campaign, so it has to look in the database? None of these seem elegant...
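
For example, the standalone entry point might look something like this (purely illustrative; the open question is exactly what to pass in, so both options are shown):

import argparse
import json

def lookup_run_in_db(campaign_db, run_id):
    # Placeholder: in the real design this would come from the EasyVVUQ DB layer.
    raise NotImplementedError

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run the encoder as a standalone script")
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--params-json", help="path to a JSON dict describing one run")
    source.add_argument("--campaign-db", help="campaign database to look the run up in")
    parser.add_argument("--run-id", help="run to encode (used with --campaign-db)")
    args = parser.parse_args()

    if args.params_json:
        with open(args.params_json) as f:
            run_params = json.load(f)
    else:
        run_params = lookup_run_in_db(args.campaign_db, args.run_id)
    # ... hand run_params to the encoder ...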

djgroen commented 5 years ago

Hi guys,

Just my two cents, but I think "run" can be a bit ambiguous. It could refer to just executing a job, or to running the whole application. Perhaps it's better to have a separate "execute" function which only does the execution step, and a "do_campaign" function that does encode, execute, decode and aggregation?

Feel free to bully or ignore me if this is off-target, but I thought it could be worthwhile to at least articulate this thought somewhere ;).

bartoszbosak commented 5 years ago

@raar1 I think that if we look at EasyVVUQ from a slightly different perspective, and assume that the campaign can easily be recovered from the DB and used to execute/run each of the steps (e.g. encoding, execution, decoding, ...) by "worker" processes independently, then the actual processing can be moved to the Pilot Job or to FabSim3. It would just require the definition of a common interface for standalone processes, but it seems feasible.

I also vote for @djgroen's proposition to have "do_campaign" rather than "run".

raar1 commented 5 years ago

@bartoszbosak OK I think we're all essentially arguing for the same thing, but what form the interface should take is still ambiguous. Let's say we have N worker processes, what should happen? I can see at least two different classes of approach:

  1. Each (individual) process loads the campaign DB and then starts running jobs (encoding, execution, decoding), but how does it know which runs are assigned to it? This could be predetermined in the single-threaded region of the workflow, or work dynamically. In the latter case the database would have to be capable of parallel writes, to keep it up to date with which jobs have been "claimed".

  2. There is a script running in a single thread that is farming out one run at a time to whichever worker processes are free (obviously the PJM handles this). I believe this is essentially what the example pseudocode presented by PSNC suggested (right?). In such a case, we would have to specify a user-defined execute() function, that would simply submit a job to the PJM to run a small script that itself does encoding, simulation execution and decoding within it. This script could be made very concise with a do_campaign() function as @djgroen suggests, although I find that name misleading too and would prefer a different one (I think aggregation will need to be done in the single threaded region).

I suppose we will, at least at this stage, opt for the type of approach detailed in 2? That's certainly fine with me, but I don't necessarily want to rule out other execution patterns, seeing as this is an HPC project. Also, I certainly do want to push this processing onto the middleware (QCG, FabSim, whatever) and that has always been the goal, but it needs to be done in a controlled and formalized way. I find the approach in 2 to be still quite hacky, and would prefer there to be some kind of formal execution/middleware plugin class that can have different implementations depending on the middleware being used.
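
To make the approach in 2 a bit more concrete, the small per-run script submitted to the PJM might look roughly like this (everything here, including the load-from-DB call, is a placeholder rather than agreed API):

import argparse
import subprocess

def load_campaign_from_db(path):
    # Placeholder: in the real design this would come from the EasyVVUQ DB layer.
    raise NotImplementedError

def process_single_run(campaign_db, run_id):
    # Encode -> execute -> decode for one run, with the result written back to the DB.
    campaign = load_campaign_from_db(campaign_db)
    run_dir = campaign.encoder.encode(campaign.run_info(run_id))
    subprocess.run(["simulation_code"], cwd=run_dir, check=True)
    result = campaign.decoder.decode(run_dir)
    campaign.record_result(run_id, result)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("campaign_db")
    parser.add_argument("run_id", type=int)
    args = parser.parse_args()
    process_single_run(args.campaign_db, args.run_id)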

bartoszbosak commented 5 years ago

Hi @raar1. I think that our idea is a mix of your two ideas ;-) Let's call it approach 3:

  3. As in your proposition 2, there is one master script, written with the help of the preferred tool (e.g. QCG-PJ or FabSim), that manages the whole experiment. In the case of QCG-PJ it is started as the first Pilot Job task. It initialises the campaign object and then submits a number of subsequent tasks (processes) that are responsible for encoding, execution, decoding etc. I can imagine that, depending on particular needs, the processes can have different granularity (e.g. encoding, execution & decoding can be joined or processed independently, an execution task can take many runs or just a single one, and so on). The master script can also do some other processing after the parallel execution step. I think it would be doable to keep the information about how the elementary tasks are performed at the level of the execution middleware, but given the DB it seems better to leave this task to EasyVVUQ. In this case each of the tasks should be allowed to read the campaign object from the DB and to write this object (or at least the information about the particular run) back to the DB. This is essentially what you proposed in your version 1. There is a question to @orbitfold as to whether this is possible, but I hope it can be.

I think that this somewhat relaxed interface to EasyVVUQ wouldn't complicate the usage of the tool for its users (the high-level methods can still be available), but would allow the execution of complex and resource-demanding workflows to be optimised at the level of the particular execution middleware, which seems to be a good place for this job.

raar1 commented 5 years ago

The PJM integration discussion has now moved to issue #73.