output Ensemble Tree details to dedicated output files #1132

Closed SGeeversAtVortech closed 2 weeks ago

In GitLab by @MaartenS on May 16, 2022, 17:36

Project nr: 11208050-089 Client: Rijkswaterstaat

Test system: see files in N:\Deltabox\Postbox\Smoorenburg, Maarten\CIP\RTC-Dev_TreeOutputs\example model\ folder. The 'from obs' is the first run in a closed-loop experiment. You can take any other .zip from the 'from_sim' folder if a more extreme forcing is needed (take the .png as guidance for selecting the T0 of interest.

For better interpretation of the TB-MPC controls and allow specific visualization in Delft-FEWS, some timeseries (see recent development in FEWS-26172) with details of the clustering and the advised control are needed. We here follow the output XML setup that was made for the third party Inzet-Tool of FEWS-Vecht already (not RTC-Tools 1), and expand where necessary.

For the numbering, in our discussion of 20-5-2022 we can up with the follow fixed branch numbering approach, example made with 2 control points (CP) where groups are split into 3 subgroups: 0 - CP - 1, 2, 3 - CP - 4,5,6,7,8,9,10,11,12. The final group branchnr (so 4-12 in the example) can also be taken as the 'groupnr' where the respective ensemble members end up in.

The following new outputs are needed in the ./output folder:

timeseries_export-_details.xml (ensembleId equal to input, ensembleMembers indicated same way as in input) containing:

the ensemble variable 'groupnr' with for each member the nr of the group in the period after the last control point as equal nr for all timesteps; so each ensemble member has 1 single unique nr. e.g. 11,11,11,...11,11,11.
the ensemble variable 'branchnr' with for each member the nr of the group it belongs to for each time step; e.g. 0,0,0,0,3,3,33,11,11,11,11.
a timeseries of parameterId=ensemble_fraction and locationId=TB-MPC (see discussion on defaults below), with the 'probability' (better known as weight) for the branch nr it belongs to (see details in the specs for the second file)
(optional; if we have a good idea and budget allows): a timeseries each ensemble member that indicates how much it drives the solution.

N.B.1: above takes the name of the ensembleId used in the input XML since it is possible to combine different ensembles (e.g. meteo and price, or meteo and initial condition, whereby clustering is done per ensembleId or over all ensembleId's together; correct me if I'm wrong ;)).

timeseries_export-control_details.xml (ensembleId=TB-MPC, ensembleMemberIndex are the integer nr's of the control branches==nr of the branch after the last control point; starting from 0.) containing:

a timeseries of each control variable for each branch nr. Total of all combinations of locationId, parameterId and branchnr.
a timeseries of parameterId=ensemble_fraction_branch and locationId=TB-MPC (see discussion on defaults below), with the 'probability' (better known as relative weight) per branch number (values 0-12 in example above). Until first control point, all timeseries will here equal 1.0, and after the last control point all values sum up to 1.0 with the finite precision that is used; in the period in between, only the unique values together sum up to 1.0)
a timeseries of parameterId=ensemble_fraction_group and locationId=TB-MPC (see discussion on defaults below), with the 'probability' (better known as relative weight (sum of weights of each ensemble member in the group, divided by the total sum of weights of all member, of course per timestep) of the period after the last control point indicated per groupnr (4-12 in example): i.e. single valued for all time steps.
(optional; if we have a good idea and budget allows): a timeseries that for each branch nr indicates how much it drives the solution: timeseries of parameterId=dominance_factor and locationId=TB-MPC.

These outputs are turned on by setting some switch of the PIMixin (and/or CSV if that is general enough). If turned on: the RTC variables default to some names (see suggestion in the text here, but should match with RTC conventions). The rtcDataConfig.xml should then map these variables to FEWS locationId and parameterId to define the output series.

N.B.2. Alongside, Tjerk proposed that more than 9 groups can be formed after each control point, a limitation of the 0-0-0, 0-0-1...0-0-9 approach for numbering the groups separated by control points (the '-', so 2 points in this case). These nr's should be integers, and the adminstration should be done in tuples.

N.B.3: (optional) Make a file called tree_details.??? to store effectively a dictionary to look up how each groupnr matches one tree (that can be a tuple of integers), e.g. {1: (0,1,2,13),...}. If budget allows, we can make this in a way that aligns well with graph technology standards.

In GitLab by @TPiovesan on May 19, 2022, 15:30

I feel like we should be careful about the nr of the group convention. Is it a counter of all (possible) branches? I.e., in total we have sum_j=0^{n} k^n groups where n is the number of branching times of a k-tree. (To be discussed in person)

In GitLab by @MaartenS on May 20, 2022, 15:04

@vreeken : please consider if N.B.1 above is worth considering for now ;)

In GitLab by @TPiovesan on May 20, 2022, 15:42

fyi: the ensemble is uniquely determined by its ensembleMemberIndex. The ensembleId is disregarded.

In GitLab by @vreeken on May 23, 2022, 17:37

a timeseries of parameterId=ensemble_fraction and locationId=TB-MPC (see discussion on defaults below), with the 'probability' (better known as weight) for the branch nr it belongs to (see details in the specs for the second file)

group nr? or branch nr (i.e. timeseries with changing values)

a timeseries of each control variable for each branch nr. Total of all combinations of locationId, parameterId and branchnr.

Not sure that makes sense for branch nrs. Group nr?

In GitLab by @TPiovesan on May 23, 2022, 18:39

Confusing terminology indeed. I'm assuming that _groupnr is the subset of the _branchnr referring to the leaves of the tree. If that is the case, it should be:

The ensemble_fraction was for the _branchnr.
The control for the _groupnr.

In GitLab by @vreeken on May 23, 2022, 18:44

First version

```python def output_control_tree_details(self): import bisect import itertools branches = self.control_tree_branches k = self.control_tree_options()["k"] if len(branches) > 1: assert max(x[0] for x in branches.keys() if len(x) > 1) == k - 1 # Exhaustive number of branches, including empty leaves of empty leaves n_splits = max(len(x) for x in branches.keys()) branches_keys_exhaustive = list(itertools.chain.from_iterable([ itertools.product(range(k), repeat=n) for n in range(n_splits + 1) ])) branches_sorted = {k: branches.get(k, list()) for k in branches_keys_exhaustive} branches_seq = {i: v for i, (k, v) in enumerate(branches_sorted.items())} # Invert, i.e. ensemble member to (sequential) branch number ensemble_member_to_branch = {} for branch, members in branches_seq.items(): for m in members: ensemble_member_to_branch.setdefault(m, list()).append(branch) times = self.times() branching_times = self.control_tree_options()["branching_times"] section_indices = [bisect.bisect_right(branching_times, x) for x in self.times()] timeseries = [{} for _ in range(self.ensemble_size)] # Dict[Tuple[location, parameter, ensemble_member], np.ndarray] for ensemble_member, branches in ensemble_member_to_branch.items(): branchnr_ts = Timeseries(times, [branches[x] for x in section_indices]) timeseries[ensemble_member]["branch_nr"] = branchnr_ts groupnr_ts = Timeseries(times, branches[-1]) timeseries[ensemble_member]["group_nr"] = groupnr_ts # Ensemble weights per branch weights = [self.ensemble_member_probability(e) for e in range(self.ensemble_size)] total_weight = sum(weights) weights_per_branch = [0.0] * len(branches_seq) for b, members in branches_seq.items(): for m in members: weights_per_branch[b] += weights[m] weights_per_branch[b] /= total_weight for ensemble_member, branches in ensemble_member_to_branch.items(): branch_nr = [branches[x] for x in section_indices] branch_nr_weights_ts = Timeseries(times, [weights_per_branch[x] for x in branch_nr]) timeseries[ensemble_member]["branch_nr_ensemble_fraction"] = branch_nr_weights_ts group_nr_weights_ts = Timeseries(times, weights_per_branch[branches[-1]]) timeseries[ensemble_member]["group_nr_ensemble_fraction"] = group_nr_weights_ts self._write_to_xml("timeseries_export_details", timeseries) # And now the other way around. Ensemble member is not "self.ensemble_size", but the number of branches (i.e. len(branches_seq)). timeseries = [{} for _ in range(len(branches_seq))] first_group_nr = next(i for i, v in enumerate(branches_keys_exhaustive) if len(v) == max(len(x) for x in branches_keys_exhaustive)) for b in range(first_group_nr, len(branches_seq)): if not branches_seq[b]: continue ensemble_member = branches_seq[b][0] results = self.extract_results(ensemble_member) for v in self.dae_variables["control_inputs"]: timeseries[b][v.name()] = Timeseries(times, results[v.name()]) # group nr to branch nrs via ensemble member. And then a timeseries of weights. branch_nr = [ensemble_member_to_branch[ensemble_member][x] for x in section_indices] timeseries[b]["branch_nr"] = Timeseries(times, branch_nr) branch_nr_weights_ts = Timeseries(times, [weights_per_branch[x] for x in branch_nr]) timeseries[b]["branch_nr_ensemble_fraction"] = branch_nr_weights_ts group_nr_weights_ts = Timeseries(times, weights_per_branch[b]) timeseries[b]["group_nr_ensemble_fraction"] = group_nr_weights_ts self._write_to_xml("timeseries_export_control_details", timeseries) from typing import List, Dict def _write_to_xml(self, output_basename: str, timeseries: List[Dict[str, np.ndarray]]): from datetime import timedelta import rtctools.data.pi as pi import rtctools.data.rtc as rtc assert isinstance(self, PIMixin) data_config = rtc.DataConfig(self._input_folder) times = self.times() # Get time stamps if len(set(times[1:] - times[:-1])) == 1: dt = timedelta(seconds=times[1] - times[0]) else: dt = None timeseries_export = pi.Timeseries( data_config, self._output_folder, output_basename, binary=False, pi_validate_times=False, make_new_file=True) # Write the time range for the export file. timeseries_export.times = [self.io.reference_datetime + timedelta(seconds=s) for s in times] # Write other time settings timeseries_export.forecast_datetime = self.io.reference_datetime timeseries_export.dt = dt timeseries_export.timezone = self.timeseries_import.timezone # Write the ensemble properties for the export file. timeseries_export.contains_ensemble = True timeseries_export.ensemble_size = len(timeseries) timeseries_export.contains_ensemble = True # Start looping over the ensembles for extraction of the output values. for ensemble_member in range(len(timeseries)): for variable, ts in timeseries[ensemble_member].items(): values = ts.values # Check if ID mapping is present try: data_config.pi_variable_ids(variable) except KeyError: logger.debug( 'PIMixin: variable {} has no mapping defined in rtcDataConfig ' 'so cannot be added to the output file.'.format(variable)) continue # Add series to output file. # NOTE: We use the unit of the zeroth ensemble member, as # we might be outputting more ensembles than we read in. timeseries_export.set( variable, values, unit=self.timeseries_import.get_unit(variable, ensemble_member=0), ensemble_member=ensemble_member) timeseries_export.write() ```

Not sure where to put this just yet. It doesn't belong in PIMixin, as it relies on ControlTreeMixin. It doesn't belong in the latter either, because it relies on PIMixin.

I also do not use set_timeseries, as that would output everything in the main timeseries_export.xml file (which is not what we want afaict).

In GitLab by @MaartenS on May 23, 2022, 21:49

in summary, this means that if one wants to combine different ensemble products, some preprocessing tooling is needed on FEWS or RTC side. That's fine, but good to know!

In GitLab by @MaartenS on May 23, 2022, 22:01

For the control, the ensembleMemberIndex equals the groupnr's (==leafnr's==number of the final branch it belongs to). In the example these indexes are thus 4...12, whereby the indexes that have no members have only missing data in the output.

For convenience, I think we should in both files put the ensemble_fraction_branch as well as the ensemble_fraction_final variables, respectively the probability at each time step and the timeseries filled with the probabilities of the last time step (so belonging to the groupnr=leaf)

In GitLab by @MaartenS on May 24, 2022, 14:47

Thanks Tjerk, As just discussed, it is fine to keep it as user code now because we are still prototyping and maybe our design is not general enough to land in RTC-Tools main code. Also, we are only considering XML here, not CSV and NetCDF (but similar series could be put there too). What's next, preferably in the following order: 1) code clean-up and documentation 2) testing if results are as desired

In GitLab by @MaartenS on Nov 27, 2022, 17:13

Works like a charm! Only want to add some line to cleverly computed branch based postprocessing and then also output that. That's not enough to keep this issue open, so I'll close.

Deltares / rtc-tools

output Ensemble Tree details to dedicated output files #1132