As an example of the output of this function, passing in `variable_to_parse` as `{'DispatchGen': '1'}` (i.e. only parse out the DispatchGen data), I get the following:
```
{'DispatchGen': g10390i g10390 g10391i g10391 g10392i g10392 g10393i g10393 g10394i g10394 g10395i g10395 g10396i g10396 ... g14013i g14013 g14014i g14014 g14015i g14015 g14016i g14016 g14017i g14017 g14018i g14018 g14019i g14019
1/1/2016 0:00 0.0 78.59986 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 6.366159 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1/1/2016 1:00 0.0 78.59986 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 6.366159 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1/1/2016 2:00 0.0 78.59986 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 6.366159 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1/1/2016 3:00 0.0 88.401611 0.0 88.399765 0.0 88.399765 0.0 88.399765 0.0 88.399765 0.0 88.399765 0.0 7.160047 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1/1/2016 4:00 0.0 84.958371 0.0 84.956598 0.0 84.956598 0.0 84.956598 0.0 84.956598 0.0 84.956598 0.0 6.881164 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12/31/2016 19:00 0.0 79.269303 0.0 79.267641 0.0 79.267641 0.0 79.267641 0.0 79.267641 0.0 79.267641 0.0 6.42038 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12/31/2016 20:00 0.0 79.269303 0.0 79.267641 0.0 79.267641 0.0 79.267641 0.0 79.267641 0.0 79.267641 0.0 6.42038 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12/31/2016 21:00 0.0 79.269303 0.0 79.267641 0.0 79.267641 0.0 79.267641 0.0 79.267641 0.0 79.267641 0.0 6.42038 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12/31/2016 22:00 0.0 78.59986 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 6.366159 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12/31/2016 23:00 0.0 78.59986 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 6.366159 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[8784 rows x 5232 columns]}
```
Where is the bottleneck for the current code? 20 minutes is awfully long; if we can't improve this much, then maybe we want to be reading from the tab-separated-value outputs.
My guess is the way I'm building the pandas dataframe is not the most efficient. Right now, I'm initializing a dataframe with the timeseries index, but no columns. There's a loop parsing through the variables dictionary that adds new columns as new column values are found and fills in the gaps of timepoints that haven't been seen yet. I think looping through the dictionary twice, the first time to just gather column names so that we can initialize the full dataframe and the second to fill in values, would be faster as I suspect adding columns to an existing dataframe is expensive. I can test this out.
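For illustration, here's a minimal sketch of the two construction patterns (made-up column names; the row count matches the output above):

```python
import numpy as np
import pandas as pd

n_rows = 8784  # one year of hourly timepoints, as in the output above
index = pd.RangeIndex(n_rows)

# Current pattern: grow the frame one column at a time. Each insertion
# can trigger internal copying/consolidation, so total cost grows
# quickly with the number of columns.
df_slow = pd.DataFrame(index=index)
for i in range(100):  # only 100 columns here to keep the demo quick
    df_slow[f"g{i}"] = np.zeros(n_rows)

# Two-pass pattern: collect all column arrays first, then build the
# frame in a single constructor call.
data = {f"g{i}": np.zeros(n_rows) for i in range(100)}
df_fast = pd.DataFrame(data, index=index)
```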
It may also be faster to identify which output variables to expect using the input data; that should allow you to start with a data frame of the proper size before filling in values.
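A minimal sketch of what I mean, assuming the expected columns (hypothetical names below) can be derived from the input data up front:

```python
import pandas as pd

# Hypothetical column list; in practice this would come from scanning
# the input data before parsing the pickle file.
timestamps = pd.date_range("2016-01-01", periods=8784, freq="H")
expected_columns = ["g10390i", "g10390", "g10391i", "g10391"]

# Allocate the full-size frame once...
df = pd.DataFrame(0.0, index=timestamps, columns=expected_columns)

# ...then fill values in place as the dictionary is parsed.
df.loc["2016-01-01 01:00", "g10390"] = 78.59986
```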
You may also be able to speed up the variable-name processing by using regular expressions to parse the names, if that part of the process is an appreciable part of the total time.
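Something along these lines; note the `"Name[param1,...,timepoint]"` key format below is an assumption for illustration, so the pattern would need adjusting to the real keys:

```python
import re

key = "DispatchGen[g10390,12]"  # hypothetical key layout
match = re.fullmatch(r"(?P<name>\w+)\[(?P<params>.*)\]", key)
if match:
    # Split off the trailing timepoint from the other parameters.
    *params, timepoint = match.group("params").split(",")
    print(match.group("name"), params, timepoint)  # DispatchGen ['g10390'] 12
```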
I found something that works! Now it takes <15 seconds to process.
Unfortunately, I was unable to get Bainan's suggestion of preallocating the dataframe to work, as allocating the empty dataframe still took several minutes. Thanks to @kasparm's help, I was able to use the `explode` functionality in pandas and some preprocessing to get the time down significantly. Since there's still a preprocessing step to reformat the dictionary before transforming it into a pandas dataframe, there might still be timing issues when we scale to a larger grid and more timepoints, but hopefully this is a better start.
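In rough outline, the idea looks something like this (a minimal sketch with made-up data, not the exact code):

```python
import pandas as pd

# After preprocessing, each column name maps to its full list of values
# in timepoint order (hypothetical data).
prepped = {
    "g10390i": [0.0, 0.0, 0.0],
    "g10390": [78.59986, 78.59986, 88.401611],
}

# explode() turns the per-column lists into rows in one vectorized step,
# instead of growing the frame cell by cell.
long = (
    pd.Series(prepped, name="Value")
    .rename_axis("column")
    .explode()
    .reset_index()
)
long["timepoint"] = long.groupby("column").cumcount()
wide = long.pivot(index="timepoint", columns="column", values="Value")
print(wide)
```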
I'll be cleaning up the new code/addressing the docstring issues this afternoon before converting this to an actual PR.
How would instantiating an empty dataframe take several minutes? The only thing I can think of is if the indices are not set right and you're trying to use more memory than your computer has available.
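For reference, a quick back-of-the-envelope check on the frame shown above:

```python
# 8784 rows x 5232 columns of float64 is roughly 0.37 GB, which should
# allocate in well under a second on most machines; minutes suggests
# something else (e.g. object dtype or index misalignment) is at play.
rows, cols = 8784, 5232
print(f"{rows * cols * 8 / 1e9:.2f} GB")  # 0.37 GB
```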
Since this produces an intermediate data structure, what will the inputs/outputs be for the function(s) that eventually return the PF/PG/LOAD_SHED?
Yeah, describing the inputs/outputs of this function is a little tricky. Because of the nested nature of the original data structure, I felt like I was reusing a lot of the same names for things at different levels of abstraction.
The original pickle file data is stored in a flat dictionary where the keys are a combination of variable names and variable parameters, and the values are a dictionary where `Value` is the datapoint for that combination of variable name and parameters.

The output dictionary for this function is a dataframe for each variable passed in. The index of these dataframes is the timestamps contained in the `mapping` file passed in, which maps timepoints to timestamps. The columns of these dataframes are a comma-separated string of the parameters embedded in the key of the original pickle file dictionary, with the timepoint removed and the order otherwise preserved.
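To illustrate both shapes (the exact key layout below is an assumption for readability; the real keys may differ):

```python
import pandas as pd

# Hypothetical shape of the raw pickle contents: a flat dict keyed by
# variable name plus parameters, with the timepoint as one parameter.
raw = {
    "DispatchGen[g10390i,1]": {"Value": 0.0},
    "DispatchGen[g10390,1]": {"Value": 78.59986},
}

# Hypothetical shape of this function's output: one dataframe per
# requested variable, indexed by timestamp, with columns named by the
# remaining (non-timepoint) parameters.
output = {
    "DispatchGen": pd.DataFrame(
        {"g10390i": [0.0], "g10390": [78.59986]},
        index=pd.to_datetime(["2016-01-01 00:00"]),
    )
}
```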
Do you think these two explanations make sense? I was thinking, then, that the additional functions for PF/PG/LOAD_SHED, etc. would be mostly remapping the column names from the pickle file output parameter tuple to what the `Scenario` actually expects (e.g. `g10774` --> `10774`). Besides perhaps some unit conversions/normalization, I was assuming the data content within the dataframe would be mostly the same.
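As a sketch of that remapping:

```python
import pandas as pd

df = pd.DataFrame({"g10774": [0.0], "g10775": [1.5]})

# Strip the "g" prefix to recover the integer ids the Scenario expects.
df.columns = [int(col.lstrip("g")) for col in df.columns]
print(df.columns.tolist())  # [10774, 10775]
```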
I think it would be clearer if we marked the types in the `:param` lines as `str`.
This function seems to return a dictionary of data frames (keys are strings), where the indices are timestamps and the columns are switch indices for branches/plants/buses. The function that would use the outputs from this function to produce the PF/PG/LOAD_SHED would presumably also take a Grid object, and then return pandas data frames (in either a tuple or a dictionary) that have the same index, but have columns that match what's in the Grid object, right? I'm just trying to get an idea of the big picture that this PR serves.
Makes sense for your two points up above.
I think whether the following function takes a `Grid` object or just some sort of mapping function/other mapping object that carries the same information is a good design question outside the scope of this PR (does this mean #47 depends on #49?). But as far as the output is concerned, yes, that was what I was imagining too. The point of this function was to make it generic enough that it could potentially be reused when, for example, we add storage and need to do the same timepoint<>timestamp conversion for the timeseries data it produces, assuming those results follow the same output data structure.
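Roughly, I was picturing the follow-up function looking something like this (all names hypothetical):

```python
import pandas as pd

def build_pg(parsed: dict, grid) -> pd.DataFrame:
    """Hypothetical follow-up step: map the intermediate DispatchGen frame
    onto the plant ids in the given Grid (or Grid-like mapping) object."""
    pg = parsed["DispatchGen"].copy()
    # e.g. rename switch-style columns to the Grid's plant index here,
    # plus any unit conversions/normalization.
    return pg
```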
Pull Request doc
Purpose
Create a function to convert the data contained in the output pickle file produced by `switch` into a pandas dataframe with the correct timestamps to index each data point. Note that creating the actual `PF`, `PG`, etc. files is not included as part of this functionality. This is intended to be a general-purpose function, an intermediary to reformat the data for further manipulation into the format expected by `PostREISE`.
What the code is doing
There are 2 functions in the `helpers.py` file: one to read in the mapping file that will eventually be generated by the input, and the other to create the transformed data frames. There is also a new dictionary mapping in `const.py` to record which parameter is the timepoint for each timeseries variable.
Where to look
helpers.py
Usage Example/Visuals
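A hedged sketch of how the two functions might be called together (the function names and paths here are hypothetical; see helpers.py for the real ones):

```python
import pickle

# Hypothetical names and paths for illustration only.
from helpers import read_timestamp_mapping, parse_timeseries_solution

mapping = read_timestamp_mapping("inputs/timepoints.csv")
with open("outputs/results.pickle", "rb") as f:
    raw = pickle.load(f)

frames = parse_timeseries_solution(raw, mapping, {"DispatchGen": "1"})
print(frames["DispatchGen"].head())
```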
Time estimate
15 min