As an example of the output of this function, passing in `variable_to_parse` as `{'DispatchGen': '1'}` (i.e. only parse out the DispatchGen data), I get the following:
```
{'DispatchGen': g10390i g10390 g10391i g10391 g10392i g10392 g10393i g10393 g10394i g10394 g10395i g10395 g10396i g10396 ... g14013i g14013 g14014i g14014 g14015i g14015 g14016i g14016 g14017i g14017 g14018i g14018 g14019i g14019
1/1/2016 0:00 0.0 78.59986 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 6.366159 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1/1/2016 1:00 0.0 78.59986 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 6.366159 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1/1/2016 2:00 0.0 78.59986 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 6.366159 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1/1/2016 3:00 0.0 88.401611 0.0 88.399765 0.0 88.399765 0.0 88.399765 0.0 88.399765 0.0 88.399765 0.0 7.160047 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1/1/2016 4:00 0.0 84.958371 0.0 84.956598 0.0 84.956598 0.0 84.956598 0.0 84.956598 0.0 84.956598 0.0 6.881164 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12/31/2016 19:00 0.0 79.269303 0.0 79.267641 0.0 79.267641 0.0 79.267641 0.0 79.267641 0.0 79.267641 0.0 6.42038 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12/31/2016 20:00 0.0 79.269303 0.0 79.267641 0.0 79.267641 0.0 79.267641 0.0 79.267641 0.0 79.267641 0.0 6.42038 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12/31/2016 21:00 0.0 79.269303 0.0 79.267641 0.0 79.267641 0.0 79.267641 0.0 79.267641 0.0 79.267641 0.0 6.42038 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12/31/2016 22:00 0.0 78.59986 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 6.366159 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12/31/2016 23:00 0.0 78.59986 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 78.598211 0.0 6.366159 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[8784 rows x 5232 columns]}
```
Where is the bottleneck for the current code? 20 minutes is awfully long; if we can't improve this much, then maybe we want to be reading from the tab-separated-value outputs.
My guess is the way I'm building the pandas dataframe is not the most efficient. Right now, I'm initializing a dataframe with the timeseries index, but no columns. There's a loop parsing through the variables dictionary that adds new columns as new column values are found and fills in the gaps of timepoints that haven't been seen yet. I think looping through the dictionary twice, the first time to just gather column names so that we can initialize the full dataframe and the second to fill in values, would be faster as I suspect adding columns to an existing dataframe is expensive. I can test this out.
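For illustration, here's a minimal sketch of the two construction patterns (made-up column names; the row count matches the output above):

```python
import numpy as np
import pandas as pd

n_rows = 8784  # one year of hourly timepoints, as in the output above
index = pd.RangeIndex(n_rows)

# Current pattern: grow the frame one column at a time. Each insertion
# can trigger internal copying/consolidation, so total cost grows
# quickly with the number of columns.
df_slow = pd.DataFrame(index=index)
for i in range(100):  # only 100 columns here to keep the demo quick
    df_slow[f"g{i}"] = np.zeros(n_rows)

# Two-pass pattern: collect all column arrays first, then build the
# frame in a single constructor call.
data = {f"g{i}": np.zeros(n_rows) for i in range(100)}
df_fast = pd.DataFrame(data, index=index)
```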
It may also be faster to identify which output variables to expect using the input data; that should allow you to start with a data frame of the proper size before filling in values.
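A minimal sketch of what I mean, assuming the expected columns (hypothetical names below) can be derived from the input data up front:

```python
import pandas as pd

# Hypothetical column list; in practice this would come from scanning
# the input data before parsing the pickle file.
timestamps = pd.date_range("2016-01-01", periods=8784, freq="H")
expected_columns = ["g10390i", "g10390", "g10391i", "g10391"]

# Allocate the full-size frame once...
df = pd.DataFrame(0.0, index=timestamps, columns=expected_columns)

# ...then fill values in place as the dictionary is parsed.
df.loc["2016-01-01 01:00", "g10390"] = 78.59986
```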
You may also be able to speed up the variable-name processing by using regular expressions to parse the names, if that part of the process is an appreciable part of the total time.
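Something along these lines; note the `"Name[param1,...,timepoint]"` key format below is an assumption for illustration, so the pattern would need adjusting to the real keys:

```python
import re

key = "DispatchGen[g10390,12]"  # hypothetical key layout
match = re.fullmatch(r"(?P<name>\w+)\[(?P<params>.*)\]", key)
if match:
    # Split off the trailing timepoint from the other parameters.
    *params, timepoint = match.group("params").split(",")
    print(match.group("name"), params, timepoint)  # DispatchGen ['g10390'] 12
```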
I found something that works! Now it takes <15 seconds to process.
Unfortunately, I was unable to get Bainan's suggestion of preallocating the dataframe to work, as allocating the empty dataframe still took several minutes. Thanks to @kasparm's help, I was able to use the `explode` functionality in pandas and some preprocessing to get the time down significantly. Since there's still a preprocessing step to reformat the dictionary before transforming it into a pandas dataframe, there might still be timing issues when we scale to a larger grid and more timepoints, but hopefully this is a better start.
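In rough outline, the idea looks something like this (a minimal sketch with made-up data, not the exact code):

```python
import pandas as pd

# After preprocessing, each column name maps to its full list of values
# in timepoint order (hypothetical data).
prepped = {
    "g10390i": [0.0, 0.0, 0.0],
    "g10390": [78.59986, 78.59986, 88.401611],
}

# explode() turns the per-column lists into rows in one vectorized step,
# instead of growing the frame cell by cell.
long = (
    pd.Series(prepped, name="Value")
    .rename_axis("column")
    .explode()
    .reset_index()
)
long["timepoint"] = long.groupby("column").cumcount()
wide = long.pivot(index="timepoint", columns="column", values="Value")
print(wide)
```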
I'll be cleaning up the new code/addressing the docstring issues this afternoon before converting this to an actual PR.
How would instantiating an empty dataframe take several minutes? The only thing I can think of is if the indices are not set right and you're trying to use more memory than your computer has available.
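For reference, a quick back-of-the-envelope check on the frame shown above:

```python
# 8784 rows x 5232 columns of float64 is roughly 0.37 GB, which should
# allocate in well under a second on most machines; minutes suggests
# something else (e.g. object dtype or index misalignment) is at play.
rows, cols = 8784, 5232
print(f"{rows * cols * 8 / 1e9:.2f} GB")  # 0.37 GB
```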
Since this produces an intermediate data structure, what will the inputs/outputs be for the function(s) that eventually return the PF/PG/LOAD_SHED?
Yeah, describing the inputs/outputs of this function is a little tricky. Because of the nested nature of the original data structure, I felt like I was reusing a lot of the same names for things at different levels of abstraction.
The original pickle file data is stored in a flat dictionary where the keys are a combination of variable names and variable parameters, and the values are a dictionary where `Value` is the datapoint for that combination of variable name and parameters.

The output dictionary for this function is a dataframe for each variable passed in. The index of these dataframes is the timestamps contained in the `mapping` file passed in, which maps timepoints to timestamps. The columns of these dataframes are a comma-separated string of the parameters embedded in the key of the original pickle file dictionary, with the timepoint removed and the order otherwise preserved.
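To illustrate both shapes (the exact key layout below is an assumption for readability; the real keys may differ):

```python
import pandas as pd

# Hypothetical shape of the raw pickle contents: a flat dict keyed by
# variable name plus parameters, with the timepoint as one parameter.
raw = {
    "DispatchGen[g10390i,1]": {"Value": 0.0},
    "DispatchGen[g10390,1]": {"Value": 78.59986},
}

# Hypothetical shape of this function's output: one dataframe per
# requested variable, indexed by timestamp, with columns named by the
# remaining (non-timepoint) parameters.
output = {
    "DispatchGen": pd.DataFrame(
        {"g10390i": [0.0], "g10390": [78.59986]},
        index=pd.to_datetime(["2016-01-01 00:00"]),
    )
}
```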
Do you think these two explanations make sense? I was thinking, then, that the additional functions for PF/PG/LOAD_SHED, etc. would be mostly remapping the column names from the pickle file output parameter tuple to what the `Scenario` actually expects (e.g. `g10774` --> `10774`). Besides perhaps some unit conversions/normalization, I was assuming the data content within the dataframe would be mostly the same.
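As a sketch of that remapping:

```python
import pandas as pd

df = pd.DataFrame({"g10774": [0.0], "g10775": [1.5]})

# Strip the "g" prefix to recover the integer ids the Scenario expects.
df.columns = [int(col.lstrip("g")) for col in df.columns]
print(df.columns.tolist())  # [10774, 10775]
```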
I think it would be clearer if we marked the types in the `:param` lines as `str`.
This function seems to return a dictionary of data frames (keys are strings), where the indices are timestamps and the columns are switch indices for branches/plants/buses. The function that would use the outputs from this function to produce the PF/PG/LOAD_SHED would presumably also take a Grid object, and then return pandas data frames (in either a tuple or a dictionary) that have the same index, but have columns that match what's in the Grid object, right? I'm just trying to get an idea of the big picture that this PR serves.
Makes sense for your two points up above.
I think whether the following function takes a `Grid` object or just some sort of mapping function/other mapping object that carries the same information is a good design question outside the scope of this PR (does this mean #47 depends on #49?). But as far as the output is concerned, yes, that was what I was imagining too. The point of this function was to make it generic enough that it could potentially be reused when, for example, we add storage and need to do the same timepoint<>timestamp conversion for the timeseries data it produces, assuming those results follow the same output data structure.
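Roughly, I was picturing the follow-up function looking something like this (all names hypothetical):

```python
import pandas as pd

def build_pg(parsed: dict, grid) -> pd.DataFrame:
    """Hypothetical follow-up step: map the intermediate DispatchGen frame
    onto the plant ids in the given Grid (or Grid-like mapping) object."""
    pg = parsed["DispatchGen"].copy()
    # e.g. rename switch-style columns to the Grid's plant index here,
    # plus any unit conversions/normalization.
    return pg
```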
Pull Request doc
Purpose
Create a function to convert the data contained in the output pickle file produced by `switch` into a pandas dataframe with the correct timestamps to index each data point. Note that creating the actual `PF`, `PG`, etc. files is not included as part of this functionality. This is intended to be a general-purpose function, an intermediary to reformat the data for further manipulation into the format expected by `PostREISE`.
What the code is doing
There are 2 functions in the `helpers.py` file: one to read in the mapping file that will eventually be generated by the input, and the other to create the transformed data frames. There is also a new dictionary mapping in `const.py` to record which parameter is the timepoint for each timeseries variable.
Where to look
helpers.py
Usage Example/Visuals
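A hedged sketch of how the two functions might be called together (the function names and paths here are hypothetical; see helpers.py for the real ones):

```python
import pickle

# Hypothetical names and paths for illustration only.
from helpers import read_timestamp_mapping, parse_timeseries_solution

mapping = read_timestamp_mapping("inputs/timepoints.csv")
with open("outputs/results.pickle", "rb") as f:
    raw = pickle.load(f)

frames = parse_timeseries_solution(raw, mapping, {"DispatchGen": "1"})
print(frames["DispatchGen"].head())
```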
Time estimate
15 min