Closed: misi9170 closed this 2 years ago
Thanks for tackling this, Misha!
Could we rewrite:
```python
# Rewrite to avoid fragmentation
df_out = pd.concat([df_out, pd.DataFrame(values_median, index=df_out.index, columns=["{:s}_median".format(c) for c in cols_angular])], axis=1)
df_out = pd.concat([df_out, pd.DataFrame(values_min, index=df_out.index, columns=["{:s}_min".format(c) for c in cols_angular])], axis=1)
df_out = pd.concat([df_out, pd.DataFrame(values_max, index=df_out.index, columns=["{:s}_max".format(c) for c in cols_angular])], axis=1)
df_out = pd.concat([df_out, pd.DataFrame(values_std, index=df_out.index, columns=["{:s}_std".format(c) for c in cols_angular])], axis=1)
```
to
```python
# Rewrite to avoid fragmentation
df_out = pd.concat(
    [
        df_out,
        pd.DataFrame(values_median, index=df_out.index, columns=["{:s}_median".format(c) for c in cols_angular]),
        pd.DataFrame(values_min, index=df_out.index, columns=["{:s}_min".format(c) for c in cols_angular]),
        pd.DataFrame(values_max, index=df_out.index, columns=["{:s}_max".format(c) for c in cols_angular]),
        pd.DataFrame(values_std, index=df_out.index, columns=["{:s}_std".format(c) for c in cols_angular]),
    ],
    axis=1,
)
```
This could speed things up a bit, especially if df_out contains a large amount of data. Similar changes could be made in other parts of the code where df_out is concatenated to multiple times in a row.
It would also be good to benchmark the performance of the fix in time_operations.py for large datasets. My previous solution for finding the data indices corresponding to the time-averaging windows felt very inefficient, but it was much faster than for-loops for large datasets, if I recall correctly. I found the time efficiency to be a breaking issue when working with high-resolution (1 Hz) data, and hence was sticking to the previous approach. I do feel like there must be a function inside pandas that could just provide us with that information, but I was not able to find it when I implemented this code...
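As one possible direction (a minimal sketch only, not code from the FLASC repo; the function name, the `window` default, and the assumption of a sorted `time` column are all hypothetical), `numpy.searchsorted` can map window edges to row-index ranges without an explicit Python loop, and `pandas` `resample()`/`pd.Grouper` may be the built-in functionality being alluded to:

```python
import numpy as np
import pandas as pd

def window_index_ranges(df, window="600s"):
    """Map fixed time-averaging windows to (start, end) row-index ranges.

    Sketch only: assumes df["time"] is a sorted datetime column.
    Alternatively, resample()/pd.Grouper can group by time windows directly.
    """
    edges = pd.date_range(
        start=df["time"].iloc[0].floor(window),
        end=df["time"].iloc[-1],
        freq=window,
    )
    start_idx = np.searchsorted(df["time"].values, edges.values, side="left")
    end_idx = np.append(start_idx[1:], len(df))
    return list(zip(start_idx, end_idx))
```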
Hi @Bartdoekemeijer and @misi9170, I'd like to jump in here! I think you're right, @Bartdoekemeijer: there are good vectorized pandas functions for a lot of the things we do, and I wanted to propose working our way through the files and redoing some of them.
I just finished this book: https://store.metasnake.com/effective-pandas-book (there's a short summary of the key concepts here: https://www.aidancooper.co.uk/pandas-anti-patterns/ ), which I think presents a nicely comprehensive approach to pandas coding that I was hoping to adopt in FLASC. In my own work I've been hitting out-of-memory errors. It seems like we could make progress by switching to this sort of chained, non-mutating workflow and focusing on vectorized calls. This is maybe broader than this pull request, but I thought it would be nice to touch base on what I was thinking.
Maybe we need to start by setting up the time-benchmarking functions and then start working the changes through.
Thanks for the inputs @paulf81! I'm a big proponent of revisiting the existing code and changing it to vectorized calls as much as we can! Also, I had tried to code everything up to prevent memory issues. I think many of the FLASC scripts can be run sequentially, so you can run the same script for chunks of the data, e.g., 1 week or 1 month of data at a time. Though again, happy to have that code revisited and to think about it together!
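For illustration only (not code from the FLASC repo; `df`, the column names, and `process_chunk` are all made-up stand-ins), the chunked workflow described above could look something like:

```python
import pandas as pd

# "df" mimics a large SCADA table with a "time" column; "process_chunk"
# stands in for whichever FLASC script is being run on each chunk.
df = pd.DataFrame({
    "time": pd.date_range("2020-01-01", periods=6, freq="15D"),
    "pow_000": [1500.0, 1520.0, 1480.0, 1510.0, 1490.0, 1530.0],
})

def process_chunk(df_chunk):
    # Placeholder for the real per-chunk processing.
    print(df_chunk["time"].min(), len(df_chunk))

# Process one calendar month at a time to keep memory bounded.
for month, df_month in df.groupby(pd.Grouper(key="time", freq="MS")):
    if not df_month.empty:
        process_chunk(df_month)
```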
Thanks @Bartdoekemeijer! Yeah, I definitely didn't mean to say there were any bugs or anything; I was just thinking more of opportunities for reworking the pandas usage away from some patterns that I think can lead to subtle issues. An example is when you get the "setting a value on a copy" warning: we typically add a .copy(), but I believe there is another approach. I just uploaded a notebook where I propose a replacement for the moving average function. I think I've cut the time down some, but also, following these new guidelines, I think I got some general best practices in. What I was trying to do:
1) No mutations, i.e., no code that changes an existing dataframe in place. For example, I think `df['new_col'] = x` is a mutation (see the short sketch after this list).
2) Chained operations, so that all temporary memory assignments get released when the operation ends with the assignment of a dataframe.
3) Trying wherever possible to use whole-dataframe operations with base functions, versus apply calls of specific functions.
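As a purely illustrative contrast between a mutation and a chained, non-mutating workflow (the dataframe and column names below are made up, not FLASC code):

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"ws": [7.9, 8.1, 8.3], "pow": [1500.0, 1540.0, 1580.0]})

# (1) Mutation: modifies df in place, and can trigger the
# "setting a value on a copy" warning when df is a slice of another frame.
df["pow_mw"] = df["pow"] / 1000.0

# (2)/(3) Chained, whole-dataframe operations: the result is built in a
# single expression and intermediates are released once df_out is assigned.
df_out = (
    df
    .assign(pow_mw=lambda d: d["pow"] / 1000.0,
            ws_bin=lambda d: d["ws"].round(0))
    .query("ws_bin >= 8")
)
```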
Anyway, I'm still groping my way through all this new stuff, but my understanding is that, for example, the "setting a value on a copy" warning is indicative of a possible lack of robustness: the behavior of the code can be unpredictable, whereas this style of coding should never have those issues.
If you don't mind looking over the notebook (@Bartdoekemeijer and @misi9170), I would be curious what you think!!
Thanks for doing this work, @paulf81. These speed-ups and better usage of Pandas will surely help a lot!
Scrolling through the notebook, I see there are a couple of differences in the solutions between your method and my old one. When calculating the `_min` and `_max` values for the angular columns, I first shift all the data to be zero around the mean. For example, if you have an array of [358.0, 1.0, 4.0], then the mean value is 1.0, the `min` is 358.0 and the `max` is 4.0 in my approach. I think this makes sense in the angular sense, essentially saying the `min` and `max` are the values left-most and right-most of the mean on the unit circle. If you take the simple `min` and `max` of the array, I can foresee issues with, for example, the following array: [350.0, 359.0, 1.0, 10.0]. I think this explains the difference in those values between your approach and my old one. I'd love to hear your thoughts on this. The `std` values also differ, but I think it would be worth investigating.
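A small sketch of that idea (illustrative only, not the FLASC implementation; the function name is made up and angles are assumed to be in degrees): shift the angles so the circular mean sits at zero, take the ordinary min/max, then shift back.

```python
import numpy as np
from scipy.stats import circmean

def angular_min_max(angles_deg):
    """Min/max relative to the circular mean: shift angles so the circular
    mean sits at 0 deg, take the ordinary min/max, then shift back."""
    mean = np.degrees(circmean(np.radians(angles_deg)))
    shifted = (np.asarray(angles_deg) - mean + 180.0) % 360.0 - 180.0
    return (shifted.min() + mean) % 360.0, (shifted.max() + mean) % 360.0

print(angular_min_max([358.0, 1.0, 4.0]))  # approximately (358.0, 4.0)
```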
Hi @paulf81, thanks for the pandas updates---they're a great example for me to work from for chained operations! I've added a couple of new versions of `df_movingaverage_new` to your notebook. These handle the standard deviation of a circular quantity, and also match rather closely with @Bartdoekemeijer's previous implementation (see plots at the end of the notebook). `..._new_2` uses scipy's `circstd()` directly with an `apply`, but this is slow, so I broke it down into the basic operations in `..._new_3`, which is considerably faster (I still have to `apply` a `numpy.log` operation, but this does not seem to be too detrimental).
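For reference, the kind of vectorized circular standard deviation being described could look like the following minimal sketch (this is not the notebook code; the function name, the degree/radian conventions, and the rolling-window argument are assumptions):

```python
import numpy as np
import pandas as pd

def rolling_circstd_deg(df_angles, window):
    """Rolling circular standard deviation of angle columns given in degrees.

    Uses the same definition as scipy's circstd():
    R = |mean(exp(i*theta))|, circstd = sqrt(-2 ln R),
    but built from vectorized rolling means instead of an apply().
    """
    rad = np.radians(df_angles)
    sin_mean = np.sin(rad).rolling(window).mean()
    cos_mean = np.cos(rad).rolling(window).mean()
    # Clip guards against R marginally exceeding 1.0 due to floating point.
    R = np.sqrt(sin_mean**2 + cos_mean**2).clip(upper=1.0)
    return np.degrees(np.sqrt(-2.0 * np.log(R)))
```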
I have not made any changes to the `_max`, `_min`, or `_median` operations. What would be the use case for a rolling minimum, maximum, or median?
> I have not made any changes to the `_max`, `_min`, or `_median` operations. What would be the use case for a rolling minimum, maximum, or median?
That's a valid question! I added these since I found similar statistics in existing downsampled SCADA data. They could be helpful in figuring out whether the high-resolution data contains outliers that make the time-averaged mean unreliable, but then again you could also just use the standard deviation for that. Perhaps we can just leave the `min`, `max` and `median` out for the time being, and we can always consider reintroducing them if there's a real need for it?
Great job @misi9170! I just went through the notebook and really liked what you did, and appreciated the new plots at the end, very convincing!!
Then, to @Bartdoekemeijer's points, I would vote in favor of leaving them out (or, I guess more specifically, leaving them "uncircular") and seeing if we hit a need down the road.
@misi9170 Would you mind proposing changes to the code now, following your `df_movingaverage_new_3`?
Thanks Misha, Paul!! Could you quickly check how these changes hold up in terms of speed for large datasets before merging? If that holds up, please feel free to merge it in!
The timing comparisons are very favorable. I tried out a 10-turbine, 2400-row dataset; the time to calculate the mean goes:
16.9 ms -> 6.86 ms
while the stats calculation goes:
24.4 s -> 109 ms (note the unit change)
Also the previous version (when calculating stats) generates a lot of this warning:
```
/Users/pfleming/Projects/FLORIS/flasc/flasc/time_operations.py:112: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  df_ma[cols_reg_stats] = df_roll[cols_regular].agg(funs).copy()
```
I also did a quick double check on correctness and that looked fine to me, so I'm going ahead with the merge once I figure out what test is failing.
OK, all the remaining failed tests appeared to be related to ngrid=5 in the FLORIS input; I pushed this down to 3 and now all checks pass. Merging...
OK, I think the last error is the core.base issue. Merging...
Those timings are really impressive, thanks for pushing this @paulf81 and @misi9170 !!
Fixing two test issues for time_operations.py:
1) Issue caused by `pd.concat`. Solution: specify the index of the second dataframe in the `concat` operation to prevent new rows from being appended. Consequence: passes the test as expected.
2) Issue caused by a pandas version update (changes to `rolling()`). Solution: manually loop through the timeseries to identify start and end indices for the rolling windows. Consequence: passes the test as expected, but now requires a nested for loop which may slow down the code for large datasets.
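To make fix (1) concrete, here is a minimal, hypothetical illustration (not the FLASC code; the frames and the "x_std" column are made up) of why reusing the first frame's index matters in pd.concat(axis=1):

```python
import pandas as pd

# df_a mimics a frame with a non-default index (e.g., a time-based index).
df_a = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=[10, 11, 12])
stats = [[0.1], [0.2], [0.3]]

# Without an explicit index, concat aligns on the default RangeIndex and
# appends new rows filled with NaN (6 rows here).
df_bad = pd.concat([df_a, pd.DataFrame(stats, columns=["x_std"])], axis=1)

# Reusing df_a's index keeps the rows aligned (3 rows, no NaNs).
df_good = pd.concat(
    [df_a, pd.DataFrame(stats, index=df_a.index, columns=["x_std"])], axis=1
)
```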