NREL / flasc

A rich floris-driven suite for SCADA analysis
https://nrel.github.io/flasc/
BSD 3-Clause "New" or "Revised" License

Fix issue #21: [BUG] Failing tests due to pandas pd.concat #23

Closed by misi9170 2 years ago

misi9170 commented 2 years ago

Fixing two test issues in time_operations.py:

1) Issue caused by pd.concat. Solution: specify the index of the second dataframe in the concat operation to prevent new rows from being appended. Consequence: the test passes as expected.

2) Issue caused by a pandas version update (changes to rolling()). Solution: manually loop through the timeseries to identify start and end indices for the rolling windows. Consequence: the test passes as expected, but this now requires a nested for loop, which may slow down the code for large datasets.
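
For context, the index-alignment issue boils down to the following minimal sketch (the frames and column names here are illustrative, not the actual FLASC code): pd.concat(axis=1) aligns on index labels, so concatenating a statistics frame built with the default RangeIndex onto a frame with a different index appends rows instead of adding columns.

import pandas as pd

# Frame with a non-default index, plus a derived-statistics frame that
# was built with the default RangeIndex (0, 1, 2).
df_a = pd.DataFrame({"ws": [8.0, 8.5, 9.0]}, index=[10, 11, 12])
stats = pd.DataFrame({"ws_median": [8.0, 8.5, 9.0]})

# Indices don't match, so concat(axis=1) produces 6 rows padded with NaN.
bad = pd.concat([df_a, stats], axis=1)

# Reusing the index of the first frame keeps the row count at 3 and
# simply adds the new column, which is the behavior the test expects.
good = pd.concat(
    [df_a, pd.DataFrame(stats.values, index=df_a.index, columns=stats.columns)],
    axis=1,
)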

Bartdoekemeijer commented 2 years ago

Thanks for tackling this, Misha!

Could we rewrite:

# Rewrite to avoid fragmentation
df_out = pd.concat([df_out, pd.DataFrame(values_median, index=df_out.index, columns=["{:s}_median".format(c) for c in cols_angular])], axis=1)
df_out = pd.concat([df_out, pd.DataFrame(values_min, index=df_out.index, columns=["{:s}_min".format(c) for c in cols_angular])], axis=1)
df_out = pd.concat([df_out, pd.DataFrame(values_max, index=df_out.index, columns=["{:s}_max".format(c) for c in cols_angular])], axis=1)
df_out = pd.concat([df_out, pd.DataFrame(values_std, index=df_out.index, columns=["{:s}_std".format(c) for c in cols_angular])], axis=1)

to

# Rewrite to avoid fragmentation
df_out = pd.concat(
    [
        df_out,
        pd.DataFrame(values_median, index=df_out.index, columns=["{:s}_median".format(c) for c in cols_angular]),
        pd.DataFrame(values_min, index=df_out.index, columns=["{:s}_min".format(c) for c in cols_angular]),
        pd.DataFrame(values_max, index=df_out.index, columns=["{:s}_max".format(c) for c in cols_angular]),
        pd.DataFrame(values_std, index=df_out.index, columns=["{:s}_std".format(c) for c in cols_angular])
    ],
    axis=1
)

This could speed things up a bit, especially if df_out contains a large amount of data. The same pattern could be applied in other parts of the code where df_out is concatenated to multiple times in a row.

It would also be good to benchmark the performance of the fix in time_operations.py for large datasets. My previous solution for finding the data indices corresponding to the time-averaging windows felt very inefficient, but it was much faster than for loops on large datasets, iirc. Time efficiency was a breaking issue when working with high-resolution (1 Hz) data, which is why I stuck to the previous approach. I do feel like there must be a function inside Pandas that just provides that information, but I was not able to find it when I implemented this code...
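
For what it's worth, one possible vectorized way to get those window indices without a Python loop is numpy's searchsorted on the (sorted) time index; this is just a sketch under that assumption, not the approach used in FLASC:

import numpy as np
import pandas as pd

# Sorted 1 Hz timestamps and the edges of the desired averaging windows.
time_index = pd.date_range("2020-01-01", periods=3600, freq="1s")
window_edges = pd.date_range("2020-01-01", periods=7, freq="600s")

# For each window edge, find the position of the first sample >= that
# edge; consecutive pairs then give the [start, end) index range of
# each 10-minute window, with no explicit loop over the timeseries.
positions = np.searchsorted(time_index.values, window_edges.values)
start_idx, end_idx = positions[:-1], positions[1:]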

paulf81 commented 2 years ago

Hi @Bartdoekemeijer and @misi9170, I'd like to jump in here! I think you're right, @Bartdoekemeijer: there are good vectorized pandas functions for a lot of what we do. I wanted to propose working our way through the files and redoing some of this.

I just finished this book: https://store.metasnake.com/effective-pandas-book (there's a short summary of the key concepts at https://www.aidancooper.co.uk/pandas-anti-patterns/), which presents a nicely comprehensive approach to pandas coding that I was hoping to adopt in FLASC. In my own work I've been hitting out-of-memory errors, and it seems like we could make progress by switching to this sort of chained, non-mutating workflow and focusing on vectorized calls. This is maybe broader than this pull request, but I thought it would be good to touch base on what I was thinking.

Maybe we need to start by setting up the time-benchmarking functions and then work the changes through.
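
As a starting point, a time-benchmarking helper could be as simple as the sketch below (a hypothetical benchmark function, not existing FLASC code, run against a synthetic wind-direction frame):

import timeit

import numpy as np
import pandas as pd

def benchmark(func, n_rows=100_000, n_turbines=10, repeats=5):
    # Build a synthetic SCADA-like frame of wind directions at 1 Hz and
    # return the best wall-clock time of a few repeated calls to func.
    df = pd.DataFrame(
        np.random.uniform(0.0, 360.0, (n_rows, n_turbines)),
        columns=["wd_{:03d}".format(i) for i in range(n_turbines)],
        index=pd.date_range("2020-01-01", periods=n_rows, freq="1s"),
    )
    return min(timeit.repeat(lambda: func(df), number=1, repeat=repeats))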

Bartdoekemeijer commented 2 years ago

Thanks for the inputs @paulf81! I'm a big proponent of revisiting the existing code and changing it to vectorized calls as much as we can! Also, I had tried to write everything in a way that prevents memory issues: I think many of the FLASC scripts can be run sequentially, so you can run the same script on chunks of the data, e.g., 1 week or 1 month at a time. Though again, happy to have that code revisited and to think this through together!

paulf81 commented 2 years ago

Thanks @Bartdoekemeijer! Yeah, I definitely didn't mean to say there were any bugs; I was just thinking of opportunities to move our pandas usage away from patterns that I think can lead to subtle issues. An example is the "setting a value on a copy" warning: we typically add a .copy(), but I believe there is a better approach. I just uploaded a notebook where I propose a replacement for the moving average function. I think I've cut the time down somewhat, and by following these new guidelines I think I've also worked in some general best practices. What I was trying to do:

1) No mutations, i.e., no code that changes an existing dataframe. For example,

df['new_col'] = x

is, I think, a mutation.

2) Chained operations, so that all temporary memory allocations are released when the chain ends with the assignment of a dataframe.

3) Using whole-dataframe operations with built-in functions wherever possible, rather than apply calls of specific functions.

Anyway, I'm still feeling my way through all this new material, but my understanding is that, for example, the "setting a value on a copy" warning indicates possible fragility: the behavior of the code can be unpredictable, whereas this style of coding should never have those issues.
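
For concreteness, here is a generic sketch of the style being described (illustrative column names, not FLASC code), contrasting a mutation with a chained, vectorized pipeline:

import pandas as pd

df = pd.DataFrame({"ws": [7.9, 8.1, 8.4], "pow": [1500.0, 1580.0, 1655.0]})

# Mutating style: edits df in place, and can trigger the
# SettingWithCopyWarning when df is itself a slice of another frame.
df["pow_norm"] = df["pow"] / df["pow"].max()

# Chained style: each step returns a new frame, intermediate results are
# released once df_out is assigned, and only vectorized calls are used.
df_out = (
    df
    .assign(pow_norm=lambda d: d["pow"] / d["pow"].max())
    .query("ws > 8.0")
    .reset_index(drop=True)
)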

If you don't mind looking over the notebook (@Bartdoekemeijer and @misi9170), I would be curious what you think!!

Bartdoekemeijer commented 2 years ago

Thanks for doing this work, @paulf81. These speed-ups and better usage of Pandas will surely help a lot!

Scrolling through the notebook, I see there are a couple of differences between the solutions from your method and my old one.

  1. When I calculate the _min and _max values for the angular columns, I first shift all the data so that it is centered at zero around the mean. For example, if you have an array of [358.0, 1.0, 4.0], then the mean value is 1.0, the min is 358.0 and the max is 4.0 in my approach. I think this makes sense in the angular sense: the min and max are the values furthest to the left and to the right of the mean on the unit circle (a small numpy sketch of this idea follows after this list). If you take the simple min and max of the array, I can foresee issues with, for example, the following array: [350.0, 359.0, 1.0, 10.0]. I think this explains the difference in those values between your approach and my old one. I'd love to hear your thoughts on this.
  2. I am not sure why the std values differ, but I think it would be worth investigating.
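
A minimal numpy sketch of that re-centering idea (not the actual FLASC implementation): compute the circular mean, wrap each sample into a signed offset in [-180, 180) degrees around it, and take the min/max of the offsets.

import numpy as np

def circular_min_max(angles_deg):
    # Circular mean from the mean resultant vector of the unit vectors.
    rad = np.deg2rad(angles_deg)
    mean = np.rad2deg(np.arctan2(np.sin(rad).mean(), np.cos(rad).mean())) % 360.0

    # Signed offsets from the mean, wrapped into [-180, 180) degrees.
    offsets = (np.asarray(angles_deg) - mean + 180.0) % 360.0 - 180.0
    return (mean + offsets.min()) % 360.0, (mean + offsets.max()) % 360.0

circular_min_max([358.0, 1.0, 4.0])  # approximately (358.0, 4.0)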

misi9170 commented 2 years ago

Hi @paulf81, thanks for the pandas updates---they're a great example for me to work from for chained operations! I've added a couple of new versions of df_movingaverage_new to your notebook. These handle the standard deviation of a circular quantity and match rather closely with @Bartdoekemeijer's previous implementation (see the plots at the end of the notebook). ..._new_2 uses scipy's circstd() directly with an apply, but this is slow, so I broke it down into the basic operations in ...new_3, which is considerably faster (I still have to apply a numpy.log operation, but this does not seem to be too detrimental).
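
For reference, the underlying formula (the same one scipy.stats.circstd uses, std = sqrt(-2 ln R) with R the mean resultant length) can be written with vectorized rolling operations roughly as in the sketch below; this is only an illustration of the idea, not necessarily the exact code in the notebook. The np.log call is the "numpy.log operation" mentioned above.

import numpy as np
import pandas as pd

def rolling_circ_std_deg(series_deg, window):
    # Rolling circular standard deviation in degrees: sqrt(-2 * ln(R)),
    # where R is the length of the mean resultant vector in each window.
    rad = np.deg2rad(series_deg)
    sin_mean = np.sin(rad).rolling(window).mean()
    cos_mean = np.cos(rad).rolling(window).mean()
    r = np.sqrt(sin_mean**2 + cos_mean**2)
    return np.rad2deg(np.sqrt(-2.0 * np.log(r)))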

I have not made any changes to the _max, _min, or _median operations. What would be the use case for a rolling minimum, maximum, or median?

Bartdoekemeijer commented 2 years ago

I have not made any changes to the _max, _min, or _median operations. What would be the use case for a rolling minimum, maximum, or median?

That's a valid question! I added these since I found similar statistics in existing downsampled SCADA data. They could be helpful for figuring out whether the high-resolution data contained outliers that make the time-averaged mean unreliable, but then again you could also just use the standard deviation for that. Perhaps we can leave the min, max and median out for the time being, and we can always reintroduce them if there's a real need for them?

paulf81 commented 2 years ago

Great job @misi9170! I just went through the notebook and really liked what you did, and I appreciated the new plots at the end; very convincing!!

Then, regarding @Bartdoekemeijer's points, I would vote in favor of leaving them out (or, I guess, more specifically leaving them "uncircular") and seeing if we hit a need down the road.

@misi9170 Would you mind proposing changes to the code now, following your df_movingaverage_new_3?

Bartdoekemeijer commented 2 years ago

Thanks Misha, Paul!! Could you quickly check how these changes hold up in terms of speed for large datasets before merging? If that looks consistent, please feel free to merge!

paulf81 commented 2 years ago

The timing comparisons are very favorable. I tried a 10-turbine, 2400-row set; the time to calculate the mean goes:

16.9 ms -> 6.86 ms

while the stats calculation goes

24.4 s -> 109 ms (note the unit change)

Also, the previous version (when calculating stats) generates a lot of this warning:

/Users/pfleming/Projects/FLORIS/flasc/flasc/time_operations.py:112: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
df_ma[cols_reg_stats] = df_roll[cols_regular].agg(funs).copy()

I also did a quick double-check on correctness and it looked fine to me, so I'm going ahead with the merge once I figure out which test is failing.

paulf81 commented 2 years ago

Ok, all the remaining failed tests appear to be related to ngrid=5 in the FLORIS input. I pushed this down to 3 and now all checks pass. Merging...

paulf81 commented 2 years ago

Ok, I think the last error is the core.base issue. Merging...

Bartdoekemeijer commented 2 years ago

Those timings are really impressive, thanks for pushing this @paulf81 and @misi9170 !!