First basic preprocessing module

BSchilperoort commented 1 year ago

@jannesvaningen and @semvijverberg have a look! The workings of the module are demonstrated in the tutorial_preprocessing notebook

Currently implemented are:

A preprocessor class, which (on .fit):

Calculates the rolling mean of the data
Computes and stores the (linear) trend of the rolling-mean data.
Computes and stores the climatology of detrended rolling-mean data

When calling .transform

The input data is detrended using the stored trend
The climatology is subtracted from the detrended data

Example:

from s2spy import preprocess
preprocessor = preprocess.Preprocessor(
    rolling_window_size=25,
    detrend="linear",
    remove_climatology=True,
)

processed_data = preprocessor.fit_transform(input_data)

review-notebook-app[bot] commented 1 year ago

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

BSchilperoort commented 1 year ago

Overall I like where this is going, but in terms of design, I was hoping for something more composable.

Thanks for your comments @Peter9192 . Yang did have something akin to your example in mind, and it would be quite easy to implement.

However, I have not seen the need for this yet. Currently there are 4 input arguments for the entire preprocessor, which is sufficient to cover all we need. The rolling mean is always applied first by default, as @semvijverberg said this is best practice that users should do anyway. Note that the rolling mean is not applied when calling .transform.

I have not heard of any additional preprocessors that would need to be implemented (ones that need a .fit method at least), and preprocessing such as decadal/monthly sums can't have a fit method, just transform. At that point we would just be making a method.transform wrapper for xarray built-in methods.

Peter9192 commented 1 year ago

I have not heard of any additional preprocessors that would need to be implemented

What about dimensionality reduction (RGDR or something different) or feature extraction

At that point we would just be making a method.transform wrapper for xarray built-in methods.

I think this is exactly what we want. It will give us the possibility to control the things that can or cannot be set, and how they are implemented. With xarray you can do almost anything, and you can easily do it "wrong". I thought the idea for us was to provide a consensus/best-practice implementation, with a higher-level (more constrained) interface.

Also, we want things to be modular, right? Putting everything together sounds to me like you're addressing a single use case.

semvijverberg commented 1 year ago

Thanks for setting this up.

*I may not completely follow the reasoning of every detail in the conversation, but I wanted to share some thoughts I have.

First some jargon. With pre-processing, I and many climate scientists usually refer to only detrending and deseasonalizing, and not e.g. feature extraction methods. We often call those dimensionality reduction methods or clustering methods and I suggest that we want to stick to that convention. Pre-processing can also refer to handling NaNs and interpolation methods. Roughly speaking, it should encompass preparing the data before you feed it to (the first step of) your analysis.

The rolling mean is the most computational heavy job for the pre-processing. You want to avoid doing that twice, which I believed was the motivation to put everything under one class.
Like @BSchilperoort mentioned, (in principle) the deseasonalizing and detrending should be done on rolling mean (smoothened) data (if data is daily/weekly). I thought I also did my detrending on the smoothened data, but I recently discovered I did not do this, oeps. Anyway, I still think we should detrend on smoothened data, but there should be some visual verification on the implications (raw vs smoothened) and checking if it works as desired.
I think you know this but to be sure, an awesome new feature is that we allow the detrending/deseasonalizing to be done using only the training data.

BSchilperoort commented 1 year ago

Thanks for your input, @semvijverberg. I did have an talk with Peter before the holidays where we discussed why I implemented it this way. I think we're all mostly on the same page now.

One point Peter did raise, which I do agree with, is that the current name Preprocessor is not ideal. As you yourself say:

Pre-processing can also refer to handling NaNs and interpolation methods. Roughly speaking, it should encompass preparing the data before you feed it to (the first step of) your analysis.

In the same sense, resampling to the Calendar system is also a form of preprocessing.

Would you (or @jannesvaningen ) know of a good name to give this detrending+deseasonalizing step (instead of the current generic Preprocessor)?

semvijverberg commented 1 year ago

Thanks for your input, @semvijverberg. I did have an talk with Peter before the holidays where we discussed why I implemented it this way. I think we're all mostly on the same page now.

One point Peter did raise, which I do agree with, is that the current name Preprocessor is not ideal. As you yourself say:

Pre-processing can also refer to handling NaNs and interpolation methods. Roughly speaking, it should encompass preparing the data before you feed it to (the first step of) your analysis.

In the same sense, resampling to the Calendar system is also a form of preprocessing.

Would you (or @jannesvaningen ) know of a good name to give this detrending+deseasonalizing step (instead of the current generic Preprocessor)?

Yes, this is an excellent point. To be honest, I have no good answer. Perhaps simply class detrend_deseasonalize?

Peter9192 commented 1 year ago

Would you (or @jannesvaningen ) know of a good name to give this detrending+deseasonalizing step (instead of the current generic Preprocessor)?

Yes, this is an excellent point. To be honest, I have no good answer. Perhaps simply class detrend_deseasonalize?

What about "normalize" or "standardize" perhaps in conjunction with "TimeSeries", or a shortened "TS". Classnames could be TSNormalize or StandardizeTS, for example.

jannesvaningen commented 1 year ago

I am currently struggling with the same type of issues while adding functionality to the legacy code for out-of-sample preprocessing... I think you want to give the user the option to do one, two or all of the three methods that we mentioned, but only in a specific order:

rolling mean
deseasonalize
detrend

The user would have the freedom to do only detrending, but not do detrending and then do deseasonalizing. The aim here is to maintain the best practices. I have no idea how you could best accompany this in code. Do you agree? And if so, any ideas?

For the naming: in econometrics, estimating a trend, cycle or seasonality in a timeseries is called 'timeseries decomposition'. But this only concerns the 'fit' part of this exercise, not the 'transform' part. As far as I know, there is no good general word for all of the transform steps (detrend, deseasonalize) since they are considered separately (if there is no seasonality, why deseasonalize?). What @Peter9192 suggests comes close, but standardization or normalization can also be confusing because these are used in a context where the ts is scaled compared to its standard deviation.

If you follow all the steps you are left with a timeseries that should be stationary or integrated of order 0. The procedure to get such a timeseries is referred to as differencing.

So maybe TSDecompose or TSDifferencing?

Peter9192 commented 1 year ago

The user would have the freedom to do only detrending, but not do detrending and then do deseasonalizing. The aim here is to maintain the best practices. I have no idea how you could best accompany this in code. Do you agree? And if so, any ideas?

I think this should not be too difficult. The interface (what it looks like to the user) could be very different, but here's an example implementation:


class Workflow:
    def __init__(self):
        self._tasks = {}

    def add_task(self, task):
        self._tasks[task.name] = task

    def execute(self):
        order = ['A', 'B', 'C']
        for task in order:
            if task in self._tasks:
                self._tasks[task].execute()

class Task:
    def __init__(self, name):
        self.name = name

    def execute(self):
        print(f"Executing {self.name}")

workflow = Workflow()
workflow.add_task(Task('C'))
workflow.add_task(Task('B'))
workflow.execute()
> Executing B
> Executing C

semvijverberg commented 1 year ago

Would you (or @jannesvaningen ) know of a good name to give this detrending+deseasonalizing step (instead of the current generic Preprocessor)?

Yes, this is an excellent point. To be honest, I have no good answer. Perhaps simply class detrend_deseasonalize?

What about "normalize" or "standardize" perhaps in conjunction with "TimeSeries", or a shortened "TS". Classnames could be TSNormalize or StandardizeTS, for example.

I agree with @jannesvaningen, words like "normalize" and "standardize" refers to scaling timeseries. Removing the seasonal cycle is definitely different.

semvijverberg commented 1 year ago

If you follow all the steps you are left with a timeseries that should be stationary or integrated of order 0. The procedure to get such a timeseries is referred to as differencing.

So maybe TSDecompose or TSDifferencing?

The word 'Decompose' is also used by statsmodels. They use the word 'decompose' referring to both quantify the trend + seasonal cycle within one function. Btw, I think the statsmodels implementation is not well documented. I'm not sure what happens under the hood.

TSDifferencing is a well-known approach to create a stationary timeseries (no trend, stationary variability), but it is something very different from what we are doing. I find the method always a bit funky and not physically logical.

geek-yang commented 1 year ago

Just performed a test, it shows that by making the rolling window to be 1, it still performs rolling mean calculation, which is not computationally efficient. We should skip the rolling mean operation when the window is set to be None. Thanks for your notice about this issue @BSchilperoort 👍.

jannesvaningen commented 1 year ago

Nice work Bart! I added some comments, hope it wasn't too much. I added them as single comments, looks a bit messy now, sorry for that. I love how concise and elegant this preprocessor is. And it works fast and smoothly!

I know this is only the first basic preprocessing module, but I hope we can remind ourselves of all the steps that were described in the preprocessing discussion. I would love to also see implemented:

spatial selection (so selecting a certain spatial extent with lat and lon)
regridding
apply land/sea mask

BSchilperoort commented 1 year ago

Nice work Bart! I added some comments, hope it wasn't too much. I added them as single comments, looks a bit messy now, sorry for that. I love how concise and elegant this preprocessor is. And it works fast and smoothly!

Thanks for the review! Don't worry about the single comments. It's only a little bit less compact than a review with multiple comments.

geek-yang commented 1 year ago

Nice work Bart! I added some comments, hope it wasn't too much. I added them as single comments, looks a bit messy now, sorry for that. I love how concise and elegant this preprocessor is. And it works fast and smoothly!

Thanks for your review @jannesvaningen! I think it helps to identify a few issues that could be addressed to improve the preprocessor.

I know this is only the first basic preprocessing module, but I hope we can remind ourselves of all the steps that were described in the preprocessing discussion. I would love to also see implemented:

spatial selection (so selecting a certain spatial extent with lat and lon)

regridding

apply land/sea mask

Thanks for your suggestions. We've thought about this when designing the preprocessor. But given that the users are asked to prepare their data as data arrays, these steps could be easily addressed by calling the xarray (or xesmf) methods, for instance:

spatial selection da.sel(lat=slice(20, 30), lon=slice(110, 140))
for regridding, we can do simple interpolation with da.interp() or da.interp_like(), and for complex ones with xesmf (e.g. regridder = xe.Regridder(ds, ds_out, "bilinear") and then ds_out = regridder(ds))
apply land/sea mask da.where(np.isnan(mask))

I think we agree to keep this preprocessor light weight, so it is not very attractive to me to re-implement all these steps rather than adding a notebook to showcase everything. But it maybe useful to add all these steps in one place (e.g. a function to receive a dict incl. all the preprocessing steps, just like a workflow builder), which makes it easier for the user. I would like to support it in later iterations if necessary, depending on real usecases.

geek-yang commented 1 year ago

Nice work Bart! I added some comments, hope it wasn't too much. I added them as single comments, looks a bit messy now, sorry for that. I love how concise and elegant this preprocessor is. And it works fast and smoothly!

I know this is only the first basic preprocessing module, but I hope we can remind ourselves of all the steps that were described in the preprocessing discussion. I would love to also see implemented:

spatial selection (so selecting a certain spatial extent with lat and lon)

regridding

apply land/sea mask

A new issue #161 has been created to document these suggestions.

sonarcloud[bot] commented 1 year ago

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
2 Code Smells

98.0% Coverage
0.0% Duplication

geek-yang commented 1 year ago

Nice! Thanks for making the changes Yang. I have suggested some small modifications 😄

I also thought it would be nice (and more complete) to expose the stored parameters as @propertys. E.g.:
@property
def detrend(self):
    return self._detrend
etc. for the other parameters. This makes it easier to inspect this item later, for users who are for example working in a notebook.

We could also open an issue for this, and then also add a nicer repr which tells you the settings (and if the preprocessor is fit already, with an is_fit property).

Thanks for your review @BSchilperoort . Yes, I agree that we need to have a repr to inform the user what they have created. I just create an issue #162 for it. Let's do it in another PR.

semvijverberg commented 1 year ago

Cool! Thanks @BSchilperoort and @geek-yang!

AI4S2S / s2spy

First basic preprocessing module #152

Currently implemented are:

Example: