GAA-UAM / scikit-fda

Functional Data Analysis Python package
https://fda.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Scores for `FDataIrregular` objects #609

Open pcuestas opened 7 months ago

pcuestas commented 7 months ago

Motivation

Computing scores between FDataIrregular objects is a missing functionality of the package, and it can be useful when measuring the quality of conversions from irregular objects to basis representation.

Desired functionality

Compute scores when both y_true and y_pred are FDataIrregular objects.

How to implement each score?

There is a big problem when implementing scores for FDataIrregular: the mean of an FDataIrregular object is not well defined. Most of the scores (for FData objects) involve computing the mean of an FData object.

We can work around this issue in some cases, namely when we want the "uniform_average" of the score and not the "raw_values". An example where we can avoid computing the mean is mean_absolute_error. The mean absolute error is defined this way:

$$MAE(t) = \frac{1}{\sum w_i}\sum_{i=1}^N w_i |X_i(t) - \hat X_i(t)|.$$

To avoid having to calculate the mean of the FDataIrregular when multioutput="uniform_average", we can change the order of the mean and the integral. That is, instead of:

$$\frac{1}{V}\int_D MAE(t)\ dt$$

We can use:

$$\frac{1}{\sum w_i}\sum_{i=1}^N w_i \frac{1}{V_i}\int_{D_i} |X_i(t) - \hat X_i(t)|\ dt$$

Where $D_i$ and $V_i$ correspond to the domain of the $i$-th irregular curve and its Lebesgue measure, respectively. I am not sure whether using $D_i$ and $V_i$ instead of the whole domain $D$ and its volume $V$ is the best choice. Perhaps it would be less confusing to not bother computing the $V_i$'s, but I believe that the result would be less accurate, as it would implicitly give more weight to curves that have more spread-out points.
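To make the effect of the two normalizations concrete, here is a minimal NumPy sketch. It is not the scikit-fda implementation: the `irregular_mae` and `trapezoid` helpers are hypothetical, each curve is represented as a plain `(points, values)` pair, both datasets are assumed to be sampled at the same points curve by curve, and the integrals are approximated with the trapezoidal rule over the observed points only.

```python
import numpy as np


def trapezoid(t, y):
    """Trapezoidal-rule integral of samples y taken at sorted points t."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))


def irregular_mae(curves_true, curves_pred, weights=None, domain_measure=None):
    """Weighted mean of per-curve normalized integrals of |X_i - X_hat_i|.

    Each curve is a (points, values) pair; both datasets are assumed to be
    sampled at the same points, curve by curve (an assumption of this sketch).
    If domain_measure is None, curve i is normalized by V_i = max(t) - min(t);
    otherwise every curve is normalized by the given common measure V.
    """
    n = len(curves_true)
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    per_curve = np.empty(n)
    for i, ((t, x), (_, x_hat)) in enumerate(zip(curves_true, curves_pred)):
        t = np.asarray(t, dtype=float)
        err = np.abs(np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float))
        v = (t[-1] - t[0]) if domain_measure is None else domain_measure
        per_curve[i] = trapezoid(t, err) / v
    return float(np.sum(w * per_curve) / np.sum(w))


# Two curves on D = [0, 2]: one sampled over all of D, one only on [0, 0.5].
y_true = [([0.0, 1.0, 2.0], [0.0, 0.0, 0.0]), ([0.0, 0.5], [0.0, 0.0])]
y_pred = [([0.0, 1.0, 2.0], [1.0, 1.0, 1.0]), ([0.0, 0.5], [3.0, 3.0])]

mae_vi = irregular_mae(y_true, y_pred)                     # normalize by V_i
mae_v = irregular_mae(y_true, y_pred, domain_measure=2.0)  # normalize by V
```

In this toy example the second curve has the larger pointwise error but is observed on a short interval. Dividing by $V_i$ lets it contribute its full average error, whereas dividing by the common $V$ shrinks its contribution, which is the weighting effect described above.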

This idea can be applied to mean_absolute_error, mean_absolute_percentage_error, mean_squared_error and mean_squared_log_error. I am going to implement these in feature/scoring-fdatairregular.

r2_score

I believe that the r2_score cannot be implemented for the FDataIrregular case: by definition, it compares how well y_pred predicts the values of y_true with how well the mean does, and the mean is not defined.

A possible implementation of r2_score for FDataIrregular objects would be to just compute the r2_score of (y_true.values, y_pred.values). However, I do not think this is a good option, as it disregards the functional structure of the curves: it ignores the points where they are measured, and the mean of the values does not have the same meaning as in the other cases (FDataGrid and FDataBasis). Moreover, a user can call r2_score(y_true.values, y_pred.values) explicitly, so I do not think we should implement this score for irregular data, as it is not properly defined.
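As an illustration of why the pooled computation ignores the functional structure, here is a minimal NumPy sketch of a plain $R^2$ over the observed values. The `pooled_r2` helper is hypothetical and assumes scalar-valued curves whose observations have been flattened into one array; it is not part of scikit-fda.

```python
import numpy as np


def pooled_r2(values_true, values_pred):
    """Plain R^2 over all pooled observations (no functional structure)."""
    y = np.asarray(values_true, dtype=float)
    y_hat = np.asarray(values_pred, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)   # spread around the pooled mean
    return float(1.0 - ss_res / ss_tot)


# Pooled observations of all curves; the measurement points never appear.
values_true = np.array([0.0, 1.0, 2.0, 3.0])
values_pred = np.array([0.0, 1.0, 2.0, 2.0])

r2 = pooled_r2(values_true, values_pred)
```

Because the measurement points are not an input, the result is the same whether the observations are tightly clustered or spread across the domain, and the reference "mean" is a single pooled number rather than a mean function.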

The case of explained_variance_score is very similar to that of r2_score.

ooodragon94 commented 7 months ago

Hi, thank you for opening this issue. I think this is another method that is not well defined for FDataIrregular.

I'm trying to apply FPCA using this code. https://fda.readthedocs.io/en/stable/auto_examples/plot_fpca_inverse_transform_outl_detection.html#sphx-glr-auto-examples-plot-fpca-inverse-transform-outl-detection-py

I have functions from R^3 to R.

Can FPCA be implemented for FDataIrregular too?

(Or should I open another issue?)

pcuestas commented 7 months ago

Hello, @ooodragon94.

As I understand, your case is very different from the one I outlined in this issue. There are ways to implement FPCA for irregular data, but we haven't implemented that yet, as FDataIrregular is a very recent addition to the package. You should definitely open another issue explaining the type of data that you have and what you want to do in detail.

The development efforts tend to be steered towards what users request, so it will be very useful to know what you would like to have in the package.

pcuestas commented 4 months ago

After discussing this issue with @vnmabus and Alberto Suárez, we concluded that the integral of a functional data object should always be the integral over its domain $D$, and not over the interval bounded by the endpoints of the discretization grid (called $D_i$ in the original issue description). This is discussed in depth in #619.

In https://github.com/GAA-UAM/scikit-fda/pull/610, I have implemented the changes proposed in the original issue description; that is, dividing each integral by the measure $V_i$ of the smallest interval $D_i$ that contains the $i$-th curve's discretization points:

$$MAE = \frac{1}{\sum w_i}\sum_{i=1}^N w_i \frac{1}{V_i}\int_{D_i} |X_i(t) - \hat X_i(t)|\ dt.$$

However, once the integral of discretized datasets is properly defined over the domain of the functional data object (see #619), these scores must be redefined so that the integrals are divided by the domain's measure $V$ instead of $V_i$. For example, the MAE formula will be:

$$MAE = \frac{1}{\sum w_i}\sum_{i=1}^N w_i \frac{1}{V}\int_D |X_i(t) - \hat X_i(t)|\ dt.$$