multivariable vs multidimensional basis representation

GAA-UAM / scikit-fda

Functional Data Analysis Python package

https://fda.readthedocs.io

BSD 3-Clause "New" or "Revised" License

298 stars 53 forks source link

multivariable vs multidimensional basis representation #255

Open alejandro-ariza opened 4 years ago

alejandro-ariza commented 4 years ago

Hi there,

I'm trying to get a bspline basis approximation for multivariable data. I'm having a NotImplementedError like the one illustrated in this example:

# import skfda modules and AEMET data example
from skfda.representation import basis, FDataGrid
from skfda.datasets import fetch_aemet

# get discretised functional data
data    = fetch_aemet()['data'].data_matrix
fdgrid  = FDataGrid(data)

# get bspline basis representation of data
bspline = basis.BSpline(n_basis=10)
fdbasis = fdgrid.to_basis(bspline)

# returns: NotImplementedError: Only support 1 dimension on the domain.

Just wondering whether this will be implemented any time soon, or whether you propose a workaround I'm not aware of?

Currently I tried to get basis representations for every dimension, and stack the resulting coefficient matrices along the third axis in a new fdbasis object. Something like this:

fdbasis.coefficients = np.dstack((fdbasis1.coefficients, fdbasis2.coefficients, fdbasis3.coefficients))

The problem is that I'm not able to edit other parameters of the fdbasis object, such as:

fdbasis.ndim = 3
# returns: AttributeError: can't set attribute
fdbasis.dim_codomain = 3
# returns: AttributeError: can't set attribute

Obviously, this incoherence between the coefficient matrix's shape and the parameters above result in ValueErrors, such as "cannot reshape array of size xxxxxx into shape (xx, xxxx, x)", when further functional data operations are done.

Any advice or implementation in the short term?

Thanks for this brilliant and promising repository,

Alejandro.

vnmabus commented 4 years ago

Maybe you should try using the recently added Tensor basis (https://fda.readthedocs.io/en/latest/modules/autosummary/skfda.representation.basis.Tensor.html#skfda.representation.basis.Tensor), which represent a multivariable function in the basis of tensor products of univariate basis functions.

Multivariable functions is currently an area not as tested and explored as univariate functions in scikit-fda, so if you have any problem do not hesitate to contact me again.

vnmabus commented 4 years ago

May I also ask why are you doing this?

# get discretised functional data
data    = fetch_aemet()['data'].data_matrix
fdgrid  = FDataGrid(data)

AEMET data has dim_domain=1 and dim_codomain=3, but as you are passing the data_matrix to FDataGrid and not the sample_points, it has (wrongly) assumed that the extra dimension of the matrix is in the domain (probably I should consider this a bug). Is this something you intended as an example, or are you actually trying to use the AEMET dataset?

alejandro-ariza commented 4 years ago

Hi @vnmabus , thanks for your reply.

I haven't checked yet the Tensor option, but regarding your last comment:

The AEMET data was just an example to illustrate the problem. I'm just trying to get a fdbasis object from multivariable data with the following array shape: 1st dim : observations 2nd dim : samples 3rd dim : variables

In any case, I just updated scikit-fda, and I got a different error, so still not working. Here it goes:

# import skfda modules and AEMET data example
from skfda.representation import basis, FDataGrid
from skfda.datasets import fetch_aemet

# EXAMPLE 1-VARIABLE DATA:

## get discretised functional data
data    = fetch_aemet()['data'].data_matrix[:,:,0]
fdgrid  = FDataGrid(data)

## get bspline basis representation of data
bspline = basis.BSpline(n_basis=10)
fdbasis = fdgrid.to_basis(bspline)
# this works great

# EXAMPLE 3-VARIABLE DATA:

## get discretised functional data
data    = fetch_aemet()['data'].data_matrix
fdgrid  = FDataGrid(data)

## get bspline basis representation of data
bspline = basis.BSpline(n_basis=10)
fdbasis = fdgrid.to_basis(bspline)
# returns: ValueError: The domain of the function has dimension 2 but the domain of the basis has dimension 1

Not sure how I should provide a bspline basis of dimension 2?

Not sure either I understood what you said about passing the "sample_points" to FDatagrid. I believed that FDatagrid accepts data itself, not data arguments. Please, let me know if I am wrong. I might be taking a wrong path to approximate multivariable data in a basis system...

vnmabus commented 4 years ago

Ok, let me answer your questions.

The AEMET data was just an example to illustrate the problem. I'm just trying to get a fdbasis object from multivariable data with the following array shape: 1st dim : observations 2nd dim : samples 3rd dim : variables

The FDataGrid attribute data_matrix for a dataset of functions $f: \mathbb{R}^p \to \mathbb{R}^q$ has shape (N, D_1, ..., D_p, q), where:

N is the number of functions in the dataset (called number of observations in Statistics and number of samples in Machine Learning, we use n_samples for compatibility with the naming used in scikit-learn).
D_1, ... D_p are the number of discretization points for each dimension (len(s) for s in sample_points).
- q is the number of coordinates of the output (1 for scalar valued functions, or greater for vector valued functions)

Not sure how I should provide a bspline basis of dimension 2?

I am not sure if there is another way to define a bspline basis of dimension 2, but you can use the Tensor basis to define a basis for multivariable functions using the tensor product of several univariable basis. For example:

from skfda.representation.basis import Tensor, BSpline

basis_x = BSpline(n_basis=3)
basis_y = BSpline(n_basis=2)

basis = Tensor([basis_x, basis_y])

In this case the elements of the basis would be of the form a[i](x) * b[j](y), where a[i] is the i-th element of the basis basis_x and b[j] is the j-th element of the basis basis_y.

Not sure either I understood what you said about passing the "sample_points" to FDatagrid. I believed that FDatagrid accepts data itself, not data arguments. Please, let me know if I am wrong. I might be taking a wrong path to approximate multivariable data in a basis system...

You can pass attributes such as sample_points in FDataGrid construction. In fact, if you do not pass sample_points for vector valued functions (as per your comment) it seems to assume that the last dimension of data_matrix is a domain dimension instead of a codomain one (I see that as a bug, I will try to fix this).

alejandro-ariza commented 4 years ago

Thanks for clarifications, I often struggle with terminology.

So, coming back to the question: How to approximate a 3-variable dataset to a functional data object?

I think I understood the utility of Tensor, but not clear how to implement it when one of the basis system corresponds to the "along-variables" dimension (if we can call that a dimension). I will illustrate the problem, once again with the AEMET data:

# EXAMPLE 3-VARIABLE AEMET DATA (temperature, precipitation, wind):

## get discretised functional data
data    = fetch_aemet()['data'].data_matrix  
fdgrid  = FDataGrid(data)

## get bspline basis representation of data
basis_x = basis.BSpline((0, 365), n_basis=10)  # time dimension     (365 days)
basis_y = # ???                                # variable dimension (3 ?)                                    
tensor  = basis.Tensor([basis_x, basis_y])
fdbasis = fdgrid.to_basis(tensor)

Here basis_x refers to a 10-basis system to approximate the variable function along the dimension "time", in this case 365 days.

I don't know how to approach the basis_y, though. If instead of temperature-precipitation-wind, we had only temperature at 3 different station heights, we would be talking about the physical dimension "height", which we could approximate with a BSpline basis, I guess? But these are different variables, not the same variable over two dimensions (time and height). So, how can I account for multivariable data with a functional approach? I need to have the three variables parameterized in the same functional data object, so that I can perform further analysis later on (ANOVA, PCA, clustering, etc.) based on the three variables, not just one.

I can not find my way out. Is this currently possible with scikit-FDA? If so, could you please illustrate this with this data example?

Thanks a lot in advance, sorry for troubling you too much!

vnmabus commented 4 years ago

But the AEMET dataset has only one dimension in the domain (time) and three dimensions in the codomain (temperature, precipitation, wind). In this case you can use the VectorValued basis (https://fda.readthedocs.io/en/latest/modules/autosummary/skfda.representation.basis.VectorValued.html) to represent this.

import skfda
from skfda.representation.basis import BSpline, VectorValued

## get discretised functional data
fdgrid = skfda.datasets.fetch_aemet()['data']

## get bspline basis representation of data
basis_temp = BSpline((0, 365), n_basis=10)
basis_prec = BSpline((0, 365), n_basis=10)
basis_wind = BSpline((0, 365), n_basis=10)
basis = VectorValued([basis_temp, basis_prec, basis_wind])
fdbasis = fdgrid.to_basis(basis)

The Tensor basis, however, is useful when you have multiple domain dimensions (for example surfaces or images in the bivariate case).

Once again the scalar functions are better tested than vector valued functions, so reach me if you encounter any problem.

alejandro-ariza commented 4 years ago

Thanks @vnmabus, this is getting interesting since I actually have to deal with both scenarios:

multiple variables running over a common dimension ( here approached with VectorValued).
same variable running over multiple dimensions (here approached with Tensor).

I think I managed to deal with the first scenario, but running into problems with the second one. It seems to me it's to do with bugs or not-implemented features. I have prepared an script, working with the same data example, so you and the audience can follow up:

# import AEMET data example and modules 
from skfda.datasets import fetch_aemet
from skfda.representation import FDataGrid
from skfda.representation.basis import BSpline, VectorValued, Tensor
import numpy as np

#------------------------------------------------------------------------------
# EXAMPLE: 1-DIMENSION & 3-VARIABLES (working)

# get discretised functional data
data      = fetch_aemet()['data'].data_matrix # temperature, precipitation, wind
arguments = np.arange(365)                    # 365 days
fdgrid    = FDataGrid(data, arguments)

# get a 10-bspline basis representation for each data variable
basis_temp = BSpline((0, 365), 10)
basis_prec = BSpline((0, 365), 10)
basis_wind = BSpline((0, 365), 10)
vvbasis    = VectorValued([basis_temp, basis_prec, basis_wind])

# convert discretised functional data into a basis object with:
#   1 dimension domain   : time (365 days)
#   3 dimension codomains: 3 variables running over the same dimension
#   coefficient's shape  : (73, 30) = (73, 10 basis * 3 variables). Goood!
fdbasis = fdgrid.to_basis(vvbasis)

#------------------------------------------------------------------------------
# EXAMPLE: 2-DIMENSIONS & 1-VARIABLE (not working)

# create new (fake) data, as it was temperature measured over:
#   73  stations            (1st axis) | n_samples
#   365 days                (2nd axis) | 1st dimension 
#   20  heigths from ground (3rd axis) | 2nd dimension
data_ = data[:, :, 0]
for i in range(1, 20):
    data_  = np.dstack((data_, data[:, :, 0] - i))
arguments_d = np.arange(365)
arguments_h = np.arange(20)

# Get discretised functional data:
# Here I had to figure out how to pass the 2 dimensions arguments. It 
# seems if not working since I get the following inconsistent parameters:
#   * dim_domain         : 2
#   * dim_codomain       : 1
#   * data_matrix's shape: (73, 365, 20, 1)
fdgrid_ = FDataGrid(data_, (arguments_d, arguments_h))

# Get tensor with 10-bspline basis representation of data for each dimension
basis_d = BSpline((0, 365), 10)
basis_h = BSpline((0,  20), 10)
tbasis  = Tensor([basis_d, basis_h])

# Convert discretised functional data into a basis object with:
#   2 dimension domains : time (365 days), and height (20 meters)
#   1 dimension codomain: 1 variable running over the 2 dimensions
#   coefficient's shape : (73, 100) != (73, 10 basis * 2 dimensions). Wrong?
# Not working since it inherits the same inconsistencies as in fdgrid_
fdbasis_ = fdgrid_.to_basis(tbasis)

Is there an alternative path to deal with multidimensional variables or I just need to wait for further implementations?

Thanks in advance.

vnmabus commented 4 years ago

Sorry, but in your example I think that it is working as intended. Why do you think that the parameters are inconsistent? The domain dimension is 2, as intended, and the shape of the matrix is right (n_samples, len(arguments_d), len(arguments_h), dim_codomain). Try plotting the first function to see if it has the shape that you want.

And in the basis case, the number of the basis required for representing the tensor product of the two bases is the product of the number of elements in each basis. For example, if the first basis was {1, x, x**2} and the second was {1, sin(y), cos(y)}, the elements of the multivariable basis using the tensor product would be {1, sin(y), cos(y), x, x * sin(y), x * cos(y), x**2, x**2 * sin(y), x**2 * cos(y)}, that is, 9 functions and not 6. Again you can try plotting the first function to see the effect.

This is in contrast with vector valued functions, in which the number of elements of the basis would be simply added, as the basis would be {1 * (1, 0), x * (1, 0), x**2 * (1, 0), 1 * (0, 1), sin(y) * (0, 1), cos(y) * (0, 1)}, where (1, 0) and (0, 1) are the unitary vectors in the x and y axes.

Finally, I think that for multivariable functions are more basis that can be used instead of the tensor product, such as radial basis functions (currently not implemented). The code is structured so that adding a new basis should be easy. Essentially you only have to provide the basis evaluation, and it should work (although there are several methods and multimethods that you can implement for the new basis to improve the performance). Thus, if you have an specific basis in mind you can implement it, and if it is general enough I could include it in the package.

alejandro-ariza commented 4 years ago

OK, got it. It seems nothing was wrong except me. As usual in scikit-FDA issues!

Hopefully your detailed explanations will be worth for other lost scikit-fda users apart of me? Sorry if I troubled you too much, it's just not as clear for me with the current documentation and my limited knowledge in FDA. In my case, real data examples does much better job than technical documentation, I often get "lost in translation" with domains, codomains, tensors, etc.

So, fdbasis.coefficients never looks like a 3D array, as fdgrid.data_matrix in the multivariable case. Instead, they are always 2D arrays with the number rows corresponding to the number of samples, and the number of columns (basis) equivalent to the sum of codomain's lengths (in case of the multivariable approach VectorValued), or the product of domain's lengths (in the case of the multidimensional approach Tensor). Is that right?

Thanks!

vnmabus commented 4 years ago

Yes, coefficients is always a 2D matrix. The rows are the functional observations and the columns are the elements of the basis (which for cases where Tensor or VectorValued is used, it is computed as you said).

Do not worry about the nomenclature. One thing that I have been observing during the development of the package is that apart from some standard names such as "domain" or "codomain" each branch of math tends to use their own nomenclature (or some things are not named at all), and it is often not very clear what do you mean when you say "multivariate functions" for example. I am trying to choose names that are reasonably clear for everyone, but it is a work in progress, and sometimes I feel they are not clear enough 😅. I am open to discuss naming (now while we are on versions 0.x) if anyone can argue in favour of other nomenclature.

alejandro-ariza commented 4 years ago

For me this is more like a self-learning process, I'm not an authority on the subject, so you'd better not follow my recommendations! I guess for those with my profile, —people wanting to apply statistics in a given field—, what helps a lot is a robust gallery with applied FDA (in economics, medical science, ecology, etc.). As long as you see an analysis routine you are familiar with, you'll get sorted out despite the hard terminology. It might be early for that now, as you are still on version 0.x, but I will be happy to contribute with applied routines on real data (in ecology) if you decide to further develop this idea.

In any case, thanks Carlos for your patience, even though it haven't been fruitful for the repo at the end.

vnmabus commented 4 years ago

Sounds like a great idea! I wanted to organize the examples/tutorials from a long time ago, but I usually priorize adding features and cleaning the code. I think it would be very useful to maybe create a section of examples/tutorials applied to particular branches of knowledge. Also, if your area has well known datasets or data sources maybe we can add functions to fetch the data, so it can be used in examples.

alejandro-ariza commented 4 years ago

Sounds good, let me check permissions and prepare metadata, I will provide you some nice examples. I'll get back to you in another thread.

Cheers!