Open alejandro-ariza opened 4 years ago
Maybe you should try using the recently added Tensor
basis (https://fda.readthedocs.io/en/latest/modules/autosummary/skfda.representation.basis.Tensor.html#skfda.representation.basis.Tensor), which represent a multivariable function in the basis of tensor products of univariate basis functions.
Multivariable functions is currently an area not as tested and explored as univariate functions in scikit-fda, so if you have any problem do not hesitate to contact me again.
May I also ask why are you doing this?
# get discretised functional data
data = fetch_aemet()['data'].data_matrix
fdgrid = FDataGrid(data)
AEMET data has dim_domain=1
and dim_codomain=3
, but as you are passing the data_matrix
to FDataGrid
and not the sample_points
, it has (wrongly) assumed that the extra dimension of the matrix is in the domain (probably I should consider this a bug). Is this something you intended as an example, or are you actually trying to use the AEMET dataset?
Hi @vnmabus , thanks for your reply.
I haven't checked yet the Tensor option, but regarding your last comment:
The AEMET data was just an example to illustrate the problem. I'm just trying to get a fdbasis object from multivariable data with the following array shape: 1st dim : observations 2nd dim : samples 3rd dim : variables
In any case, I just updated scikit-fda, and I got a different error, so still not working. Here it goes:
# import skfda modules and AEMET data example
from skfda.representation import basis, FDataGrid
from skfda.datasets import fetch_aemet
# EXAMPLE 1-VARIABLE DATA:
## get discretised functional data
data = fetch_aemet()['data'].data_matrix[:,:,0]
fdgrid = FDataGrid(data)
## get bspline basis representation of data
bspline = basis.BSpline(n_basis=10)
fdbasis = fdgrid.to_basis(bspline)
# this works great
# EXAMPLE 3-VARIABLE DATA:
## get discretised functional data
data = fetch_aemet()['data'].data_matrix
fdgrid = FDataGrid(data)
## get bspline basis representation of data
bspline = basis.BSpline(n_basis=10)
fdbasis = fdgrid.to_basis(bspline)
# returns: ValueError: The domain of the function has dimension 2 but the domain of the basis has dimension 1
Not sure how I should provide a bspline basis of dimension 2?
Not sure either I understood what you said about passing the "sample_points" to FDatagrid. I believed that FDatagrid accepts data itself, not data arguments. Please, let me know if I am wrong. I might be taking a wrong path to approximate multivariable data in a basis system...
Ok, let me answer your questions.
The AEMET data was just an example to illustrate the problem. I'm just trying to get a fdbasis object from multivariable data with the following array shape: 1st dim : observations 2nd dim : samples 3rd dim : variables
The FDataGrid
attribute data_matrix
for a dataset of functions has shape (N, D_1, ..., D_p, q), where:
n_samples
for compatibility with the naming used in scikit-learn).len(s) for s in sample_points
).
Not sure how I should provide a bspline basis of dimension 2?
I am not sure if there is another way to define a bspline basis of dimension 2, but you can use the Tensor
basis to define a basis for multivariable functions using the tensor product of several univariable basis. For example:
from skfda.representation.basis import Tensor, BSpline
basis_x = BSpline(n_basis=3)
basis_y = BSpline(n_basis=2)
basis = Tensor([basis_x, basis_y])
In this case the elements of the basis would be of the form a[i](x) * b[j](y)
, where a[i]
is the i-th element of the basis basis_x
and b[j]
is the j-th element of the basis basis_y
.
Not sure either I understood what you said about passing the "sample_points" to FDatagrid. I believed that FDatagrid accepts data itself, not data arguments. Please, let me know if I am wrong. I might be taking a wrong path to approximate multivariable data in a basis system...
You can pass attributes such as sample_points
in FDataGrid
construction. In fact, if you do not pass sample_points
for vector valued functions (as per your comment) it seems to assume that the last dimension of data_matrix
is a domain dimension instead of a codomain one (I see that as a bug, I will try to fix this).
Thanks for clarifications, I often struggle with terminology.
So, coming back to the question: How to approximate a 3-variable dataset to a functional data object?
I think I understood the utility of Tensor, but not clear how to implement it when one of the basis system corresponds to the "along-variables" dimension (if we can call that a dimension). I will illustrate the problem, once again with the AEMET data:
# EXAMPLE 3-VARIABLE AEMET DATA (temperature, precipitation, wind):
## get discretised functional data
data = fetch_aemet()['data'].data_matrix
fdgrid = FDataGrid(data)
## get bspline basis representation of data
basis_x = basis.BSpline((0, 365), n_basis=10) # time dimension (365 days)
basis_y = # ??? # variable dimension (3 ?)
tensor = basis.Tensor([basis_x, basis_y])
fdbasis = fdgrid.to_basis(tensor)
Here basis_x
refers to a 10-basis system to approximate the variable function along the dimension "time", in this case 365 days.
I don't know how to approach the basis_y
, though. If instead of temperature-precipitation-wind, we had only temperature at 3 different station heights, we would be talking about the physical dimension "height", which we could approximate with a BSpline basis, I guess? But these are different variables, not the same variable over two dimensions (time and height). So, how can I account for multivariable data with a functional approach? I need to have the three variables parameterized in the same functional data object, so that I can perform further analysis later on (ANOVA, PCA, clustering, etc.) based on the three variables, not just one.
I can not find my way out. Is this currently possible with scikit-FDA? If so, could you please illustrate this with this data example?
Thanks a lot in advance, sorry for troubling you too much!
But the AEMET dataset has only one dimension in the domain (time) and three dimensions in the codomain (temperature, precipitation, wind). In this case you can use the VectorValued
basis (https://fda.readthedocs.io/en/latest/modules/autosummary/skfda.representation.basis.VectorValued.html) to represent this.
import skfda
from skfda.representation.basis import BSpline, VectorValued
## get discretised functional data
fdgrid = skfda.datasets.fetch_aemet()['data']
## get bspline basis representation of data
basis_temp = BSpline((0, 365), n_basis=10)
basis_prec = BSpline((0, 365), n_basis=10)
basis_wind = BSpline((0, 365), n_basis=10)
basis = VectorValued([basis_temp, basis_prec, basis_wind])
fdbasis = fdgrid.to_basis(basis)
The Tensor
basis, however, is useful when you have multiple domain dimensions (for example surfaces or images in the bivariate case).
Once again the scalar functions are better tested than vector valued functions, so reach me if you encounter any problem.
Thanks @vnmabus, this is getting interesting since I actually have to deal with both scenarios:
multiple variables running over a common dimension ( here approached with VectorValued
).
same variable running over multiple dimensions (here approached with Tensor
).
I think I managed to deal with the first scenario, but running into problems with the second one. It seems to me it's to do with bugs or not-implemented features. I have prepared an script, working with the same data example, so you and the audience can follow up:
# import AEMET data example and modules
from skfda.datasets import fetch_aemet
from skfda.representation import FDataGrid
from skfda.representation.basis import BSpline, VectorValued, Tensor
import numpy as np
#------------------------------------------------------------------------------
# EXAMPLE: 1-DIMENSION & 3-VARIABLES (working)
# get discretised functional data
data = fetch_aemet()['data'].data_matrix # temperature, precipitation, wind
arguments = np.arange(365) # 365 days
fdgrid = FDataGrid(data, arguments)
# get a 10-bspline basis representation for each data variable
basis_temp = BSpline((0, 365), 10)
basis_prec = BSpline((0, 365), 10)
basis_wind = BSpline((0, 365), 10)
vvbasis = VectorValued([basis_temp, basis_prec, basis_wind])
# convert discretised functional data into a basis object with:
# 1 dimension domain : time (365 days)
# 3 dimension codomains: 3 variables running over the same dimension
# coefficient's shape : (73, 30) = (73, 10 basis * 3 variables). Goood!
fdbasis = fdgrid.to_basis(vvbasis)
#------------------------------------------------------------------------------
# EXAMPLE: 2-DIMENSIONS & 1-VARIABLE (not working)
# create new (fake) data, as it was temperature measured over:
# 73 stations (1st axis) | n_samples
# 365 days (2nd axis) | 1st dimension
# 20 heigths from ground (3rd axis) | 2nd dimension
data_ = data[:, :, 0]
for i in range(1, 20):
data_ = np.dstack((data_, data[:, :, 0] - i))
arguments_d = np.arange(365)
arguments_h = np.arange(20)
# Get discretised functional data:
# Here I had to figure out how to pass the 2 dimensions arguments. It
# seems if not working since I get the following inconsistent parameters:
# * dim_domain : 2
# * dim_codomain : 1
# * data_matrix's shape: (73, 365, 20, 1)
fdgrid_ = FDataGrid(data_, (arguments_d, arguments_h))
# Get tensor with 10-bspline basis representation of data for each dimension
basis_d = BSpline((0, 365), 10)
basis_h = BSpline((0, 20), 10)
tbasis = Tensor([basis_d, basis_h])
# Convert discretised functional data into a basis object with:
# 2 dimension domains : time (365 days), and height (20 meters)
# 1 dimension codomain: 1 variable running over the 2 dimensions
# coefficient's shape : (73, 100) != (73, 10 basis * 2 dimensions). Wrong?
# Not working since it inherits the same inconsistencies as in fdgrid_
fdbasis_ = fdgrid_.to_basis(tbasis)
Is there an alternative path to deal with multidimensional variables or I just need to wait for further implementations?
Thanks in advance.
Sorry, but in your example I think that it is working as intended. Why do you think that the parameters are inconsistent? The domain dimension is 2, as intended, and the shape of the matrix is right (n_samples, len(arguments_d), len(arguments_h), dim_codomain)
. Try plotting the first function to see if it has the shape that you want.
And in the basis case, the number of the basis required for representing the tensor product of the two bases is the product of the number of elements in each basis. For example, if the first basis was {1, x, x**2}
and the second was {1, sin(y), cos(y)}
, the elements of the multivariable basis using the tensor product would be {1, sin(y), cos(y), x, x * sin(y), x * cos(y), x**2, x**2 * sin(y), x**2 * cos(y)}
, that is, 9 functions and not 6. Again you can try plotting the first function to see the effect.
This is in contrast with vector valued functions, in which the number of elements of the basis would be simply added, as the basis would be {1 * (1, 0), x * (1, 0), x**2 * (1, 0), 1 * (0, 1), sin(y) * (0, 1), cos(y) * (0, 1)}
, where (1, 0)
and (0, 1)
are the unitary vectors in the x and y axes.
Finally, I think that for multivariable functions are more basis that can be used instead of the tensor product, such as radial basis functions (currently not implemented). The code is structured so that adding a new basis should be easy. Essentially you only have to provide the basis evaluation, and it should work (although there are several methods and multimethods that you can implement for the new basis to improve the performance). Thus, if you have an specific basis in mind you can implement it, and if it is general enough I could include it in the package.
OK, got it. It seems nothing was wrong except me. As usual in scikit-FDA issues!
Hopefully your detailed explanations will be worth for other lost scikit-fda users apart of me? Sorry if I troubled you too much, it's just not as clear for me with the current documentation and my limited knowledge in FDA. In my case, real data examples does much better job than technical documentation, I often get "lost in translation" with domains, codomains, tensors, etc.
So, fdbasis.coefficients
never looks like a 3D array, as fdgrid.data_matrix
in the multivariable case. Instead, they are always 2D arrays with the number rows corresponding to the number of samples, and the number of columns (basis) equivalent to the sum of codomain's lengths (in case of the multivariable approach VectorValued
), or the product of domain's lengths (in the case of the multidimensional approach Tensor
). Is that right?
Thanks!
Yes, coefficients
is always a 2D matrix. The rows are the functional observations and the columns are the elements of the basis (which for cases where Tensor
or VectorValued
is used, it is computed as you said).
Do not worry about the nomenclature. One thing that I have been observing during the development of the package is that apart from some standard names such as "domain" or "codomain" each branch of math tends to use their own nomenclature (or some things are not named at all), and it is often not very clear what do you mean when you say "multivariate functions" for example. I am trying to choose names that are reasonably clear for everyone, but it is a work in progress, and sometimes I feel they are not clear enough 😅. I am open to discuss naming (now while we are on versions 0.x) if anyone can argue in favour of other nomenclature.
For me this is more like a self-learning process, I'm not an authority on the subject, so you'd better not follow my recommendations! I guess for those with my profile, —people wanting to apply statistics in a given field—, what helps a lot is a robust gallery with applied FDA (in economics, medical science, ecology, etc.). As long as you see an analysis routine you are familiar with, you'll get sorted out despite the hard terminology. It might be early for that now, as you are still on version 0.x, but I will be happy to contribute with applied routines on real data (in ecology) if you decide to further develop this idea.
In any case, thanks Carlos for your patience, even though it haven't been fruitful for the repo at the end.
Sounds like a great idea! I wanted to organize the examples/tutorials from a long time ago, but I usually priorize adding features and cleaning the code. I think it would be very useful to maybe create a section of examples/tutorials applied to particular branches of knowledge. Also, if your area has well known datasets or data sources maybe we can add functions to fetch the data, so it can be used in examples.
Sounds good, let me check permissions and prepare metadata, I will provide you some nice examples. I'll get back to you in another thread.
Cheers!
Hi there,
I'm trying to get a bspline basis approximation for multivariable data. I'm having a
NotImplementedError
like the one illustrated in this example:Just wondering whether this will be implemented any time soon, or whether you propose a workaround I'm not aware of?
Currently I tried to get basis representations for every dimension, and stack the resulting coefficient matrices along the third axis in a new fdbasis object. Something like this:
The problem is that I'm not able to edit other parameters of the fdbasis object, such as:
Obviously, this incoherence between the coefficient matrix's shape and the parameters above result in
ValueError
s, such as "cannot reshape array of size xxxxxx into shape (xx, xxxx, x)", when further functional data operations are done.Any advice or implementation in the short term?
Thanks for this brilliant and promising repository,
Alejandro.