GAA-UAM / scikit-fda

Functional Data Analysis Python package
https://fda.readthedocs.io
BSD 3-Clause "New" or "Revised" License
309 stars 58 forks source link

FisherRaoElasticRegistration fit with template only #528

Closed ego-thales closed 1 year ago

ego-thales commented 1 year ago

Enables using FisheRaoElasticRegistration(template=template).fit() without any X data if a template was provided.

vnmabus commented 1 year ago

May I ask which use case do you have in mind for this functionality?

ego-thales commented 1 year ago

If one only has a unique template, then FisherRaoElasticRegistration().fit(template) works fine. But it is not the case if one uses multiple templates. A use case might be for example having a family of samples X, and a transformation t -> X(t) which could be a small deformation, and you could imagine wanting to observe the overall influence of t on the amplitude of X. Just an idea, there might be lots of others.

vnmabus commented 1 year ago

Sorry, maybe I have not been clear. I am asking what would be the purpose of having a call to fit() without data. If you want to just transform one group of curves, you can use fit_transform(). If you have a train/test partition you have data to be used in fit() (even if the call does nothing). Am I misunderstanding something?

ego-thales commented 1 year ago

The main point is syntactical. The word "fit" makes the user believe that something is happening when it's not. So it would seem unnatural / misleading (to me), for example, to write

>>> # Call `fit_transform` everytime
>>> op = FisherRaoElasticRegistration(template=template)
>>> for t in range(N):
...     transformed = op.fit_transform(X[t])
...     # Do something

or even

>>> # Call `fit` once on random data
>>> op = FisherRaoElasticRegistration(template=template)
>>> op.fit(template_or_whatever)  # Does nothing with `template_or_whatever` since `op` already has `self.template_`
>>> for t in range(N):
...     transformed = op.transform(X[t])
...     # Do something

Instead, I think the user would expect the following syntax to be working, acknowledging that the fit() call is necessary for API coherence:

>>> # More natural syntax
>>> op = FisherRaoElasticRegistration(template=template).fit()
>>> for t in range(N):
...     transformed = op.transform(X[t])
...     # Do something

This is certainly how I encountered the problem the first time: trying the third syntax and getting errors. I thought "wait, does FisherRaoElasticRegistration not do what I thought it would do? What does op.fit(whatever) do with whatever?". Because of this, I had to go back to docs, and then to source code to verify that fit actually behaved like I expected, and didn't "logically" need any data.

Finally, I think it improves conciseness:

>>> op = FisherRaoElasticRegistration(template=template)
>>> op.fit(template_or_whatever)

takes 2 lines, and

>>> op = FisherRaoElasticRegistration(template=template).fit(template_or_whatever)

looks redundant while probably often going past 80 characters with indentation.

Thanks for your answers by the way, and more broadly, for this package ♥️

vnmabus commented 1 year ago

I see your concerns. It seems that when you pass a template, FisherRaoElasticRegistration becomes a stateless transformer. In that case, and following scikit-learn, we should probably allow to NOT call fit (but be able to do so for additional validation). This requires that in that case we compute some things in transform (without storing them) if they were not previously stored in a call to fit. Note that scikit-learn still recommends to use fit_transform in that case.

Although the changes are likely to be small, I would need a test that checks this case of not calling fit, to prevent future regressions.

I prefer not calling fit (rather than calling it without parameters). The later pattern is not used in scikit-learn and would raise more questions. Moreover, from a typing perspective, it would allow to pass None as X, which is something that we do not want.

As an additional warning, note that if we do this, the template_ attribute will not be created if you do not call fit.

Also: skip the loop, you can call transform using all the data at once (and it will be faster).

ego-thales commented 1 year ago

Thanks for your inputs.

I agree that allowing None isn't ideal, so I think I also prefer being able to call transform (and fit_transform) without calling fit before. But still, fit is useful when users don't have template precomputed, so should it really be stateless?

I guess that in this case, much more needs to be done in __init__ since no new attribute is supposed to be added in fit according to scikit-learn.

Finally, the example I gave could not really skip the loop since for all we know, X[t+1] might depend on X[t].

vnmabus commented 1 year ago

I agree that allowing None isn't ideal, so I think I also prefer being able to call transform (and fit_transform) without calling fit before. But still, fit is useful when users don't have template precomputed, so should it really be stateless?

It would be stateless just in that case. For a non-precomputed template you will be required to call fit.

I guess that in this case, much more needs to be done in __init__ since no new attribute is supposed to be added in fit according to scikit-learn.

Doing work in __init__ is against scikit-learn API. For the stateless case, work has to be done in transform, without storing attributes. If you want to store the attributes you need to call fit or fit_transform.

Finally, the example I gave could not really skip the loop since for all we know, X[t+1] might depend on X[t].

How? I do not understand you here.

ego-thales commented 1 year ago

Doing work in __init__ is against scikit-learn API. For the stateless case, work has to be done in transform, without storing attributes. If you want to store the attributes you need to call fit or fit_transform.

Ok, it makes sense. I may get around updating the PR in accordance to our discussion if I have the time.

Finally, the example I gave could not really skip the loop since for all we know, X[t+1] might depend on X[t].

How? I do not understand you here.

Well, you can't rule out this possibility in general. The user might want to use X[t] info to determine what X[t+1] might be. Or you could imagine a while loop instead of a for loop, continually updating X based on its previous state. It is not important here anyway, it was just a side note on possible usage.

Thanks for your time. All the best!

vnmabus commented 1 year ago

I suppose you did not think that it is worthy enough at the end?

ego-thales commented 1 year ago

Sorry, I should have closed with a comment! I actually think it would be nice to implement the way we discussed, but I currently don't have enough time to get around doing this (as you saw, I only made the minor doc modif last week as opposed to last month haha). Since it is not really a big issue, I didn't want to leave an open issue for your project that I wasn't going to fix!