Documentation: Loading pipelines for inference

joel-odlund commented 3 years ago

Reading about saving and loading, I find it hard to understand how to save and load a model in order to use it for inference. In particular, it's not clear to me how the setup() phase relates to saving and loading.

This page gives an overview of the lifecycle of a model.
Here it is implied that 'setup()' occurs after load. However, it does not seem like the setup method is being called anywhere, except when fitting a model.

It would be nice to have documentation on how the lifecycle works for inference. For example:

Am I supposed to call setup() manually, after load? If so,
- how do I recursively set up a pipeline without implementing it myself?
- is setup supposed to be run on a loaded state? This requires care, as one could easily overwrite loaded state in setup. This should be documented.
Am I supposed to run setup() before load?
- this seems unlikely looking at the flowchart
- it does not work with the 'load()' method in ExecutionContext, which returns a new instance
Am I supposed to not run setup() before inference?
- I suspect this is the idea
- it would be nice to have documentation on this, with an example. It has implications for how saving, loading and setup need to be written.
- I can't make sense of the flowchart if this is the case.

vincent-antaki commented 3 years ago

Hello Joel!

Here is some useful information with regards to your question:

You usually shouldn't have to call setup manually, although you can if you want to.
You are not supposed to run setup before load.
It is expected that, on the first call of setup, the attribute self.is_initialized (which is False after the constructor call) is set to True. If you want to avoid overwriting behaviour on a setup call, all you need to do is add a condition over self.is_initialized before the sensitive block of code.
All steps that have children (MetaSteps and TruncableSteps) are supposed to recursively call setup on their children, with the notable exception of Pipeline instances.
Pipeline instances are a special case: the call to setup on their children is delayed until right before fit is called on those children. This is important for steps that can't be set up right away (e.g. when some intermediate value needs to be computed by a previous step before their setup method is called).
In case Pipeline is not the root of your ML pipeline, then calling setup manually (or adding a call to it in your fit function) is expected. Regardless, setup will be called on fit when a Pipeline instance is reached.
In the current design of saving and loading, all steps are supposed to have setup called before saving. If self.is_initialized is False (i.e. the setup method has not been called) for a given step of the pipeline at the moment of saving, a call to the setup method will be forced. Note that this may change at some point, as we intend that self.is_initialized == True will no longer be required to save a step, thus allowing save even if the pipeline hasn't had a fit call yet (see #470).
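The is_initialized guard described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `ToyStep` class that only mimics the pattern; it is not Neuraxle's BaseStep, and the attribute names `weights` and `setup_runs` are made up for the example.

```python
class ToyStep:
    """Hypothetical stand-in for a pipeline step; not Neuraxle's BaseStep."""

    def __init__(self):
        self.is_initialized = False  # False right after the constructor
        self.weights = None
        self.setup_runs = 0  # counts how often the sensitive block ran

    def setup(self, context=None):
        # Guard the sensitive block so a later setup call cannot overwrite
        # state that was fitted or loaded in the meantime.
        if not self.is_initialized:
            self.weights = [0.0, 0.0, 0.0]
            self.setup_runs += 1
            self.is_initialized = True
        return self


step = ToyStep()
step.setup()
step.weights = [1.0, 2.0, 3.0]  # pretend these were fitted or loaded
step.setup()  # second call skips the guarded block
```

With the guard in place, the second setup call leaves `weights` untouched, which is the behaviour you want if setup can be triggered more than once.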

Side note: I think the flowchart may be getting a bit old.

Specifically with regards to your 3 options, the third one is the intended usage. It is expected that setup is only called through pipeline fit calls. From there, here are a couple of options you have:

You could write a custom Saver (see BaseSaver) for your step that needs setup after load.
You could add a call to setup in your transform function.
You could use an apply call (e.g. pipeline.apply('setup', context=context)) to force a setup call on every step in the pipeline. I'd recommend avoiding this option though, as a step's setup function might be called multiple times (once through the apply, and again through its parents' apply calls).
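As a rough sketch of the second option (calling setup from transform), assuming a hypothetical `LazySetupStep` class and using a JSON round trip as a stand-in for the real save and load machinery. None of the names below are Neuraxle API.

```python
import json


class LazySetupStep:
    """Hypothetical step; not a Neuraxle class."""

    def __init__(self, scale):
        self.scale = scale  # light state, saved with the step
        self.lookup = None  # heavy state, rebuilt by setup()
        self.is_initialized = False

    def setup(self):
        if not self.is_initialized:
            # Stand-in for something expensive, e.g. loading word vectors.
            self.lookup = {i: i * self.scale for i in range(5)}
            self.is_initialized = True
        return self

    def to_json(self):
        # Only the light hyperparameters are written out.
        return json.dumps({"scale": self.scale})

    @classmethod
    def from_json(cls, payload):
        # Returns a fresh instance on which setup has not run yet.
        return cls(**json.loads(payload))

    def transform(self, data_inputs):
        self.setup()  # does nothing if the step is already set up
        return [self.lookup[x] for x in data_inputs]


trained = LazySetupStep(scale=10).setup()
loaded = LazySetupStep.from_json(trained.to_json())  # stand-in for save + load
outputs = loaded.transform([1, 2, 3])
```

The trade-off is that every transform call pays for the is_initialized check, but a freshly loaded step works for inference without any fit call.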

Overall, I agree with you that setup is poorly documented and might need to be revisited eventually.

Feel free to ask more questions if you have any, I'll be glad to help you. Cheers!

joel-odlund commented 3 years ago

Thank you. This brings some clarity, I will try out some of these ideas. Some questions come to mind that might be useful if you decide to revisit the documentation.

If setup is only done when fitting, why is setup needed at all? Why not initialize things as part of fit?
If setup is not intended to be run after load, does it mean that all heavy state, such as word vectors, is intended to be serialized in the pipeline and then loaded back again? This has some implications when training many models and duplicating data that could otherwise be shared. It also implies some code duplication between load and setup, which could be annoying.

You give some nice alternatives for when and how to use the setup method. I do think, however, that it's important for Neuraxle and its wider adoption that there is one idiomatic way of doing it, which is well understood and documented. For example, my current task is to use Neuraxle to implement a general purpose ML environment. Most individual components will not be written by me, but by others on the team who may not have intimate knowledge of Neuraxle, and they will expect clear instructions on how to do things, and why.

I really think Neuraxle is something the community needs, that's why I'm bringing up these things. It's not to complain :)

vincent-antaki commented 3 years ago

Your input is greatly appreciated. When spending all day using the framework, we can sometimes lose sight of how other users would approach the various concepts and abstractions (well, I don't know about @guillaume-chevalier, but that's more than certainly my case). The framework is in constant evolution and sometimes its development is tailored to whatever specific project we have; this may lead us to have blind spots, or at least a biased priority queue. Comments like yours are essential for us to keep a healthy list of what needs to be done, both in terms of documentation and code.

For your first question, it is first and foremost a question of proper function encapsulation. setup is called before fit and doesn't serve the same purpose. Furthermore, I think setup used to be called after load, and that behaviour was changed during a project a while ago. I think Guillaume may have more information on that specific design choice.

As for your second question, we don't expect users to serialize heavy stuff. Usually, we recommend that heavy and/or shared stuff be handled through the ExecutionContext's services. This is another part of the code which might not be well explained yet, as it is fairly recent. I'll refer you to the code since it's rather straightforward, but once again, feel free to ask questions if you have any.
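To illustrate the idea of keeping heavy, shared objects in a context rather than in the steps, here is a minimal sketch. `ToyContext`, `WordVectors` and `EmbedStep` are all hypothetical names invented for this example; the real ExecutionContext service API may look different, so check the code before relying on any of these signatures.

```python
class ToyContext:
    """Hypothetical stand-in for a context that holds shared services."""

    def __init__(self):
        self._services = {}

    def register_service(self, key, service):
        self._services[key] = service
        return self

    def get_service(self, key):
        return self._services[key]


class WordVectors:
    """Hypothetical heavy resource, created once per process."""

    def __init__(self):
        self.vectors = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}


class EmbedStep:
    """Holds no heavy state itself, so saving it stays cheap."""

    def transform(self, words, context):
        # Look the shared resource up at call time instead of storing it.
        vectors = context.get_service(WordVectors).vectors
        return [vectors[w] for w in words]


context = ToyContext().register_service(WordVectors, WordVectors())
step_a, step_b = EmbedStep(), EmbedStep()
out_a = step_a.transform(["cat"], context)
out_b = step_b.transform(["dog"], context)
```

Both steps read from the same `WordVectors` instance, so training many models does not duplicate the data, and nothing heavy ends up in the saved steps.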

vincent-antaki commented 3 years ago

Hey @joel-odlund!

I brought up some questions to Guillaume about the design choices for the setup function and we've concluded that we'll be revisiting it for the next release (0.5.8). Things will differ quite a bit from what I've told you so far although it will not change anything with regards to the save/load aspect of your initial question.

This will be the expected behaviour for setup after the modifications:

There will be no recursive setup calls.
Setup calls will be removed from pipeline fit calls.
Setup calls will be added as the first step of all handler functions execution (handle_fit, handle_fit_transform, handle_transform).
The setup calls will be conditional on the self.is_initialized attribute. Thus setup is expected to be executed only once. Coding overwriting behaviour as part of setup should therefore not be problematic.
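The planned behaviour listed above could look roughly like this. It is only a sketch under assumptions: `ToyHandledStep` and its `_setup_if_needed` helper are hypothetical, the handler signatures are simplified (no context argument), and the actual implementation in the release may differ.

```python
class ToyHandledStep:
    """Hypothetical step whose handlers trigger setup; not Neuraxle code."""

    def __init__(self):
        self.is_initialized = False
        self.setup_calls = 0
        self.offset = None

    def setup(self):
        self.setup_calls += 1
        self.offset = 0
        self.is_initialized = True

    def _setup_if_needed(self):
        # Hypothetical helper: the condition lives in the handlers, so
        # setup itself can freely overwrite state.
        if not self.is_initialized:
            self.setup()

    def handle_fit(self, data_inputs):
        self._setup_if_needed()
        self.offset = min(data_inputs)
        return self

    def handle_transform(self, data_inputs):
        self._setup_if_needed()
        return [x - self.offset for x in data_inputs]


step = ToyHandledStep()
step.handle_fit([3, 5, 9])
result = step.handle_transform([3, 5, 9])
```

Because the check happens in every handler, a step that is only ever used for transform still gets set up exactly once, on its first call.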

Documentation will be changed accordingly. Please do not hesitate to give me your thoughts about that change if you feel like it.

Cheers!

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs in the next 180 days. Thank you for your contributions.

Neuraxio / Neuraxle

Documentation: Loading pipelines for inference #481