Neuraxio / Neuraxle

The world's cleanest AutoML library ✨ - Do hyperparameter tuning with the right pipeline abstractions to write clean deep learning production pipelines. Let your pipeline steps have hyperparameter spaces. Design steps in your pipeline like components. Compatible with Scikit-Learn, TensorFlow, and most other libraries, frameworks and MLOps environments.
https://www.neuraxle.org/
Apache License 2.0

Feature: Additional arguments to fit method in BaseStep #526

Closed: subramaniam20jan closed this issue 4 months ago

subramaniam20jan commented 2 years ago

The problem: Currently, the Neuraxle BaseStep has a fit method signature with only two parameters (data_inputs, expected_outputs). In libraries like Keras, it is possible to pass additional arguments to the fit method, such as a validation generator when the main data_inputs is a data generator as well.

This means that wrapping a Keras model that takes two data generators in a subclass of BaseStep is not a straightforward implementation.

Solution: It would be extremely useful if an additional **kwargs were added to the BaseStep fit method (in one or more of the mixin classes) to enable passing arbitrary arguments to custom estimator implementations.
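
For reference, this is the kind of call the wrapped step would need to forward to Keras (a minimal, self-contained sketch with toy data; validation_data is the extra argument in question):

import numpy as np
import tensorflow as tf

# Toy model; validation_data is the kind of extra keras.Model.fit argument
# that a Neuraxle step's two-parameter fit signature cannot forward today.
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

x_train, y_train = np.random.rand(32, 4), np.random.rand(32, 1)
x_val, y_val = np.random.rand(8, 4), np.random.rand(8, 1)

model.fit(x_train, y_train, epochs=1, validation_data=(x_val, y_val))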

guillaume-chevalier commented 2 years ago

You can use the ExecutionContext and the DataContainer to do this. You may want to take a look at the handle_fit and handle_transform methods.
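
A rough sketch of that path (constructor defaults and return values may differ between Neuraxle versions, and Identity here is just a stand-in for a custom step): the DataContainer carries data_inputs and expected_outputs, the ExecutionContext carries execution state, and handle_fit receives both.

import numpy as np
from neuraxle.base import ExecutionContext, Identity
from neuraxle.data_container import DataContainer

# Build the DataContainer and ExecutionContext by hand instead of letting
# Pipeline.fit create them; handle_fit receives both objects.
dact = DataContainer(data_inputs=np.random.rand(8, 4), expected_outputs=np.random.rand(8, 1))
cx = ExecutionContext()

step = Identity()  # stand-in for a custom step that overrides the handler methods
fitted_step = step.handle_fit(dact, cx)  # exact return value varies across Neuraxle versions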

subramaniam20jan commented 2 years ago

I am not sure I understand how this can be used without too much confusion.

What I want:

from typing import List

import neuraxle.base
import neuraxle.hyperparams.space
from neuraxle.base import BaseStep
from neuraxle.pipeline import Pipeline

class KerasNeuraxleWrapper(BaseStep):
    def __init__(self,
                 model,
                 hyperparams: neuraxle.hyperparams.space.HyperparameterSamples = None,
                 hyperparams_space: neuraxle.hyperparams.space.HyperparameterSpace = None,
                 name: str = None,
                 savers: List[neuraxle.base.BaseSaver] = None,
                 hashers: List[neuraxle.base.BaseHasher] = None):
        self.model = model
        super().__init__(
            hyperparams=hyperparams,
            hyperparams_space=hyperparams_space,
            name=name,
            savers=savers,
            hashers=hashers,
        )

    def fit(self, data_input, expected_output=None, **kwargs) -> 'KerasNeuraxleWrapper':
        # keras.Model.fit returns a History object, so keep the model as-is
        # and return self, as Neuraxle expects fit to return the fitted step.
        self.model.fit(x=data_input, y=expected_output, **kwargs)
        return self

    def transform(self, data):
        # Keras models expose predict rather than transform.
        return self.model.predict(data)

km = KerasNeuraxleWrapper(keras_model)
pipe = Pipeline([km])
# Desired call: forward validation_data through the pipeline down to keras.Model.fit.
pipe.fit(input_data_generator, expected_outputs=None, validation_data=validation_data_generator)

The current flow I see is as follows:

pipeline.fit(data_input, expected_output) -> pipeline.fit_data_container(DACT(data_input, expected_output)) -> _FittableStep.handle_fit(dact, cx) -> pipeline._fit_data_container(dact, cx) -> for each step: step.handle_fit(dact, cx)

Both the ExecutionContext and the DataContainer instances are generated inside the base Pipeline class. The best solution (hack) given the current setup could be to pass a dictionary as the data_input and then unpack it as follows.

from typing import List

import neuraxle.base
import neuraxle.hyperparams.space
from neuraxle.base import BaseStep
from neuraxle.pipeline import Pipeline

class KerasNeuraxleWrapper(BaseStep):
    def __init__(self,
                 model,
                 hyperparams: neuraxle.hyperparams.space.HyperparameterSamples = None,
                 hyperparams_space: neuraxle.hyperparams.space.HyperparameterSpace = None,
                 name: str = None,
                 savers: List[neuraxle.base.BaseSaver] = None,
                 hashers: List[neuraxle.base.BaseHasher] = None):
        self.model = model
        super().__init__(
            hyperparams=hyperparams,
            hyperparams_space=hyperparams_space,
            name=name,
            savers=savers,
            hashers=hashers,
        )

    def fit(self, data_input, expected_output=None) -> 'KerasNeuraxleWrapper':
        # data_input is a dict carrying x plus any extra keras.Model.fit arguments.
        self.model.fit(y=expected_output, **data_input)
        return self

    def transform(self, data):
        # Keras models expose predict rather than transform.
        return self.model.predict(data)

km = KerasNeuraxleWrapper(keras_model)
pipe = Pipeline([km])
pipe.fit(
    data_inputs={"x": input_data_generator, "validation_data": validation_data_generator},
    expected_outputs=None,
)

But this is far from an elegant solution.

Am I missing something here though?

guillaume-chevalier commented 2 years ago

You probably want to mix together what's done in these examples:

It is recommended that you override the _fit_data_container method rather than the fit method for your use case. Refer to this for overriding that method, and to its transform equivalent as well:
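
A rough sketch of that approach applied to the wrapper above (the validation_data attribute is a hypothetical addition set at construction time rather than passed through fit, and handler-method signatures may vary slightly between Neuraxle versions):

from neuraxle.base import BaseStep, ExecutionContext
from neuraxle.data_container import DataContainer


class KerasNeuraxleWrapper(BaseStep):
    def __init__(self, model, validation_data=None):
        super().__init__()
        self.model = model
        # Hypothetical: carry the validation generator on the step itself
        # instead of threading it through Pipeline.fit.
        self.validation_data = validation_data

    def _fit_data_container(self, data_container: DataContainer, context: ExecutionContext) -> 'BaseStep':
        # The handler method has the full DataContainer and ExecutionContext,
        # so extra keras.Model.fit arguments can come from the step's own state.
        self.model.fit(
            x=data_container.data_inputs,
            y=data_container.expected_outputs,
            validation_data=self.validation_data,
        )
        return self

    def _transform_data_container(self, data_container: DataContainer, context: ExecutionContext) -> DataContainer:
        data_container.set_data_inputs(self.model.predict(data_container.data_inputs))
        return data_container

This keeps Pipeline.fit's two-argument signature intact while still giving keras.Model.fit its extra arguments.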

guillaume-chevalier commented 2 years ago

Note that you will later probably need a saver to make your pipeline serializable. Here is some inspiration on how to do it properly:
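
As a rough sketch of that saver idea (not taken from the linked example; the BaseSaver method names follow neuraxle.base, but context.get_path() and the HDF5 file name are assumptions here), one could persist the Keras model separately and strip it from the step before the default pickling:

import os

import tensorflow as tf
from neuraxle.base import BaseSaver, BaseStep, ExecutionContext


class KerasModelSaver(BaseSaver):
    """Hypothetical saver: write the Keras model to disk on its own so the
    rest of the step stays picklable by Neuraxle's default savers."""

    def save_step(self, step: BaseStep, context: ExecutionContext) -> BaseStep:
        path = os.path.join(context.get_path(), 'keras_model.h5')  # assumes context.get_path()
        step.model.save(path)
        step.model = None  # strip the unpicklable Keras object before pickling
        return step

    def can_load(self, step: BaseStep, context: ExecutionContext) -> bool:
        return os.path.exists(os.path.join(context.get_path(), 'keras_model.h5'))

    def load_step(self, step: BaseStep, context: ExecutionContext) -> BaseStep:
        step.model = tf.keras.models.load_model(os.path.join(context.get_path(), 'keras_model.h5'))
        return step

Such a saver would then be passed in the step's savers list at construction time.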

stale[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs in the next 180 days. Thank you for your contributions.