Neuraxio / Neuraxle

The world's cleanest AutoML library ✨ - Do hyperparameter tuning with the right pipeline abstractions to write clean deep learning production pipelines. Let your pipeline steps have hyperparameter spaces. Design steps in your pipeline like components. Compatible with Scikit-Learn, TensorFlow, and most other libraries, frameworks and MLOps environments.
https://www.neuraxle.org/
Apache License 2.0
608 stars 62 forks source link

Make BaseValidationSplitter consistent with current_ids. #475

Closed jeromebedard12 closed 3 years ago

jeromebedard12 commented 3 years ago

For: https://github.com/Neuraxio/Neuraxle/blob/3091bbc175be33b3b415386cf8a4dc79ce48d368/neuraxle/metaopt/auto_ml.py#L996-L1006

Suggestion:

@abstractmethod
def split(self, data_inputs, current_ids=None, expected_outputs=None, context: ExecutionContext = None) -> Tuple[
    List, List, List, List, List, List]:
    """
    Train/Test split data inputs and expected outputs.
    :param data_inputs: data inputs
    :param expected_outputs: expected outputs (optional)
    :param context: execution context (optional)
    :return: data_inputs_train, current_ids_train, expected_outputs_train, data_inputs_validation, current_ids_validation, expected_outputs_validation
    """
    pass

For: https://github.com/Neuraxio/Neuraxle/blob/3091bbc175be33b3b415386cf8a4dc79ce48d368/neuraxle/metaopt/auto_ml.py#L972-L974

Suggestion:

train_data_container = DataContainer(
    data_inputs=train_data_inputs, 
    current_ids=current_ids_train,
    expected_outputs=train_expected_outputs
)
validation_data_container = DataContainer(
    data_inputs=validation_data_inputs,
    current_ids=current_ids_validation,
    expected_outputs=validation_expected_outputs
)

For instance, if we end up with data_inputs 0, 1, 2 in train set and data_inputs 3, 4 in validation set, we might want to end up with: current_ids_train = ['0', '1', '2'] and current_ids_validation = ['3', '4'] (or whatever current_ids the former data_container had) and not being forced to have: current_ids_train = ['0', '1', '2'] and current_ids_validation = ['0', '1']

vincent-antaki commented 3 years ago

Hello @jeromebedard12 !

I think it's a fair point that you bring. One should not be stuck with the default ids by using a splitter. I'm even wondering why do we have two separate functions for splitting (split_data_container and split). It seems to me we'd be better off having a single split function with the current signature of split_data_container. What do you think?

Also, is this modification needed urgently on your side? I'm a bit swamped at the moment and would rather perform that modification next week as long as it's not a major inconvenience on your side.

jeromebedard12 commented 3 years ago

I think it's a fair point that you bring. One should not be stuck with the default ids by using a splitter. I'm even wondering why do we have two separate functions for splitting (split_data_container and split). It seems to me we'd be better off having a single split function with the current signature of split_data_container. What do you think?

Well I like the idea of split_data_container being well defined and not abstract. It's just a method that reconstructs the data_containers properly using the splits from the split method. I don't see how one would want this function to behave differently. It makes sense to me that the abstraction of the spliting logic is strictly nested inside the split method and not too broadly in the split_data_container method.

Also, is this modification needed urgently on your side? I'm a bit swamped at the moment and would rather perform that modification next week as long as it's not a major inconvenience on your side.

Next week is fine with me!

vincent-antaki commented 3 years ago

Closing this issue. Fix has been merge.