firefly-cpp / NiaAML

Python automated machine learning framework.

Support for regression tasks and feature selection #69

Closed hanamthang closed 7 months ago

hanamthang commented 2 years ago

Thank you very much for your hard work on creating a good Python package like NiaAML. Could you please consider supporting regression tasks and feature selection in NiaAML?

I use remote sensing data (satellite imagery) to retrieve a biophysical parameter (blue carbon ;-)) through machine learning regression, and I need to select the features that contribute the most from a suite of input features.

Many thanks, Thang

firefly-cpp commented 8 months ago

@hanamthang: @LaurenzBeck is now working on regression support.

LaurenzBeck commented 8 months ago

Some things I identified that need to be implemented for the new feature:

✅ Checklist

LaurenzBeck commented 7 months ago

I just had my first read through the README, the documentation and the tests.

🔥 Problem

Imho, the codebase is rather tightly coupled to classification-specific details and does not have the highest cohesion. There are two options for implementing the regression feature, and I want to get feedback on which path to take. The current API looks like this:

from niaaml import PipelineOptimizer

# data_reader is a data reader instance constructed earlier (e.g. from a CSV file)
pipeline_optimizer = PipelineOptimizer(
    data=data_reader,
    classifiers=['AdaBoost', 'Bagging', 'MultiLayerPerceptron', 'RandomForest', 'ExtremelyRandomizedTrees', 'LinearSVC'],
    feature_selection_algorithms=['SelectKBest', 'SelectPercentile', 'ParticleSwarmOptimization', 'VarianceThreshold'],
    feature_transform_algorithms=['Normalizer', 'StandardScaler']
)

🧐 Options

  1. Add new abstractions and hierarchies to better differentiate between tasks semantically

This option entails adding classes like Estimator or Predictor and specific child classes like Classifier and Regressor. We would also need new hierarchies for the tasks and metrics. A rough sketch of such a hierarchy follows the usage example below.

pipeline_optimizer = PipelineOptimizer(
    data=data_reader,
    estimators=['SVMRegressor', 'LinearRegressor'],
    task="regression",
    feature_selection_algorithms=['SelectKBest', 'SelectPercentile', 'ParticleSwarmOptimization', 'VarianceThreshold'],
    feature_transform_algorithms=['Normalizer', 'StandardScaler']
)
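
To make the first option more concrete, here is a rough sketch of the kind of hierarchy it would introduce. The class and method names below are illustrative assumptions for this proposal, not existing NiaAML code:

from abc import ABC, abstractmethod

class Estimator(ABC):
    """Illustrative common base class for all prediction components."""

    @abstractmethod
    def set_parameters(self, **kwargs):
        """Set the component's hyperparameters."""

    @abstractmethod
    def fit(self, x, y):
        """Fit the underlying model."""

    @abstractmethod
    def predict(self, x):
        """Predict targets for new samples."""

class Classifier(Estimator):
    """Classification-specific estimators (the existing components)."""

class Regressor(Estimator):
    """Regression-specific estimators (the new components)."""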
  2. Add new components in a less invasive way, alongside the existing components

This option is way less invasive, as it merely adds components. We do not need to change the API in major ways.
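
For comparison, a hypothetical sketch of what option 2 could look like from the user's perspective, assuming new regressor components are simply registered alongside the existing ones and passed through the unchanged arguments (the component names 'LinearRegression' and 'RidgeRegression' are illustrative):

pipeline_optimizer = PipelineOptimizer(
    data=data_reader,
    classifiers=['LinearRegression', 'RidgeRegression'],  # hypothetical regressor components, reusing the existing argument
    feature_selection_algorithms=['SelectKBest', 'SelectPercentile', 'VarianceThreshold'],
    feature_transform_algorithms=['Normalizer', 'StandardScaler']
)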

➡️ Rationale

Given that regression was not the main task when the package was designed, that the semantic inconsistencies of sticking to the classifier wording are not that critical, and that the scope of my project is quite limited, I have a slight preference for option 2.

What do you say, @firefly-cpp?

firefly-cpp commented 7 months ago

Thanks. I totally support the second option.

LaurenzBeck commented 7 months ago

The coupling to the classification specifics is higher than I originally thought. I also have to adapt the feature selection and pipeline optimization parts, since those currently assume that fitness functions return values between 0 and 1. This will take me some time, sorry for the slow progress on this ticket.
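
For illustration only: one hypothetical way to bridge that assumption would be to squash an unbounded regression error into the expected range, for example:

from sklearn.metrics import mean_squared_error

def regression_fitness(y_true, y_pred):
    # Hypothetical sketch: the optimization code currently expects fitness
    # values between 0 and 1, while regression errors such as MSE are
    # unbounded, so one option is to squash the error into (0, 1].
    mse = mean_squared_error(y_true, y_pred)
    return 1.0 / (1.0 + mse)  # 1.0 for a perfect fit, approaching 0 as the error grows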

LaurenzBeck commented 7 months ago

I need some help in understanding https://github.com/firefly-cpp/NiaAML/blob/master/niaaml/pipeline.py#L468 - https://github.com/firefly-cpp/NiaAML/blob/master/niaaml/pipeline.py#L491

# Each entry in params_all is a pair: i[0] holds the component's parameter
# definitions, i[1] is the component instance itself.
for i in params_all:
    args = dict()
    for key in i[0]:
        if i[0][key] is not None:
            if isinstance(i[0][key].value, MinMax):
                # Numeric parameter: derive a value from the current
                # solution-vector entry using the MinMax bounds.
                val = (
                    solution_vector[solution_index] * i[0][key].value.max
                    + i[0][key].value.min
                )
                if (
                    i[0][key].param_type is np.intc
                    or i[0][key].param_type is int
                    or i[0][key].param_type is np.uintc
                    or i[0][key].param_type is np.uint
                ):
                    # Integer parameter: floor the value and keep it below max.
                    val = i[0][key].param_type(np.floor(val))
                    if val >= i[0][key].value.max:
                        val = i[0][key].value.max - 1
                args[key] = val
            else:
                # Discrete parameter: map the solution-vector entry onto one
                # of the listed values via binning.
                args[key] = i[0][key].value[
                    get_bin_index(
                        solution_vector[solution_index], len(i[0][key].value)
                    )
                ]
        solution_index += 1
    if i[1] is not None:
        i[1].set_parameters(**args)

-> this seems to be some custom (unfortunately undocumented) preprocessing of the parameter configurations.

I do understand the need to call component.set_parameters(**args) in the framework, but I do not understand:

  1. the connection/need/purpose of the solution_vector
  2. the logic behind the preprocessing for the MinMax case
  3. the need for the get_bin_index function

Could you help me with some clarifications @firefly-cpp? 🙏

My first intuition was to remove the parameter preprocessing part, but that is dangerous as long as I do not understand those parts...

LaurenzBeck commented 7 months ago

I found the explanation here: https://niaaml.readthedocs.io/en/latest/getting_started.html#optimization-process-and-parameter-tuning
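
In short, the optimizer works on a continuous solution vector, and each entry is decoded into a concrete hyperparameter value: either scaled into a MinMax range or binned onto a list of discrete choices. A minimal sketch of that decoding idea, assuming entries lie in [0, 1]; this is an illustration rather than NiaAML's exact implementation:

import math

def decode_min_max(gene, low, high, as_int=False):
    # Scale a solution-vector entry from [0, 1] into the parameter's range.
    val = low + gene * (high - low)
    if as_int:
        return min(int(math.floor(val)), high - 1)
    return val

def decode_choice(gene, choices):
    # Map a solution-vector entry from [0, 1] onto one of the discrete choices.
    index = min(int(math.floor(gene * len(choices))), len(choices) - 1)
    return choices[index]

print(decode_min_max(0.42, 10, 100, as_int=True))  # 47
print(decode_choice(0.42, ['gini', 'entropy']))    # 'gini'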

firefly-cpp commented 7 months ago

Thanks, @LaurenzBeck, for all the hard work.

As I stated before, the documentation is not in the best shape, and thus it should be modified and updated as soon as possible.