firefly-cpp / NiaAML

Python automated machine learning framework.
MIT License

Support for regression tasks and feature selection #69

Closed: hanamthang closed this 5 months ago

hanamthang commented 2 years ago

Thank you very much for your hard work on creating a good Python package like NiaAML. Could you please consider supporting regression tasks and feature selection in NiaAML?

I use remotely sensed data (satellite imagery) to retrieve a biophysical parameter (blue carbon ;-)) through machine learning regression, and I need to select the features that contribute the most from a suite of input features.

Many thanks, Thang

firefly-cpp commented 6 months ago

@hanamthang: @LaurenzBeck is now working on regression support.

LaurenzBeck commented 6 months ago

Some things I identified that need to be done to implement the new feature:

βœ… Checklist

LaurenzBeck commented 6 months ago

I just had my first read through the README, the documentation and the tests.

πŸ”₯Problem

Imho, the codebase is rather tightly coupled to classification-specific details and does not have the highest cohesion. There are two options for implementing the regression feature, and I want to get feedback on which path to take. For reference, this is the current API:

pipeline_optimizer = PipelineOptimizer(
    data=data_reader,
    classifiers=['AdaBoost', 'Bagging', 'MultiLayerPerceptron', 'RandomForest', 'ExtremelyRandomizedTrees', 'LinearSVC'],
    feature_selection_algorithms=['SelectKBest', 'SelectPercentile', 'ParticleSwarmOptimization', 'VarianceThreshold'],
    feature_transform_algorithms=['Normalizer', 'StandardScaler']
)

🧐 Options

  1. Add new abstractions and hierarchies to better differentiate between different tasks semantically

This option entails adding classes like Estimator or Predictor with specific child classes like Classifier and Regressor. We would also need new hierarchies for the tasks and metrics.

pipeline_optimizer = PipelineOptimizer(
    data=data_reader,
    estimators=['SVMRegressor', 'LinearRegressor'],
    task="regression",
    feature_selection_algorithms=['SelectKBest', 'SelectPercentile', 'ParticleSwarmOptimization', 'VarianceThreshold'],
    feature_transform_algorithms=['Normalizer', 'StandardScaler']
)
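To make the idea more concrete, here is a rough sketch of what such a hierarchy could look like (the class names below are purely illustrative and do not exist in NiaAML today):

# Hypothetical hierarchy sketch for option 1 -- illustrative only.
from abc import ABC, abstractmethod


class Estimator(ABC):
    """Common base for anything that fits on features and predicts a target."""

    @abstractmethod
    def fit(self, x, y):
        ...

    @abstractmethod
    def predict(self, x):
        ...


class Classifier(Estimator):
    """Estimators evaluated with classification metrics (accuracy, F1, ...)."""


class Regressor(Estimator):
    """Estimators evaluated with regression metrics (MSE, R2, ...)."""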
  2. Add new components in a less invasive way, alongside the existing components

This option is way less invasive, as it merely adds components. We do not need to change the API in major ways.
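As a minimal sketch of what option 2 could look like (purely illustrative; the wrapper below is not part of NiaAML, and I am only assuming that regressor components would mirror the set_parameters/fit/predict shape of the existing classifier components):

# Illustrative sketch only: wrapping a scikit-learn regressor so it can be
# registered like the existing classifier components. The registration
# mechanism and base interface details are assumptions, not the real NiaAML API.
from sklearn.linear_model import LinearRegression


class LinearRegressor:
    """Regressor component following the same set_parameters/fit/predict shape
    used by the existing classifier components."""

    def __init__(self):
        self._model = LinearRegression()

    def set_parameters(self, **kwargs):
        # Rebuild the underlying model with the tuned hyperparameters.
        self._model = LinearRegression(**kwargs)

    def fit(self, x, y):
        self._model.fit(x, y)

    def predict(self, x):
        return self._model.predict(x)

With components like this, the PipelineOptimizer call above could stay essentially the same; only the component lists and the fitness function would need regression-aware counterparts.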

➑️ Rationale

Given that regression was not the main task when the package was designed, that the semantic inconsistencies from sticking to the classifier wording are not that critical, and that the scope of my project is quite limited, I have a slight preference for option 2.

What do you say @firefly-cpp ?

firefly-cpp commented 6 months ago

Thanks. I totally support the second option.

LaurenzBeck commented 5 months ago

The coupling to the classification specifics is higher than I originally thought. I also have to adapt the feature selection and pipeline optimization parts, since those currently assume that fitness functions deliver values between 0 and 1. This will take me some time; sorry for the slow progress on this ticket.
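To illustrate the problem (this is not the solution I am implementing, just a sketch of how an unbounded regression error could be squashed into the [0, 1] range the optimizer currently assumes):

from sklearn.metrics import mean_squared_error


def bounded_regression_fitness(y_true, y_pred):
    """Map an unbounded error (MSE) into (0, 1], where 1.0 means a perfect fit.

    Purely illustrative: 1 / (1 + MSE) squashes any non-negative error into
    the (0, 1] interval that the current fitness handling assumes.
    """
    mse = mean_squared_error(y_true, y_pred)
    return 1.0 / (1.0 + mse)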

LaurenzBeck commented 5 months ago

I need some help in understanding https://github.com/firefly-cpp/NiaAML/blob/master/niaaml/pipeline.py#L468 - https://github.com/firefly-cpp/NiaAML/blob/master/niaaml/pipeline.py#L491

for i in params_all:
    args = dict()
    for key in i[0]:
        if i[0][key] is not None:
            if isinstance(i[0][key].value, MinMax):
                val = (
                    solution_vector[solution_index] * i[0][key].value.max
                    + i[0][key].value.min
                )
                if (
                    i[0][key].param_type is np.intc
                    or i[0][key].param_type is int
                    or i[0][key].param_type is np.uintc
                    or i[0][key].param_type is np.uint
                ):
                    val = i[0][key].param_type(np.floor(val))
                    if val >= i[0][key].value.max:
                        val = i[0][key].value.max - 1
                args[key] = val
            else:
                args[key] = i[0][key].value[
                    get_bin_index(
                        solution_vector[solution_index], len(i[0][key].value)
                    )
                ]
        solution_index += 1
    if i[1] is not None:
        i[1].set_parameters(**args)

-> this seems to be some custom (unfortunately undocumented) preprocessing of the parameter configurations.

I do understand the need to call component.set_parameters(**args) in the framework, but I do not understand:

  1. the connection/need/purpose of the solution_vector
  2. the logic behind the preprocessing for the MinMax case
  3. the need for the get_bin_index function

Could you help me with some clarifications @firefly-cpp ? πŸ™

My first intuition was to remove the parameter preprocessing part, but that would be dangerous as long as I do not understand it...

LaurenzBeck commented 5 months ago

I found the explanation here: https://niaaml.readthedocs.io/en/latest/getting_started.html#optimization-process-and-parameter-tuning
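For anyone else landing here, my takeaway (as I understand the docs and the snippet above; the helpers below are illustrative, not the actual NiaAML code) is that the optimizer works on a vector of floats in [0, 1] and each entry is decoded into a concrete hyperparameter value: MinMax parameters are scaled into their range, and discrete parameters are mapped onto one of the listed values via a bin index:

import math


def decode_continuous(x, low, high, as_int=False):
    """Decode a solution-vector value x in [0, 1] into the MinMax range."""
    val = x * high + low  # mirrors the scaling in pipeline.py
    if as_int:
        val = int(math.floor(val))
        if val >= high:  # clamp the edge case x == 1.0
            val = high - 1
    return val


def decode_categorical(x, choices):
    """Decode a solution-vector value x in [0, 1] into one of the discrete
    choices by splitting [0, 1] into len(choices) equally sized bins
    (presumably what get_bin_index does)."""
    index = min(int(x * len(choices)), len(choices) - 1)
    return choices[index]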

firefly-cpp commented 5 months ago

Thanks, @LaurenzBeck, for all the hard work.

As I stated before, the documentation is not in the best shape, so it should be revised and updated as soon as possible.