Closed hanamthang closed 7 months ago
@hanamthang: @LaurenzBeck is now working on regression support.
Some things I identified for implementing the new feature:
I just had my first read through the README, the documentation and the tests.
Imho, the codebase has rather high coupling to the classification-task specifics and not the highest cohesion. There are two options for implementing the regression feature, and I want feedback on which path to take. For reference, the current API looks like this:
pipeline_optimizer = PipelineOptimizer(
data=data_reader,
classifiers=['AdaBoost', 'Bagging', 'MultiLayerPerceptron', 'RandomForest', 'ExtremelyRandomizedTrees', 'LinearSVC'],
feature_selection_algorithms=['SelectKBest', 'SelectPercentile', 'ParticleSwarmOptimization', 'VarianceThreshold'],
feature_transform_algorithms=['Normalizer', 'StandardScaler']
)
Option 1: introduce a task-aware class hierarchy. This entails adding classes like `Estimator` or `Predictor` and specific child classes like `Classifier` and `Regressor`. We would also need new hierarchies for the tasks and metrics. The API could then look like this:
pipeline_optimizer = PipelineOptimizer(
data=data_reader,
estimators=['SVMRegressor', 'LinearRegressor'],
task="regression",
feature_selection_algorithms=['SelectKBest', 'SelectPercentile', 'ParticleSwarmOptimization', 'VarianceThreshold'],
feature_transform_algorithms=['Normalizer', 'StandardScaler']
)
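For illustration, the hierarchy from option 1 could be sketched like this. All class names here are hypothetical and not part of NiaAML today; the `LinearRegressor` body is a toy least-squares fit only so the sketch is runnable:

```python
from abc import ABC, abstractmethod

import numpy as np


class Estimator(ABC):
    """Hypothetical common base class for every prediction component."""

    @abstractmethod
    def fit(self, x, y): ...

    @abstractmethod
    def predict(self, x): ...


class Classifier(Estimator):
    """Classification branch: accuracy-style metrics would apply here."""


class Regressor(Estimator):
    """Regression branch: error metrics such as MSE would apply here."""


class LinearRegressor(Regressor):
    """Tiny least-squares implementation, only to make the sketch concrete."""

    def fit(self, x, y):
        # Append a bias column and solve the least-squares problem.
        a = np.c_[np.asarray(x, float), np.ones(len(x))]
        self._coef, *_ = np.linalg.lstsq(a, np.asarray(y, float), rcond=None)
        return self

    def predict(self, x):
        a = np.c_[np.asarray(x, float), np.ones(len(x))]
        return a @ self._coef
```

The task-specific metric and task hierarchies would hang off `Classifier` and `Regressor` in the same way.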
Option 2: keep the classifier wording and simply register the regressors as additional components. This option is way less invasive, as it merely adds components; we would not need to change the API in major ways:
pipeline_optimizer = PipelineOptimizer(
data=data_reader,
classifiers=['SVMRegressor', 'LinearRegressor'],
feature_selection_algorithms=['SelectKBest', 'SelectPercentile', 'ParticleSwarmOptimization', 'VarianceThreshold'],
feature_transform_algorithms=['Normalizer', 'StandardScaler']
)
Given that regression was not the main task when the package was designed, that the semantic inconsistencies of sticking to the classifier wording are not that critical, and that the scope of my project is quite limited, I have a slight preference for option 2.
What do you say @firefly-cpp ?
Thanks. I totally support the second option.
The coupling to the classification specifics is higher than I originally thought. I also have to adapt the feature selection and pipeline optimization parts, since those currently assume that fitness functions deliver values between 0 and 1. This will take me some time; sorry for the slow progress on this ticket.
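Since regression metrics like MSE are unbounded, they would need to be squashed into [0, 1] before the optimizer can use them as fitness. One possible mapping (an illustration only, not necessarily what the library will settle on):

```python
import numpy as np


def regression_fitness(y_true, y_pred):
    # Map an unbounded error (MSE) into (0, 1]: 1.0 for a perfect fit,
    # approaching 0 as the error grows.  One possible mapping only.
    mse = float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
    return 1.0 / (1.0 + mse)


print(regression_fitness([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
print(regression_fitness([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))  # 0.5
```

Other monotone mappings (e.g. based on R², which is already bounded above by 1) would work just as well, as long as "higher is better" is preserved.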
I need some help in understanding https://github.com/firefly-cpp/NiaAML/blob/master/niaaml/pipeline.py#L468 - https://github.com/firefly-cpp/NiaAML/blob/master/niaaml/pipeline.py#L491
for i in params_all:
    args = dict()
    for key in i[0]:
        if i[0][key] is not None:
            if isinstance(i[0][key].value, MinMax):
                # Continuous parameter: scale the solution-vector component
                # (a float in [0, 1]) into the parameter's MinMax range.
                val = (
                    solution_vector[solution_index] * i[0][key].value.max
                    + i[0][key].value.min
                )
                if (
                    i[0][key].param_type is np.intc
                    or i[0][key].param_type is int
                    or i[0][key].param_type is np.uintc
                    or i[0][key].param_type is np.uint
                ):
                    # Integer-typed parameter: floor the scaled value and
                    # clamp it below the exclusive upper bound.
                    val = i[0][key].param_type(np.floor(val))
                    if val >= i[0][key].value.max:
                        val = i[0][key].value.max - 1
                args[key] = val
            else:
                # Categorical parameter: pick one of the discrete values
                # via the component's bin index.
                args[key] = i[0][key].value[
                    get_bin_index(
                        solution_vector[solution_index], len(i[0][key].value)
                    )
                ]
            solution_index += 1
    if i[1] is not None:
        i[1].set_parameters(**args)
This seems to be some custom (unfortunately undocumented) preprocessing of the parameter configurations. I do understand the need to call `component.set_parameters(**args)` in the framework, but I do not understand the `MinMax` case and the `get_bin_index` function. Could you help me with some clarifications @firefly-cpp ?
My first intuition was to remove the parameter-preprocessing part, but that is dangerous as long as I do not understand it...
I found the explanation here: https://niaaml.readthedocs.io/en/latest/getting_started.html#optimization-process-and-parameter-tuning
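For future readers, the documented mapping can be illustrated with a small self-contained example. The `get_bin_index` below is a plausible re-implementation for illustration only, not the library's actual code, and the `MinMax(min=2, max=10)` parameter is made up:

```python
import numpy as np


def get_bin_index(value, n_bins):
    # Plausible re-implementation: split [0, 1] into n_bins equal bins
    # and return the bin that `value` falls into.
    return min(int(np.floor(value * n_bins)), n_bins - 1)


# Continuous integer parameter with a hypothetical range MinMax(min=2, max=10):
component = 0.4           # one entry of the [0, 1] solution vector
val = component * 10 + 2  # scaled the same way as in the pipeline.py snippet
val = int(np.floor(val))  # integer parameter types get floored
val = min(val, 10 - 1)    # and clamped below the upper bound
print(val)                # 6

# Categorical parameter with three possible values:
kernels = ["linear", "poly", "rbf"]
print(kernels[get_bin_index(0.7, len(kernels))])  # rbf
```

So each entry of the optimizer's solution vector in [0, 1] is decoded either into a scaled numeric value or into an index over a list of categorical choices.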
Thanks, @LaurenzBeck, for all the hard work.
As I stated before, documentation is not in the best shape, and thus, it should be modified and updated as soon as possible.
Thank you very much for your hard work on creating a good Python package like NiaAML. Could you please consider supporting regression tasks and feature selection in NiaAML?
I use remotely sensed data (satellite imagery) to retrieve a biophysical parameter (blue carbon ;-)) through machine-learning regression, and I need to select the most contributing features from a suite of input features.
Many thanks, Thang
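As a rough sketch of the requested workflow (rank a suite of input features by their contribution to a regression target), here is a simple correlation-based selector. This is an illustration only, not NiaAML's implementation, and the data is synthetic:

```python
import numpy as np


def top_k_features(x, y, k):
    # Rank features by absolute Pearson correlation with the target and
    # keep the k strongest.  A stand-in for a proper selector only.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = [abs(np.corrcoef(x[:, j], y)[0, 1]) for j in range(x.shape[1])]
    return np.argsort(scores)[::-1][:k]


# Synthetic example: column 0 is noise, column 1 drives the target.
rng = np.random.default_rng(0)
noise = rng.normal(size=(100, 1))
signal = rng.normal(size=(100, 1))
x = np.hstack([noise, signal])
y = 3 * signal[:, 0]
print(top_k_features(x, y, 1))  # [1]
```

A proper pipeline would of course use a selector that also captures non-linear contributions, which is exactly what the nature-inspired selection algorithms in NiaAML are for.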