aimclub / FEDOT

Automated modeling and machine learning framework FEDOT
https://fedot.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Atomized model operation #1227

Closed kasyanovse closed 4 months ago

kasyanovse commented 6 months ago

Closed as no longer relevant.

Linked:

Plan:

Figuring out why the results are poor

Mutations

H0: there is no difference in position within a generation between pipelines produced by particular mutations. H1: ~H0.
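The "position within a generation" in H0 can be read as a normalized rank: the individuals of one generation are sorted by fitness and each one gets `rank / generation size`, so 0 is the best individual if lower fitness is better (which is how the analysis code sorts them). A minimal sketch of that idea on plain tuples follows; `rank_positions` and the toy data are hypothetical and are not part of FEDOT.

```python
from collections import defaultdict


def rank_positions(generation):
    """Map each mutation name to the normalized ranks of its individuals.

    `generation` is a list of (mutation_name, fitness) pairs. Lower fitness
    is assumed to be better, so position 0.0 is the best individual.
    """
    ordered = sorted(generation, key=lambda pair: pair[1])
    positions = defaultdict(list)
    for rank, (mutation_name, _fitness) in enumerate(ordered):
        positions[mutation_name].append(rank / len(ordered))
    return dict(positions)


# Toy generation with made-up fitness values, for illustration only
toy_generation = [
    ('single_add_mutation', 0.40),
    ('parameter_change_mutation', 0.55),
    ('single_add_mutation', 0.35),
    ('single_drop_mutation', 0.60),
]
positions = rank_positions(toy_generation)
print(positions)
```

Comparing the mean position of one mutation against the mean position of all others then gives the "change in quantile" that the hypothesis is about.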

Conclusions and results:

  1. Changing the window is associated with higher fitness within a generation.
  2. Parameter mutations, and window mutations in particular, hurt fitness once the window is selected automatically with WindowSizeSelector from https://github.com/aimclub/FEDOT/pull/1186.
  3. PR https://github.com/aimclub/FEDOT/pull/1237 was created to add automatic window selection to FEDOT master.
  4. The new operations seem to make pipelines slightly better. The effect is small but statistically significant (provided the statistical tests were carried out correctly).
Code

```py
from itertools import chain

import numpy as np
import scipy.stats

# fedots is list with fitted Fedot objects
# Drop the initial generation, keep only individuals that have a parent operator
# and a sane fitness, then sort every generation from best to worst fitness.
inds = [gen for gen in chain(*[fedot.history.generations for fedot in fedots]) if gen.generation_num != 0]
inds = [[x for x in gen if x.fitness.values[0] < 10 and x.parent_operator is not None] for gen in inds]
inds = [sorted(gen, key=lambda x: x.fitness.values[0]) for gen in inds]

t1 = dict()  # mutation name -> positions within generation
ww, wow = list(), list()  # positions with / without a changed window
strs = ('atomized', 'atomized_ts_differ')  # , 'atomized_ts_to_time')
wstr, wostr = {k: list() for k in strs}, {k: list() for k in strs}
windows = dict()
print('there are', len(list(chain(*inds))), 'individuals')
for gen in inds:
    for _i, ind in enumerate(gen):
        mutation_name = ind.parent_operator.operators[0]
        # normalized rank inside the generation: 0 is the best individual
        individual_value = _i / len(gen)

        if 'parameter' in mutation_name:
            old_lagged = tuple([node.parameters['window_size']
                                for node in ind.parent_operator.parent_individuals[0].graph.nodes
                                if node.name == 'lagged' and node.parameters and 'window_size' in node.parameters])
            new_lagged = tuple([node.parameters['window_size']
                                for node in ind.graph.nodes
                                if node.name == 'lagged' and node.parameters and 'window_size' in node.parameters])
            if ind.native_generation not in windows:
                windows[ind.native_generation] = list()
            if new_lagged:
                windows[ind.native_generation].append(np.mean(new_lagged))
            (wow if new_lagged == old_lagged else ww).append(individual_value)

        for key in wstr:
            (wstr if key in str(ind.graph) else wostr)[key].append(individual_value)

        if mutation_name not in t1:
            t1[mutation_name] = list()
        t1[mutation_name].append(individual_value)


def stat(x, y):
    return np.mean(x) - np.mean(y)


def test(name, *args):
    # repeat the randomized permutation test to see how stable the p-value is
    pvalues = [scipy.stats.permutation_test(args, stat, n_resamples=1000).pvalue for _ in range(10)]
    return f"{name} | {stat(*args):.2f} | {np.mean(pvalues):.1%} [{np.min(pvalues):.1%}-{np.max(pvalues):.1%}]"


for key in wstr:
    print(test(key.upper(), wstr[key], wostr[key]))
if ww and wow:
    print(test('WINDOW MUTATION', ww, wow))
for mutation in t1:
    s1, s2 = t1[mutation], list(chain(*[t1[x] for x in t1 if x != mutation]))
    if len(s1) > 5 and len(s2) > 5:
        print(test(mutation, s1, s2))
```
Output

INITIAL OUTPUT WITHOUT AUTOMATIC WINDOW SELECTION

Mutation | Change in within-generation metric quantile | p-value | Significant difference
--- | --- | --- | ---
WINDOW MUTATION | -0.2 | 0.0% | !!!!!!
insert_atomized_operation | -0.2 | 0.0% | !!!!!!
single_edge_mutation | 0.2 | 0.0% | !!!!!!
single_change_mutation | -0.0 | 30.6% |
single_add_mutation | 0.0 | 65.1% |
parameter_change_mutation | 0.1 | 0.0% | !!!!!!
single_drop_mutation | -0.0 | 77.0% |

WITH AUTOMATIC WINDOW SELECTION

Mutation or model type | Change in within-generation metric quantile | p-value
--- | --- | ---
ATOMIZED | -0.01 | 44.6% [41.0%-48.2%]
WINDOW MUTATION | 0.12 | 2.6% [1.6%-4.0%]
insert_atomized_operation | 0.01 | 59.0% [51.3%-65.1%]
parameter_change_mutation | 0.03 | 11.2% [9.6%-12.8%]
single_add_mutation | -0.05 | 13.2% [11.0%-16.0%]
single_change_mutation | -0.10 | 0.2% [0.2%-0.2%]
single_drop_mutation | 0.05 | 9.7% [7.6%-11.4%]
single_edge_mutation | 0.14 | 0.2% [0.2%-0.2%]

WITH AUTOMATIC WINDOW SELECTION AND WITHOUT UNHELPFUL MUTATIONS

Mutation or model type | Change in within-generation metric quantile | p-value
--- | --- | ---
ATOMIZED | -0.04 | 0.2% [0.2%-0.2%]
insert_atomized_operation | -0.02 | 4.0% [3.4%-4.4%]
single_drop_mutation | 0.05 | 0.3% [0.2%-0.4%]
single_add_mutation | -0.03 | 1.6% [1.2%-2.2%]
single_change_mutation | 0.01 | 22.0% [18.6%-26.2%]
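A note on reading the `0.2% [0.2%-0.2%]` rows: a randomized `scipy.stats.permutation_test` with `n_resamples=1000` cannot report a two-sided p-value below `2 / (n_resamples + 1)`, which is about 0.2%. Those rows therefore most likely mean "no resample was as extreme as the observed difference" rather than an exact probability. The sketch below shows this on two synthetic, clearly separated samples; the data and the `mean_diff` helper are made up for illustration.

```python
import numpy as np
from scipy.stats import permutation_test


def mean_diff(x, y, axis=0):
    """Difference of sample means, vectorized along `axis` for SciPy."""
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)


n_resamples = 1000
# Two clearly separated synthetic samples (not FEDOT output)
better = np.linspace(0.0, 0.3, 30)
worse = np.linspace(0.6, 0.9, 30)

result = permutation_test((better, worse), mean_diff, vectorized=True,
                          n_resamples=n_resamples, alternative='two-sided')
# Smallest two-sided p-value a randomized test with this budget can report
p_floor = 2 / (n_resamples + 1)
print(f"p-value: {result.pvalue:.1%}, floor: {p_floor:.1%}")
```

If finer resolution is needed for the strongest effects, increasing `n_resamples` lowers this floor at the cost of runtime.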

Comparing metrics of atomized and non-atomized pipelines

A single Fedot run is performed on the given data without fixing the seed. The metrics of all models evaluated during optimization are collected per generation and split into two groups: metrics of atomized models and metrics of all other models.

H0: within one generation, the atomized and non-atomized metric distributions have equal expected values. H1: ~H0.

Conclusion: H0 is rejected. In a single generation with a large number of individuals, the expected metric values differ between atomized and non-atomized models.

Listing of the statistical test code

```py
import logging
from random import random
from statistics import mean

import numpy as np
from scipy.stats import f_oneway, ttest_ind, alexandergovern

from fedot.api.main import Fedot
from fedot.core.repository.tasks import Task, TaskTypesEnum, TsForecastingParams
from fedot.core.data.data import InputData
from fedot.core.repository.dataset_types import DataTypesEnum
from fedot.core.data.data_split import train_test_data_setup

RANDOM_SEED = 100
NUM_EXPERIMENTS = 10
ALPHA = 0.05


def get_data(data_length: int = 500, test_length: int = 100) -> InputData:
    # synthetic series: a sum of fixed and random harmonics plus a linear trend
    harmonics = [(0.1, 0.9), (0.1, 1), (0.1, 1.1), (0.05, 2), (0.05, 5), (1, 0.02)]
    for _ in range(5):
        harmonics += [(random() * 0.1 + 0.1, random() * 2)]

    time = np.linspace(0, 100, data_length)
    data = time * 0
    for g in harmonics:
        data += g[0] * np.sin(g[1] * 2 * np.pi / time[-1] * 25 * time)
    data += time * 0.1

    data = InputData(idx=np.arange(0, data.shape[0]),
                     features=data,
                     target=data,
                     task=Task(TaskTypesEnum.ts_forecasting,
                               TsForecastingParams(forecast_length=test_length)),
                     data_type=DataTypesEnum.ts)
    return train_test_data_setup(data,
                                 validation_blocks=1,
                                 split_ratio=(data_length - test_length) / ((data_length - test_length) + test_length))


def get_fitted_fedot(train: InputData, test: InputData, random_seed: int = RANDOM_SEED) -> Fedot:
    # random_seed is not passed to Fedot: the run is intentionally unseeded
    initial_assumption = None
    fedot = Fedot(problem='ts_forecasting',
                  task_params=TsForecastingParams(forecast_length=test.idx.shape[0]),
                  logging_level=logging.WARNING,
                  timeout=5,
                  pop_size=20,
                  num_of_generations=3,
                  n_jobs=10,
                  with_tuning=False,
                  initial_assumption=initial_assumption,
                  )
    fedot.fit(train)
    return fedot


if __name__ == '__main__':
    train, test = get_data()
    fedot = get_fitted_fedot(train, test)

    for population in fedot.history.generations:
        # split the fitness values of one generation into atomized / non-atomized
        atomized_metrics, nonatomized_metrics = [], []
        for individual in population:
            if individual.fitness.value < 1:
                if 'atomized' in individual.graph.descriptive_id:
                    atomized_metrics.append(individual.fitness.value)
                else:
                    nonatomized_metrics.append(individual.fitness.value)

        if len(atomized_metrics) and len(nonatomized_metrics):
            _, p_anova = f_oneway(atomized_metrics, nonatomized_metrics)
            _, p_ttest = ttest_ind(atomized_metrics, nonatomized_metrics)
            p_agovern = alexandergovern(atomized_metrics, nonatomized_metrics).pvalue

            print(f'\nAtomized metrics length: {len(atomized_metrics)}')
            print(f'Atomized metrics mean: {mean(atomized_metrics)}\n')
            print(f'Non-Atomized metrics length: {len(nonatomized_metrics)}')
            print(f'Non-Atomized metrics mean: {mean(nonatomized_metrics)}\n')
            print(f'ALEXANDERGOVERN: H0 {p_agovern > ALPHA} (p-value: {p_agovern})')
            print(f'ANOVA: H0 {p_anova > ALPHA} (p-value: {p_anova})')
            print(f'TTEST: H0 {p_ttest > ALPHA} (p-value: {p_ttest})\n')
```
Results for a single fedot run, one generation

```
Atomized metrics length: 23
Atomized metrics mean: 0.3193077976971991

Non-Atomized metrics length: 40
Non-Atomized metrics mean: 0.42342097047587063

ALEXANDERGOVERN: H0 False (p-value: 0.0009336975906277874)
ANOVA: H0 False (p-value: 0.0035480628380782806)
TTEST: H0 False (p-value: 0.0035480628380782525)
```
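The ANOVA and TTEST p-values above agree to many digits. That is expected: with exactly two groups, one-way ANOVA is equivalent to Student's t-test with pooled variance (F equals t squared), so the two lines are effectively one test. The sketch below demonstrates this on synthetic stand-ins for the two groups (not real FEDOT output) and adds Welch's t-test via `equal_var=False` as one option when the group variances may differ.

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two metric groups (made-up parameters)
atomized = rng.normal(loc=0.32, scale=0.05, size=23)
nonatomized = rng.normal(loc=0.42, scale=0.15, size=40)

f_stat, p_anova = f_oneway(atomized, nonatomized)
# Student's t-test with pooled variance (SciPy default)
t_stat, p_student = ttest_ind(atomized, nonatomized)
# Welch's t-test, which does not assume equal variances
_, p_welch = ttest_ind(atomized, nonatomized, equal_var=False)

print(f"ANOVA p={p_anova:.3g}, Student p={p_student:.3g}, Welch p={p_welch:.3g}")
```

Alexander-Govern, already used in the listing, also does not assume equal variances, so it and Welch's test are the more independent checks here.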
Results for several fedot runs on fixed data

diff is the absolute (not relative) difference: nonatomized mean minus atomized mean.

`atomized mean metric` | `nonatomized mean metric` | `diff`
--- | --- | ---
0.40310965163897533 | 0.5258254039784768 | 0.12271575233950144
0.38941925878670225 | 0.5029692174546525 | 0.11354995866795026
0.39899222822731417 | 0.4946715308181331 | 0.09567930259081892
0.3828210686266902 | 0.3825838994677389 | -0.0002371691589512781
0.40596368379091746 | 0.4952406389814257 | 0.08927695519050821
0.3794686018324255 | 0.3911405440142768 | 0.011671942181851303
0.38953820082509855 | 0.49915707765067835 | 0.1096188768255798
0.38192409680048195 | 0.4793310851871796 | 0.09740698838669765
0.3824244353971997 | 0.35815678898201725 | -0.024267646415182476
0.40142854838786285 | 0.5070242861220102 | 0.10559573773414738
0.3655888254449583 | 0.4032475104193803 | 0.037658684974421985
0.39817458028926306 | 0.5014504247826739 | 0.10327584449341082
0.3848995605685277 | 0.44268693656915314 | 0.057787376000625446
0.4030127105036263 | 0.5233831563043596 | 0.1203704458007333
0.3926023945342515 | 0.4266703314578744 | 0.03406793692362292
0.38813247164237114 | 0.49407357595967194 | 0.1059411043173008
0.3702466107812836 | 0.3582309620735185 | -0.012015648707765114

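One possible way to summarize the `diff` column across runs is a sign test: count how many runs have a positive difference and compare that count with a fair coin. This is only a sketch and was not part of the original analysis. It assumes the runs are independent, and it uses the `diff` values copied from the table above.

```python
from scipy.stats import binomtest

# `diff` column copied from the table above (nonatomized mean - atomized mean)
diffs = [
    0.12271575233950144, 0.11354995866795026, 0.09567930259081892,
    -0.0002371691589512781, 0.08927695519050821, 0.011671942181851303,
    0.1096188768255798, 0.09740698838669765, -0.024267646415182476,
    0.10559573773414738, 0.037658684974421985, 0.10327584449341082,
    0.057787376000625446, 0.1203704458007333, 0.03406793692362292,
    0.1059411043173008, -0.012015648707765114,
]

n_positive = sum(d > 0 for d in diffs)
# Under H0 a positive and a negative difference are equally likely
result = binomtest(n_positive, n=len(diffs), p=0.5, alternative='two-sided')
print(f"{n_positive} of {len(diffs)} runs have a positive diff, "
      f"sign-test p-value: {result.pvalue:.3f}")
```

A sign test ignores the size of each difference; a Wilcoxon signed-rank test would be a natural next step if magnitudes should count as well.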
pep8speaks commented 6 months ago

Hello @kasyanovse! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

- Line 15:34: F821 undefined name 'Pipeline'
- Line 27:92: F821 undefined name 'Pipeline'
- Line 33:35: F821 undefined name 'Pipeline'
- Line 44:43: F821 undefined name 'Pipeline'
- Line 51:36: F821 undefined name 'MetricCallable'
- Line 75:76: F821 undefined name 'PipelineNode'

- Line 5:1: F401 'fedot.core.operations.evaluation.evaluation_interfaces.SkLearnEvaluationStrategy' imported but unused
- Line 6:1: F401 'fedot.core.operations.evaluation.operation_implementations.data_operations.decompose.DecomposerRegImplementation' imported but unused
- Line 8:1: F401 'fedot.core.operations.evaluation.operation_implementations.data_operations.sklearn_filters.IsolationForestRegImplementation' imported but unused
- Line 10:1: F401 'fedot.core.operations.evaluation.operation_implementations.data_operations.sklearn_filters.LinearRegRANSACImplementation' imported but unused
- Line 10:1: F401 'fedot.core.operations.evaluation.operation_implementations.data_operations.sklearn_filters.NonLinearRegRANSACImplementation' imported but unused
- Line 12:1: F401 'fedot.core.operations.evaluation.operation_implementations.data_operations.sklearn_selectors.LinearRegFSImplementation' imported but unused
- Line 12:1: F401 'fedot.core.operations.evaluation.operation_implementations.data_operations.sklearn_selectors.NonLinearRegFSImplementation' imported but unused
- Line 22:1: F401 'fedot.core.operations.evaluation.operation_implementations.models.knn.FedotKnnRegImplementation' imported but unused
- Line 24:1: F401 'fedot.utilities.random.ImplementationRandomStateHandler' imported but unused

- Line 1:1: F401 'typing.Union' imported but unused
- Line 1:1: F401 'typing.Any' imported but unused
- Line 1:1: F401 'typing.Dict' imported but unused
- Line 6:1: F401 'fedot.core.operations.atomized_model.atomized_model.AtomizedModel' imported but unused
- Line 9:1: F401 'fedot.core.operations.operation_parameters.OperationParameters' imported but unused
- Line 12:1: F401 'fedot.core.pipelines.pipeline_node_factory.PipelineOptNodeFactory' imported but unused
- Line 13:1: F401 'fedot.core.pipelines.random_pipeline_factory.RandomPipelineFactory' imported but unused
- Line 14:1: F401 'fedot.core.repository.pipeline_operation_repository.PipelineOperationRepository' imported but unused
- Line 15:1: F401 'fedot.core.repository.tasks.TsForecastingParams' imported but unused
- Line 15:1: F401 'fedot.core.repository.tasks.Task' imported but unused

Line 14:53: W292 no newline at end of file

- Line 10:1: F401 'fedot.core.repository.tasks.TsForecastingParams' imported but unused
- Line 10:1: F401 'fedot.core.repository.tasks.Task' imported but unused
- Line 37:21: F841 local variable 'target' is assigned to but never used

- Line 3:1: F401 'numpy as np' imported but unused
- Line 10:1: F401 'fedot.core.repository.tasks.TsForecastingParams' imported but unused
- Line 10:1: F401 'fedot.core.repository.tasks.Task' imported but unused
- Line 56:39: E127 continuation line over-indented for visual indent
- Line 57:39: E127 continuation line over-indented for visual indent
- Line 58:39: E127 continuation line over-indented for visual indent

- Line 1:1: F401 'typing.Union' imported but unused
- Line 1:1: F401 'typing.Any' imported but unused
- Line 1:1: F401 'typing.Dict' imported but unused
- Line 6:1: F401 'fedot.core.operations.atomized_model.atomized_model.AtomizedModel' imported but unused
- Line 9:1: F401 'fedot.core.operations.operation_parameters.OperationParameters' imported but unused
- Line 12:1: F401 'fedot.core.pipelines.pipeline_node_factory.PipelineOptNodeFactory' imported but unused
- Line 13:1: F401 'fedot.core.pipelines.random_pipeline_factory.RandomPipelineFactory' imported but unused
- Line 15:1: F401 'fedot.core.repository.pipeline_operation_repository.PipelineOperationRepository' imported but unused
- Line 16:1: F401 'fedot.core.repository.tasks.TsForecastingParams' imported but unused
- Line 32:78: E231 missing whitespace after ','

Line 6:1: F401 'fedot.core.repository.operation_types_repository.get_operation_type_from_id' imported but unused

Line 36:30: F541 f-string is missing placeholders

- Line 5:1: F401 'typing.Callable' imported but unused
- Line 9:1: F401 'fedot.core.pipelines.node.PipelineNode' imported but unused
- Line 10:1: F401 'fedot.core.pipelines.pipeline.Pipeline' imported but unused
- Line 11:1: F401 'fedot.core.repository.operation_types_repository.OperationTypesRepository' imported but unused
- Line 13:1: F401 'golem.core.optimisers.genetic.operators.base_mutations.single_edge_mutation' imported but unused
- Line 13:1: F401 'golem.core.optimisers.genetic.operators.base_mutations.single_add_mutation' imported but unused
- Line 13:1: F401 'golem.core.optimisers.genetic.operators.base_mutations.single_change_mutation' imported but unused
- Line 13:1: F401 'golem.core.optimisers.genetic.operators.base_mutations.single_drop_mutation' imported but unused
- Line 18:1: F401 'golem.core.optimisers.optimization_parameters.GraphRequirements' imported but unused
- Line 19:1: F401 'golem.core.optimisers.optimizer.GraphGenerationParams' imported but unused
- Line 20:1: F401 'golem.core.optimisers.genetic.gp_params.GPAlgorithmParameters' imported but unused
- Line 33:121: E501 line too long (125 > 120 characters)

- Line 5:1: F401 'typing.Union' imported but unused
- Line 55:1: E303 too many blank lines (3)
- Line 83:5: E129 visually indented line with same indent as next logical line

Line 319:13: F841 local variable 'ex' is assigned to but never used

Line 4:1: F401 'itertools.chain' imported but unused

Line 9:1: F401 'fedot.core.composer.metrics.RMSE' imported but unused

- Line 9:121: E501 line too long (130 > 120 characters)
- Line 10:121: E501 line too long (132 > 120 characters)
- Line 11:121: E501 line too long (130 > 120 characters)

Comment last updated at 2023-12-22 08:59:54 UTC
codecov[bot] commented 6 months ago

Codecov Report

Attention: 138 lines in your changes are missing coverage. Please review.

Comparison is base (299ffba) 79.47% compared to head (7c2e51a) 78.45%. Report is 1 commits behind head on master.

:exclamation: Current head 7c2e51a differs from pull request most recent head 713b1d5. Consider uploading reports for the commit 713b1d5 to get more accurate results

Files | Patch % | Lines
--- | --- | ---
...e/operations/atomized_model/atomized_ts_sampler.py | 0.00% | 51 Missing :warning:
...re/operations/atomized_model/atomized_ts_differ.py | 0.00% | 42 Missing :warning:
...re/operations/atomized_model/atomized_ts_scaler.py | 0.00% | 39 Missing :warning:
...edot/core/optimisers/genetic_operators/mutation.py | 89.74% | 4 Missing :warning:
...t/core/operations/atomized_model/atomized_model.py | 0.00% | 1 Missing :warning:
...ore/operations/atomized_model/atomized_template.py | 0.00% | 1 Missing :warning:

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1227      +/-   ##
==========================================
- Coverage   79.47%   78.45%   -1.02%
==========================================
  Files         145      149       +4
  Lines        9928    10109     +181
==========================================
+ Hits         7890     7931      +41
- Misses       2038     2178     +140
```
