blue-yonder / tsfresh

Automatic extraction of relevant features from time series:
http://tsfresh.readthedocs.io
MIT License

[Feature request] Implement n_jobs=-2 like scikit-learn #817

Open mendel5 opened 3 years ago

mendel5 commented 3 years ago

When working with sklearn (scikit-learn) I am used to setting the parameter n_jobs=-2. As explained at https://scikit-learn.org/stable/glossary.html#term-n_jobs this means:

> n_jobs is an integer, specifying the maximum number of concurrently running workers. If 1 is given, no joblib parallelism is used at all, which is useful for debugging. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. For example with n_jobs=-2, all CPUs but one are used.
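
The rule quoted above can be written out as a small function. This is only a sketch of the convention as the glossary describes it: `resolve_n_jobs` is a hypothetical name (not a tsfresh or scikit-learn function), the CPU count is passed in explicitly, and both rejecting 0 and clamping to at least one worker are assumptions of this sketch.

```python
def resolve_n_jobs(n_jobs: int, n_cpus: int) -> int:
    """Hypothetical helper: map a scikit-learn style n_jobs to a worker count."""
    if n_jobs == 0:
        # Assumption of this sketch: 0 is not covered by the quoted rule.
        raise ValueError("n_jobs=0 is not covered by the scikit-learn convention")
    if n_jobs > 0:
        # Positive values are taken literally.
        return n_jobs
    # Negative values count back from the number of CPUs:
    # -1 -> n_cpus, -2 -> n_cpus - 1, and so on.
    # Assumption: never go below one worker.
    return max(1, n_cpus + 1 + n_jobs)


print(resolve_n_jobs(-1, 4))   # all CPUs on a 4-core machine
print(resolve_n_jobs(-2, 4))   # all CPUs but one on a 4-core machine
print(resolve_n_jobs(-2, 16))  # all CPUs but one on a 16-core machine
```

With these inputs the formula gives 4, 3 and 15 workers respectively.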

When I set the parameter n_jobs=-2 in the extract_features() function I get an error: ValueError: Number of processes must be at least 1.

If tsfresh accepted n_jobs=-2, it would be possible to write code that runs unchanged on different CPUs and tells tsfresh to "use all CPU cores except one". The code would then adapt to the CPU it is running on, whether that is an older 4-core Intel CPU or a newer Ryzen with 8, 12 or 16 cores.
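
Until negative values are accepted, the same effect can probably be had by computing the worker count by hand and passing a positive number. A minimal sketch, assuming `os.cpu_count()` reflects the cores you actually want to use; `all_but_one_core` is a hypothetical helper, and the `extract_features` call is left commented out because it needs a real DataFrame `df`:

```python
import os


def all_but_one_core(n_cpus=None):
    """Hypothetical helper: number of workers that leaves one core free."""
    if n_cpus is None:
        # os.cpu_count() may return None if the count cannot be determined.
        n_cpus = os.cpu_count() or 1
    # Always keep at least one worker, even on a single-core machine.
    return max(1, n_cpus - 1)


n_jobs = all_but_one_core()

# Pass the positive value instead of n_jobs=-2:
# from tsfresh import extract_features
# features = extract_features(df, column_id="id", column_sort="time", n_jobs=n_jobs)
```

The value is computed at runtime, so the same script leaves one core free on both a 4-core and a 16-core machine.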

nils-braun commented 3 years ago

That is a very good suggestion! Would you like to do a PR? The n_jobs value is mostly used in the MultiprocessingDistributor and in calculate_relevance_table (and mentioned in probably a bunch of docstrings).

mendel5 commented 3 years ago

> Would you like to do a PR?

I can try it. However it might take some weeks because I'm quite busy right now.

> MultiprocessingDistributor

Do you mean this one: https://github.com/blue-yonder/tsfresh/blob/main/tsfresh/utilities/distribution.py#L401?

> calculate_relevance_table

Do you mean this one: https://github.com/blue-yonder/tsfresh/blob/main/tsfresh/feature_selection/relevance.py#L31?

> and probably a bunch of docstrings

A grep search over the full repo returns this:

$ grep -rni "n_jobs"
docs/text/tsfresh_on_a_cluster.rst:27:`n_jobs`. This field defaults to
docs/text/tsfresh_on_a_cluster.rst:46:`n_jobs` and `chunksize`. Both behave analogue to the parameters
docs/text/tsfresh_on_a_cluster.rst:50:setting the parameter `n_jobs` to 0.
docs/text/tsfresh_on_a_cluster.rst:133:                         n_jobs=4)
tsfresh/feature_selection/relevance.py:37:    n_jobs=defaults.N_PROCESSES,
tsfresh/feature_selection/relevance.py:131:    :param n_jobs: Number of processes to use during the p-value calculation
tsfresh/feature_selection/relevance.py:132:    :type n_jobs: int
tsfresh/feature_selection/relevance.py:195:        if n_jobs == 0:
tsfresh/feature_selection/relevance.py:199:                processes=n_jobs,
tsfresh/feature_selection/relevance.py:230:            if n_jobs != 0:
tsfresh/feature_selection/relevance.py:297:        if n_jobs != 0:
tsfresh/feature_selection/selection.py:25:    n_jobs=defaults.N_PROCESSES,
tsfresh/feature_selection/selection.py:110:    :param n_jobs: Number of processes to use during the p-value calculation
tsfresh/feature_selection/selection.py:111:    :type n_jobs: int
tsfresh/feature_selection/selection.py:170:        n_jobs=n_jobs,
tsfresh/transformers/feature_selector.py:68:        n_jobs=defaults.N_PROCESSES,
tsfresh/transformers/feature_selector.py:101:        :param n_jobs: Number of processes to use during the p-value calculation
tsfresh/transformers/feature_selector.py:102:        :type n_jobs: int
tsfresh/transformers/feature_selector.py:144:        self.n_jobs = n_jobs
tsfresh/transformers/feature_selector.py:180:            n_jobs=self.n_jobs,
tsfresh/transformers/feature_augmenter.py:67:                 n_jobs=tsfresh.defaults.N_PROCESSES, show_warnings=tsfresh.defaults.SHOW_WARNINGS,
tsfresh/transformers/feature_augmenter.py:96:        :param n_jobs: The number of processes to use for parallelization. If zero, no parallelization is used.
tsfresh/transformers/feature_augmenter.py:97:        :type n_jobs: int
tsfresh/transformers/feature_augmenter.py:136:        self.n_jobs = n_jobs
tsfresh/transformers/feature_augmenter.py:205:                                              n_jobs=self.n_jobs, show_warnings=self.show_warnings,
tsfresh/transformers/relevant_feature_augmenter.py:96:        n_jobs=defaults.N_PROCESSES,
tsfresh/transformers/relevant_feature_augmenter.py:150:        :param n_jobs: The number of processes to use for parallelization. If zero, no parallelization is used.
tsfresh/transformers/relevant_feature_augmenter.py:151:        :type n_jobs: int
tsfresh/transformers/relevant_feature_augmenter.py:223:        self.n_jobs = n_jobs
tsfresh/transformers/relevant_feature_augmenter.py:325:                                                      n_jobs=self.feature_extractor.n_jobs,
tsfresh/transformers/relevant_feature_augmenter.py:395:            n_jobs=self.n_jobs,
tsfresh/transformers/relevant_feature_augmenter.py:410:            n_jobs=self.n_jobs,
tsfresh/convenience/relevant_extraction.py:27:                              n_jobs=defaults.N_PROCESSES,
tsfresh/convenience/relevant_extraction.py:89:    :param n_jobs: The number of processes to use for parallelization. If zero, no parallelization is used.
tsfresh/convenience/relevant_extraction.py:90:    :type n_jobs: int
tsfresh/convenience/relevant_extraction.py:168:                             n_jobs=n_jobs,
tsfresh/convenience/relevant_extraction.py:180:                            n_jobs=n_jobs,
tsfresh/scripts/measure_execution_time.py:46:    n_jobs = luigi.IntParameter()
tsfresh/scripts/measure_execution_time.py:59:        extract_features(df, column_id="id", column_sort="time", n_jobs=self.n_jobs,
tsfresh/scripts/measure_execution_time.py:70:            "n_jobs": self.n_jobs,
tsfresh/scripts/measure_execution_time.py:84:    n_jobs = luigi.IntParameter()
tsfresh/scripts/measure_execution_time.py:96:        extract_features(df, column_id="id", column_sort="time", n_jobs=self.n_jobs,
tsfresh/scripts/measure_execution_time.py:103:            "n_jobs": self.n_jobs,
tsfresh/scripts/measure_execution_time.py:121:                                     n_jobs=job,
tsfresh/scripts/measure_execution_time.py:125:                                     n_jobs=job,
tsfresh/scripts/measure_execution_time.py:133:                        n_jobs=job,
tsfresh/scripts/measure_execution_time.py:142:                            n_jobs=job,
tsfresh/feature_extraction/extraction.py:30:                     n_jobs=defaults.N_PROCESSES, show_warnings=defaults.SHOW_WARNINGS,
tsfresh/feature_extraction/extraction.py:91:    :param n_jobs: The number of processes to use for parallelization. If zero, no parallelization is used.
tsfresh/feature_extraction/extraction.py:92:    :type n_jobs: int
tsfresh/feature_extraction/extraction.py:155:                                n_jobs=n_jobs, chunk_size=chunksize,
tsfresh/feature_extraction/extraction.py:177:                   n_jobs, chunk_size, disable_progressbar, show_warnings, distributor,
tsfresh/feature_extraction/extraction.py:214:    :param n_jobs: The number of processes to use for parallelization. If zero, no parallelization is used.
tsfresh/feature_extraction/extraction.py:215:    :type n_jobs: int
tsfresh/feature_extraction/extraction.py:235:            if n_jobs == 0:
tsfresh/feature_extraction/extraction.py:239:                distributor = MultiprocessingDistributor(n_workers=n_jobs,
tsfresh/utilities/dataframe_functions.py:315:                     n_jobs=defaults.N_PROCESSES, show_warnings=defaults.SHOW_WARNINGS,
tsfresh/utilities/dataframe_functions.py:374:    :param n_jobs: The number of processes to use for parallelization. If zero, no parallelization is used.
tsfresh/utilities/dataframe_functions.py:375:    :type n_jobs: int
tsfresh/utilities/dataframe_functions.py:416:                                      n_jobs=n_jobs,
tsfresh/utilities/dataframe_functions.py:478:        if n_jobs == 0:
tsfresh/utilities/dataframe_functions.py:482:            distributor = MultiprocessingDistributor(n_workers=n_jobs,
notebooks/advanced/compare-runtimes-of-feature-calculators.ipynb:173:    "                                                n_jobs=0, \n",
tests/benchmark.py:28:    benchmark(extract_features, df, column_id="id", column_sort="time", n_jobs=0,
tests/benchmark.py:35:    benchmark(extract_features, df, column_id="id", column_sort="time", n_jobs=0,
tests/benchmark.py:43:    benchmark(extract_relevant_features, df, y, column_id="id", column_sort="time", n_jobs=0,
tests/units/feature_selection/test_relevance.py:84:        relevance_table = calculate_relevance_table(X, y_binary, n_jobs=0)
tests/units/feature_selection/test_relevance.py:103:        relevance_table = calculate_relevance_table(X, y_real, n_jobs=0)
tests/units/feature_selection/test_relevance.py:138:                X, y_real, n_jobs=0, ml_task="regression", show_warnings=True
tests/units/transformers/test_feature_augmenter.py:24:                                     n_jobs=0,
tests/units/transformers/test_feature_augmenter.py:60:                                     n_jobs=0,
tests/units/transformers/test_feature_augmenter.py:87:                                     n_jobs=0,
tests/units/feature_extraction/test_extraction.py:22:        self.n_jobs = 1
tests/units/feature_extraction/test_extraction.py:30:                                              n_jobs=self.n_jobs)
tests/units/feature_extraction/test_extraction.py:44:                                                  n_jobs=self.n_jobs)
tests/units/feature_extraction/test_extraction.py:54:                                              column_value="val", n_jobs=self.n_jobs,
tests/units/feature_extraction/test_extraction.py:121:                                              n_jobs=self.n_jobs).sort_index()
tests/units/feature_extraction/test_extraction.py:126:                                                          n_jobs=self.n_jobs).sort_index()
tests/units/feature_extraction/test_extraction.py:140:        X = extract_features(df, column_id="id", column_value="val", n_jobs=self.n_jobs,
tests/units/feature_extraction/test_extraction.py:152:        extract_features(df, column_id="id", column_value="val", n_jobs=self.n_jobs,
tests/units/feature_extraction/test_extraction.py:164:                             n_jobs=self.n_jobs)
tests/units/feature_extraction/test_extraction.py:173:                                             n_jobs=self.n_jobs)
tests/units/feature_extraction/test_extraction.py:177:                                           n_jobs=0)
tests/units/feature_extraction/test_extraction.py:188:                                              n_jobs=self.n_jobs)
tests/units/feature_extraction/test_extraction.py:210:        self.n_jobs = 2
tests/units/feature_extraction/test_extraction.py:226:                                              n_jobs=self.n_jobs)
tests/units/feature_extraction/test_settings.py:59:                                 n_jobs=0)
tests/units/feature_extraction/test_settings.py:64:                                 n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:23:                          rolling_direction=1, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:29:                          rolling_direction=1, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:34:                          rolling_direction=1, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:40:                          rolling_direction=1, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:45:                          rolling_direction=1, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:50:                          rolling_direction=0, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:55:                          rolling_direction=0, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:62:                          rolling_direction=1, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:68:                          rolling_direction=1, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:75:                          rolling_direction=1, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:114:                                                  column_kind=None, rolling_direction=1, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:122:                                                  max_timeshift=4, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:130:                                                  max_timeshift=2, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:154:                                                  max_timeshift=2, min_timeshift=2, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:208:                                                  column_kind=None, rolling_direction=-1, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:216:                                                  max_timeshift=None, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:224:                                                  max_timeshift=1, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:247:                                                  max_timeshift=2, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:272:                                                  max_timeshift=4, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:298:                                                  min_timeshift=2, max_timeshift=3, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:348:                                                  column_kind=None, rolling_direction=2, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:368:                                                  column_kind=None, rolling_direction=-2, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:404:                                                  column_kind="kind", rolling_direction=-1, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:427:                                                  rolling_direction=-1, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:477:                                                  rolling_direction=-1, max_timeshift=1, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:571:                                                 column_kind=None, rolling_direction=1, n_jobs=0)
tests/units/utilities/test_dataframe_functions.py:623:                                                  column_kind=None, rolling_direction=1, n_jobs=0)

nils-braun commented 3 years ago

> I can try it. However it might take some weeks because I'm quite busy right now.

That would be awesome! If this is not fast enough for you, I can also try to have a look - but more contributors is always better :-)

> Do you mean this one:

Yes and yes. Sorry, I was on my smartphone. Thanks for providing the links. These two places are basically the only ones where n_jobs is actually used (the rest just pass it along).

> A grep search over the full repo returns this:

Here are the docstrings that one would need to fix (the rest is not relevant, as only variables are passed).

tsfresh/convenience/relevant_extraction.py:    :param n_jobs: The number of processes to use for parallelization. If zero, no parallelization is used.
tsfresh/feature_extraction/extraction.py:    :param n_jobs: The number of processes to use for parallelization. If zero, no parallelization is used.
tsfresh/feature_extraction/extraction.py:    :param n_jobs: The number of processes to use for parallelization. If zero, no parallelization is used.
tsfresh/feature_selection/relevance.py:    :param n_jobs: Number of processes to use during the p-value calculation
tsfresh/feature_selection/selection.py:    :param n_jobs: Number of processes to use during the p-value calculation
tsfresh/transformers/feature_augmenter.py:        :param n_jobs: The number of processes to use for parallelization. If zero, no parallelization is used.
tsfresh/transformers/feature_selector.py:        :param n_jobs: Number of processes to use during the p-value calculation
tsfresh/transformers/relevant_feature_augmenter.py:        :param n_jobs: The number of processes to use for parallelization. If zero, no parallelization is used.
tsfresh/utilities/dataframe_functions.py:    :param n_jobs: The number of processes to use for parallelization. If zero, no parallelization is used.

nils-braun commented 3 years ago

You could have a look at https://github.com/blue-yonder/tsfresh/pull/852/files for a starting point :-)