Closed by antoinecarme 4 years ago
Using
python3 -m cProfile tests/long_term_forecasts/test_ozone_long_series_hours.py
to perform the profiling (no parallelization code activated, one CPU used).
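For readability, the same profile can also be saved to a file and sorted with the standard pstats module. This is a minimal sketch only; the output file name is an assumption, not something from the issue:

```python
import pstats

# Assumes the test was run with an output file, e.g.:
#   python3 -m cProfile -o ozone_hours.prof tests/long_term_forecasts/test_ozone_long_series_hours.py
stats = pstats.Stats("ozone_hours.prof")
# Sort by cumulative time and show the 25 most expensive entries,
# matching the "most time-consuming functions" view below.
stats.sort_stats("cumulative").print_stats(25)
```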
Result without code change (most time-consuming functions):
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
1829/1 0.039 0.000 599.401 599.401 {built-in method builtins.exec}
1 0.000 0.000 599.401 599.401 test_ozone_long_series_hours.py:1(<module>)
995 0.037 0.000 420.300 0.422 series.py:3719(apply)
497 0.107 0.000 419.416 0.844 TimeSeriesModel.py:138(forecastOneStepAhead)
1635 77.331 0.047 418.176 0.256 {pandas._libs.lib.map_infer}
501 2.001 0.004 331.140 0.661 Time.py:238(compute_normalize_date_column)
1 0.003 0.003 329.553 329.553 ForecastEngine.py:22(train)
1 0.000 0.000 329.550 329.550 SignalDecomposition.py:684(train)
497 0.022 0.000 315.976 0.636 Time.py:86(transformDataset)
11453/11307 0.153 0.000 300.922 0.027 managers.py:368(apply)
1092 0.010 0.000 299.609 0.274 generic.py:5563(astype)
1092 0.003 0.000 299.513 0.274 managers.py:581(astype)
1092 0.023 0.000 299.028 0.274 blocks.py:554(astype)
594 0.389 0.001 298.946 0.503 blocks.py:2199(astype)
613 0.005 0.000 297.808 0.486 datetimelike.py:614(astype)
613 0.003 0.000 297.797 0.486 datetimelike.py:434(_box_values)
595 0.008 0.000 297.783 0.500 blocks.py:2141(get_values)
594 0.004 0.000 297.760 0.501 datetimes.py:579(astype)
48208454 219.686 0.000 259.841 0.000 datetimes.py:476(<lambda>)
Updated the time column and row number columns only when needed (huge performance improvement).
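A minimal sketch of that guard pattern, with hypothetical column names and a simple min-max normalization (not PyAF's actual implementation): the expensive helper columns are recomputed only when they are missing or do not cover every row.

```python
import numpy as np
import pandas as pd

def ensure_helper_columns(df, time_col, norm_col="_time_norm", row_col="_row_nb"):
    # Recompute the normalized time column only when it is absent or stale
    # (e.g. new rows were appended and show NaN in this column).
    if norm_col not in df.columns or df[norm_col].isnull().any():
        t = df[time_col]
        # Min-max normalization; works for numeric and datetime columns.
        df[norm_col] = (t - t.min()) / (t.max() - t.min())
    # Same guard for the row number column.
    if row_col not in df.columns or df[row_col].isnull().any():
        df[row_col] = np.arange(len(df))
    return df
```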
Result after this change (total test time went down from 599 to 284 seconds):
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
1829/1 0.038 0.000 283.980 283.980 {built-in method builtins.exec}
1 0.000 0.000 283.980 283.980 test_ozone_long_series_hours.py:1(<module>)
1 0.002 0.002 205.064 205.064 ForecastEngine.py:22(train)
1 0.000 0.000 205.062 205.062 SignalDecomposition.py:684(train)
1 0.000 0.000 185.684 185.684 SignalDecomposition.py:438(train_not_threaded)
1 0.000 0.000 185.684 185.684 SignalDecomposition.py:336(train)
4 0.001 0.000 185.675 46.419 SignalDecomposition.py:213(train)
499 0.014 0.000 106.794 0.214 series.py:3719(apply)
643 22.915 0.036 105.922 0.165 {pandas._libs.lib.map_infer}
497 0.101 0.000 104.924 0.211 TimeSeriesModel.py:138(forecastOneStepAhead)
449 0.092 0.000 62.524 0.139 Scikit_Models.py:108(transformDataset)
3 0.032 0.011 56.751 18.917 TimeSeriesModel.py:192(forecast)
4 0.000 0.000 54.537 13.634 SignalDecomposition_Cycle.py:406(estimateAllCycles)
1 0.000 0.000 54.209 54.209 TimeSeriesModel.py:311(standardPlots)
1 0.000 0.000 54.209 54.209 SignalDecomposition.py:744(standardPlots)
1 0.000 0.000 54.209 54.209 ForecastEngine.py:47(standardPlots)
4 0.015 0.004 54.167 13.542 SignalDecomposition_AR.py:255(estimate)
48 0.036 0.001 53.813 1.121 SignalDecomposition_AR.py:190(estimate_ar_models_for_cycle)
4 0.005 0.001 53.132 13.283 SignalDecomposition_Cycle.py:349(estimateCycles)
Next improvement: estimateAllCycles
Cycle computations can be improved by using enums instead of strings to designate the different date parts. Date-part computations at the pandas level are not optimal (many internal type casts).
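As an illustration of the enum idea (the names below are hypothetical, not PyAF's actual API), date parts can be identified by an Enum and extracted once per column through the vectorized .dt accessor, avoiding per-row apply calls and repeated string comparisons:

```python
from enum import Enum
import pandas as pd

class DatePart(Enum):
    HOUR = "hour"
    DAY_OF_WEEK = "dayofweek"
    DAY_OF_MONTH = "day"
    MONTH = "month"
    WEEK_OF_YEAR = "weekofyear"

def get_date_part(series: pd.Series, part: DatePart) -> pd.Series:
    # One vectorized call on the whole datetime column; no per-row apply
    # and no string-keyed dispatch in the inner loop.
    if part is DatePart.HOUR:
        return series.dt.hour
    if part is DatePart.DAY_OF_WEEK:
        return series.dt.dayofweek
    if part is DatePart.DAY_OF_MONTH:
        return series.dt.day
    if part is DatePart.MONTH:
        return series.dt.month
    if part is DatePart.WEEK_OF_YEAR:
        return series.dt.isocalendar().week  # pandas >= 1.1
    raise ValueError(part)
```

The large {pandas._libs.lib.map_infer} entries in the profiles above come from applying Python callables element-wise; vectorized .dt calls stay inside pandas/NumPy and bypass that path.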
Result after this change (total test time went down from 284 to 214 seconds):
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
1829/1 0.038 0.000 214.378 214.378 {built-in method builtins.exec}
1 0.000 0.000 214.378 214.378 test_ozone_long_series_hours.py:1(<module>)
1 0.011 0.011 134.813 134.813 ForecastEngine.py:22(train)
1 0.000 0.000 134.802 134.802 SignalDecomposition.py:684(train)
1 0.000 0.000 115.905 115.905 SignalDecomposition.py:438(train_not_threaded)
1 0.000 0.000 115.905 115.905 SignalDecomposition.py:336(train)
4 0.001 0.000 115.895 28.974 SignalDecomposition.py:213(train)
497 0.084 0.000 83.471 0.168 TimeSeriesModel.py:138(forecastOneStepAhead)
449 0.095 0.000 62.498 0.139 Scikit_Models.py:108(transformDataset)
3 0.032 0.011 57.210 19.070 TimeSeriesModel.py:192(forecast)
4 0.015 0.004 56.378 14.094 SignalDecomposition_AR.py:255(estimate)
48 0.051 0.001 56.025 1.167 SignalDecomposition_AR.py:190(estimate_ar_models_for_cycle)
1 0.000 0.000 54.436 54.436 SignalDecomposition.py:744(standardPlots)
1 0.000 0.000 54.436 54.436 ForecastEngine.py:47(standardPlots)
1 0.000 0.000 54.435 54.435 TimeSeriesModel.py:311(standardPlots)
48 0.046 0.001 46.891 0.977 Scikit_Models.py:26(fit)
47566 0.221 0.000 43.306 0.001 frame.py:2922(__setitem__)
407 0.015 0.000 42.357 0.104 series.py:3719(apply)
47566 0.223 0.000 41.865 0.001 frame.py:2988(_set_item)
When visiting the Time.py code, it seems that the normalization of the date column is performed for each horizon.
A run on a long dataset with a long horizon shows that this behavior is very time consuming.
A profiling session is needed.
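If the profiling confirms this, the likely fix is to normalize the date column once and reuse it for every horizon. A minimal sketch of that restructuring, with hypothetical names and assuming the normalized values do not depend on the horizon:

```python
def forecast_all_horizons(df, time_col, horizons, forecast_one_step):
    # Hypothetical restructuring: the date column is normalized a single
    # time, outside the per-horizon loop, instead of once per horizon.
    df = df.copy()
    t = df[time_col]
    df["_time_norm"] = (t - t.min()) / (t.max() - t.min())
    forecasts = []
    for h in range(1, horizons + 1):
        # forecast_one_step is assumed to reuse df["_time_norm"]
        # rather than renormalizing the date column itself.
        forecasts.append(forecast_one_step(df, h))
    return forecasts
```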