Projections Wrongfully Linear

bjakobson commented 1 year ago

I am training a basic model that is comparing weight lifted vs. time.

As you will notice, the timeline is pretty limited, but this will likely be the case in most of my uses. The visual (shown below) is linear, which is obviously incorrect.

I am not too advanced in Python or forecasting, but visually, something looks wrong. Here is my full code, which includes data:

import pandas as pd
import datetime
import matplotlib.pyplot as plt
import pyaf.ForecastEngine as autof

temp_data = [

    {
        "weight" : 185.0,
        "date" : "2021-11-19"
    },
    {
        "weight" : 165.0,
        "date" : "2021-11-22"
    },
    {
        "weight" : 145.0,
        "date" : "2021-11-28"
    },
    {
        "weight" : 175.0,
        "date" : "2021-12-01"
    },

    {
        "weight" : 145.0,
        "date" : "2021-12-08"
    },
    {
        "weight" : 150.0,
        "date" : "2021-12-12"
    },
    {
        "weight" : 190.0,
        "date" : "2021-12-18"
    },
    {
        "weight" : 200.0,
        "date" : "2021-12-24"
    },
    {
        "weight" : 180.0,
        "date" : "2021-12-27"
    },
    {
        "weight" : 175.0,
        "date" : "2022-01-01"
    },
    {
        "weight" : 160.0,
        "date" : "2022-01-05"
    },
]

if __name__ == '__main__':
    weight_dataframe = pd.DataFrame(temp_data)
    print(weight_dataframe)
    # Parse the ISO date strings into datetime values.
    weight_dataframe['date'] = pd.to_datetime(weight_dataframe['date'], format="%Y-%m-%d")

    # Train on the 'weight' signal over 'date' with a horizon of 50 steps.
    lEngine = autof.cForecastEngine()
    lEngine.train(weight_dataframe, 'date', 'weight', 50)
    weight_forecast_dataframe = lEngine.forecast(weight_dataframe, 50)
    lEngine.getModelInfo() # => relative error 7% (MAPE)

    # Plot the history together with the forecast and its bounds.
    weight_forecast_dataframe.plot.line(
        'date',
        ['weight', 'weight_Forecast_Upper_Bound', 'weight_Forecast_Quantile_50', 'weight_Forecast_Lower_Bound'],
        grid=True, figsize=(12, 8), marker='o',
        color=['#A1A5FF', 'green', 'blue', 'red'],
        title='Bench Press Projections')
    plt.legend(['Previous Weight', 'Max Projected Weight', 'Median Projected Weight', 'Min Projected Weight'])
    plt.ylabel('Weight')
    plt.xlabel('Date')
    plt.show()

Here is a visual output:

Here is my system info as requested:

/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('Cython_version', 'NOT_INSTALLED')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('dill_version', '0.3.6')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('keras_version', 'NOT_INSTALLED')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('lightgbm_version', 'NOT_INSTALLED')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('matplotlib_version', '3.6.2')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('numpy_version', '1.23.5')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('pandas_version', '1.5.2')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('pathos_version', 'NOT_INSTALLED')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('pip_version', '22.3')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('pyaf_version', '4.0')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('pydot_version', '1.4.2')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('python_implementation', 'CPython')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('python_version', '3.11.0')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('scipy_version', '1.9.3')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('setuptools_version', '65.5.0')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('sklearn_version', '1.1.3')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('skorch_version', 'NOT_INSTALLED')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('sqlalchemy_version', '1.4.44')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('system_platform', 'macOS-12.5-arm64-arm-64bit')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('system_processor', 'arm')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('system_uname', uname_result(system='Darwin', node='MacBook-Pro.local', release='21.6.0', version='Darwin Kernel Version 21.6.0: Sat Jun 18 17:07:22 PDT 2022; root:xnu-8020.140.41~1/RELEASE_ARM64_T6000', machine='arm64'))
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('torch_version', 'NOT_INSTALLED')
PYAF_SYSTEM_DEPENDENT_VERSION_INFO ('xgboost_version', '1.7.1')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('COLORTERM', 'truecolor')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('COMMAND_MODE', 'unix2003')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('GIT_ASKPASS', '/private/var/folders/_v/tdvwxstj3ljd7x9hdh16s8kc0000gn/T/AppTranslocation/98905D2F-13A3-4069-B8FB-27DEDF170F99/d/Visual Studio Code.app/Contents/Resources/app/extensions/git/dist/askpass.sh')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('HOME', '/Users/brandonjakobson')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('KMP_DUPLICATE_LIB_OK', 'True')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('KMP_INIT_AT_FORK', 'FALSE')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('LANG', 'en_US.UTF-8')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('LOGNAME', 'brandonjakobson')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('MallocNanoZone', '0')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('OLDPWD', '/Users/brandonjakobson/Downloads/WorkoutProjections')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('ORIGINAL_XDG_CURRENT_DESKTOP', 'undefined')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('PATH', '/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Applications/VMware')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('PWD', '/Users/brandonjakobson/Downloads/WorkoutProjections')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('SHELL', '/bin/zsh')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('SHLVL', '1')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('SSH_AUTH_SOCK', '/private/tmp/com.apple.launchd.vZZcYkY6Qx/Listeners')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('TERM', 'xterm-256color')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('TERM_PROGRAM', 'vscode')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('TERM_PROGRAM_VERSION', '1.73.0')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('TMPDIR', '/var/folders/_v/tdvwxstj3ljd7x9hdh16s8kc0000gn/T/')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('USER', 'brandonjakobson')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('USER_ZDOTDIR', '/Users/brandonjakobson')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('VSCODE_GIT_ASKPASS_EXTRA_ARGS', '--ms-enable-electron-run-as-node')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('VSCODE_GIT_ASKPASS_MAIN', '/private/var/folders/_v/tdvwxstj3ljd7x9hdh16s8kc0000gn/T/AppTranslocation/98905D2F-13A3-4069-B8FB-27DEDF170F99/d/Visual Studio Code.app/Contents/Resources/app/extensions/git/dist/askpass-main.js')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('VSCODE_GIT_ASKPASS_NODE', '/private/var/folders/_v/tdvwxstj3ljd7x9hdh16s8kc0000gn/T/AppTranslocation/98905D2F-13A3-4069-B8FB-27DEDF170F99/d/Visual Studio Code.app/Contents/Frameworks/Code Helper.app/Contents/MacOS/Code Helper')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('VSCODE_GIT_IPC_HANDLE', '/var/folders/_v/tdvwxstj3ljd7x9hdh16s8kc0000gn/T/vscode-git-810feb144a.sock')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('VSCODE_INJECTION', '1')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('XPC_FLAGS', '0x0')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('XPC_SERVICE_NAME', '0')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('ZDOTDIR', '/Users/brandonjakobson')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('_', '/usr/local/bin/python3')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('CFBundleIdentifier', 'com.microsoft.VSCode')
PYAF_SYSTEM_DEPENDENT_ENVIRONMENT_VARIABLE ('CF_USER_TEXT_ENCODING', '0x1F5:0x0:0x0')

antoinecarme commented 1 year ago

Hi @bjakobson

First, thanks a lot for using PyAF. The script and the environment variables are perfect and help in reproducing the problem.

I will answer your questions in two different comments. The first for functional aspects, the second for some technical points.

antoinecarme commented 1 year ago

Functional aspects.

Here, you are trying to predict weights, given the 10 weekly values of Nov and Dec in 2021. The horizon is 50, which means that you want to predict the values for the next 50 weeks.

As you mentioned, the use case you have has a limited timeline (not enough data). PyAF uses past values to predict future values. There is no miracle: we will not be able to provide a meaningful forecast of summer (July) values when we only have Nov and Dec values.

PyAF does its best by providing the mean of the 10 available values as a constant (even more linear ;) forecast for the 50 coming weeks.

There is simply no way to get something acceptable in this case. Solution: increase the amount of data. This may imply waiting for the data-generating process to produce more data.
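To make the "mean as a constant forecast" point concrete, here is a minimal sketch in plain Python. It is not PyAF's internal code; `constant_mean_forecast` is a hypothetical helper, and it assumes the mean is taken over all the posted weights (PyAF may use only the training part).

```python
from statistics import mean

# Weights copied from the script posted at the top of this issue.
weights = [185.0, 165.0, 145.0, 175.0, 145.0, 150.0,
           190.0, 200.0, 180.0, 175.0, 160.0]


def constant_mean_forecast(history, horizon):
    """Hypothetical baseline: repeat the historical mean for every future step."""
    level = mean(history)
    return [level] * horizon


forecast = constant_mean_forecast(weights, horizon=50)
# Every one of the 50 forecast values is identical, hence the flat line in the plot.
print(forecast[:3])
```

With so few points, a flat line at the average level is often the model with the lowest error, which is why the plot looks "linear".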

antoinecarme commented 1 year ago

Technical aspects

Usually, a forecast is never wrong. It has an error, as we are estimating the values of an unknown, future phenomenon. The quality of the forecast is measured using the error on a part of the dataset (10 points here).

Saying that a forecast is "obviously wrong" or "visually wrong" depends on the problem in question. I would appreciate it if you could elaborate on this.

The main technical limit here is that 10 points are not enough to reliably compute a statistic (a prediction, a mean, etc.).

bjakobson commented 1 year ago

Hi @antoinecarme,

Appreciate the response! I totally understand that the lack of data will negatively affect the prediction of 50 values -- would lowering the number of predicted values be a good solution (other than feeding in more data, which is not possible yet; the dataset will grow over time), or would a different algorithm make more sense? I'd like to point out that our data will generally be linear-ish, as shown in the data. This makes me believe linear regression is doable, but I am curious to know what you think will give decently accurate results.

antoinecarme commented 1 year ago

Hi @bjakobson

Once you have a "decent dataset", what you said will be OK.

PyAF does not make any assumptions. It tests different models, including linear regression, and outputs the best model, the one with the lowest error (MAPE).
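The idea of "try several models and keep the one with the lowest MAPE" can be sketched in a few lines. This is only an illustration under simple assumptions, not PyAF's actual procedure: the two candidate models (`fit_mean`, `fit_linear`), the hold-out split, and `select_model` are all hypothetical names chosen for this example.

```python
from statistics import mean


def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction (0.07 means 7%)."""
    return mean(abs(a - p) / abs(a) for a, p in zip(actual, predicted))


def fit_mean(train):
    """Candidate 1: constant forecast equal to the training mean."""
    level = mean(train)
    return lambda t: level


def fit_linear(train):
    """Candidate 2: least-squares straight line over the time index 0..n-1."""
    ts = range(len(train))
    t_bar, y_bar = mean(ts), mean(train)
    slope = (sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, train))
             / sum((t - t_bar) ** 2 for t in ts))
    intercept = y_bar - slope * t_bar
    return lambda t: intercept + slope * t


def select_model(series, n_test):
    """Fit each candidate on the head of the series, score it on the last n_test points."""
    train, test = series[:-n_test], series[-n_test:]
    scores = {}
    for name, fit in {"mean": fit_mean, "linear": fit_linear}.items():
        model = fit(train)
        predictions = [model(len(train) + i) for i in range(len(test))]
        scores[name] = mape(test, predictions)
    best = min(scores, key=scores.get)
    return best, scores


# A steadily increasing series: the linear candidate should win here.
best, scores = select_model([100 + 5 * i for i in range(10)], n_test=3)
print(best, scores)
```

On a short, noisy series the two candidates can score very close to each other, so which one "wins" can flip with a single extra data point.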

bjakobson commented 1 year ago

@antoinecarme, and what qualifies as "enough data"? Would, say, 31 days give a decent enough projection? This doesn't need to be precise, but the general outcome should at least look good -- if there was a slow increase (say 100 for 5 days, 105 for the next 5..), the outcome should look like a gradual increase, not linear or descending. Thanks again!

antoinecarme commented 1 year ago

@bjakobson

You can always increase your dataset artificially (different increasing sizes) and give your feedback.

antoinecarme commented 1 year ago

A classical rule of thumb is to use at least 30 points to compute a statistical indicator (such as the mean). 100 points should be enough.
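One way to see why more points help is the standard error of the mean, which shrinks like 1/sqrt(n). The sketch below is only an illustration of that formula on the weights posted in this issue; it assumes the spread of the data stays the same as more sessions are logged, which may not hold in practice.

```python
from math import sqrt
from statistics import stdev

# Weights copied from the script posted at the top of this issue.
weights = [185.0, 165.0, 145.0, 175.0, 145.0, 150.0,
           190.0, 200.0, 180.0, 175.0, 160.0]


def standard_error_of_mean(sample):
    """Sample standard deviation divided by sqrt(n)."""
    return stdev(sample) / sqrt(len(sample))


sem_now = standard_error_of_mean(weights)

# Assumption: the spread stays the same while the number of points grows.
spread = stdev(weights)
for n in (11, 30, 100):
    print(n, round(spread / sqrt(n), 2))
```

With the 11 posted values the standard error of the mean is roughly 5.6 (in the same unit as the weights), and it only drops noticeably once the dataset grows several times larger.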

bjakobson commented 1 year ago

@antoinecarme, could you please elaborate on what you mean by "different increasing sizes", and "artificially increase your dataset"? I will definitely keep 30 points in mind - the problem is, I am looking to predict the weight a user could bench press, and the data I am getting comes from them actually logging each workout. I am fine with waiting 1 month to display the results, but I cannot justify this model if a user has to wait 100 bench press sessions to see their results, if that makes sense. I am more than open to any feedback on this!

antoinecarme commented 1 year ago

different increasing sizes = 10, 20, 30, ..., 100

bjakobson commented 1 year ago

Is there a place where I can see how that is implemented? Is it just manually changing the data, or is there code needed?

bjakobson commented 1 year ago

I apologize for my constant questions. As you can likely tell, this is not my strong suit :)

antoinecarme commented 1 year ago

Forecasting problems are not always easy, or even feasible. There is a kind of tradeoff between the available dataset size and the horizon that is usable. It is a functional aspect of your problem, and I cannot help with that. Sorry.

antoinecarme commented 1 year ago

You have to change your dataset manually in your Python code.

bjakobson commented 1 year ago

Right, that makes sense. I am just confused about why the line here is linear. I get that there is not a lot of data, but I would assume ~2 months of data would be enough to at least form a trend.

bjakobson commented 1 year ago

And shortening the projections from 50 to 20 doesn't help.

antoinecarme commented 1 year ago

Your dataset is still too short. I will not comment again on that.

Try copy-pasting the same data (weight_dataframe) 20 times and update the time column.

bjakobson commented 1 year ago

So that generally worked. Is that a real solution, pasting the same values 20 times?

bjakobson commented 1 year ago

This is without updating the time column, too.

antoinecarme commented 1 year ago

No, artificial data is bad. It is just one way for you to see that PyAF will generate better models if you increase the size of your dataset.

DO NOT USE FAKE DATA IN PRODUCTION.
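For experimentation only, the "copy the data and update the time column" test could be sketched as below. `replicate_with_new_dates` and its `step_days` parameter are hypothetical names for this example, and the regular 4-day spacing is an assumption; the three sample records are taken from the script at the top of this issue.

```python
from datetime import date, timedelta

# A few records in the same shape as temp_data from the original script.
sample_records = [
    {"weight": 185.0, "date": "2021-11-19"},
    {"weight": 165.0, "date": "2021-11-22"},
    {"weight": 145.0, "date": "2021-11-28"},
]


def replicate_with_new_dates(records, copies, step_days=4):
    """Repeat the observed weights `copies` times on a fresh, regular date grid.

    The result is artificial data: useful to see how a model reacts to a
    longer series, never as a substitute for real observations.
    """
    start = date.fromisoformat(records[0]["date"])
    weights = [record["weight"] for record in records] * copies
    return [
        {"weight": weight,
         "date": (start + timedelta(days=i * step_days)).isoformat()}
        for i, weight in enumerate(weights)
    ]


bigger = replicate_with_new_dates(sample_records, copies=20)
print(len(bigger), bigger[0], bigger[-1])
```

The resulting list has the same shape as `temp_data`, so it can be turned into a DataFrame and passed to `train` in the same way as in the original script. Slicing it (for example `bigger[:30]`) gives the "different increasing sizes" to compare.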

bjakobson commented 1 year ago

Right, so taking a user's data and cloning it 3 times is not a good strategy?

antoinecarme commented 1 year ago

Not only is it not a good strategy, it is also not ethically correct.

Not enough data is a real problem everyone has. It is normal.

antoinecarme / pyaf

Projections Wrongfully Linear #218