dhanashreearole commented 6 years ago

Hello, I am using the Prophet model to forecast call volume for 30 days based on 3 years of data. Would anyone happen to know how to interpret the black dots on the y vs. ds plot? More importantly, the yearly vs. day of year plot is way too advanced for me; what does it depict? Please chime in and I will be glad to provide more details.

Thanks,
D.A.

Attachment: Prophet Predictions Data Discovery Monthly.docx

Here is the script:

# -*- coding: utf-8 -*-
"""
Spyder Editor

This is a temporary script file.
"""

def greetings():
    """Print "Hello World" and return None"""
    ''' E-Business '''
    print("Prophet Data Model")

# main program starts here
greetings()

import configparser
import pandas as pd
import numpy as np
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation

cfg = configparser.RawConfigParser()
cfgp = r'C:/Users/darole/.spyder-py3/scripts/config.txt'
cfg.read(cfgp)
config_changepoints = cfg.get('master-config', 'config_changepoints')
config_changescale = cfg.get('master-config', 'config_changescale')
x_dataframe = cfg.get('master-config', 'x_dataframe').split(',')
our_dataframe = cfg.get('master-config', 'our_dataframe').split(',')

result_path = str(cfg.get('master-config', 'result_path'))
result_name = str(cfg.get('master-config', 'result_name'))

input_file_m = pd.read_csv(result_name)
input_file_master = pd.read_csv(result_name)

input_file_direct = pd.read_csv('C:/Users/darole/.spyder-py3/scripts/pdx1_battery_may_14.csv')

input_file_master['y']= np.log(input_file_master['y']) # natural logarithm log base e

input_file_master.head()

hd = pd.DataFrame({
    'holiday': 'hd',
    'ds': pd.to_datetime(our_dataframe),
    'lower_window': 0,
    'upper_window': 1,
})

# initialize Prophet
m = Prophet(holidays=hd, n_changepoints=int(config_changepoints), changepoint_prior_scale=float(config_changescale))
input_file_master['ds'] = pd.DatetimeIndex(input_file_master['ds'])  # Index Data
m.fit(input_file_master);  # Fit the model

future = m.make_future_dataframe(periods=30)  # Create a data frame for the future dates
future.tail()  # spot check
forecast = m.predict(future)  # make a prediction

# This crossvalidation- can be useful for tuning parameters
input_file_cv = cross_validation(m, horizon = '60 days')
input_file_cv.head()

# holidays
forecast[(forecast['hd']).abs() > 0][['ds', 'hd']][-10:]
forecast['y'] = pd.Series(input_file_master['y'])
forecast['callvolo'] = pd.Series(input_file_m['y'])
forecast['callvolf']= pd.Series( np.exp(forecast['yhat']))

trend = m.plot(forecast)  # plots trend of yhat w.r.t. year
yearly = m.plot_components(forecast)  # plots percentage w.r.t. month

forecast.to_csv('C:/Users/darole/.spyder-py3/scripts/pdx1_battery_may_14_545_120_ourdataframe.csv')

bletham commented 6 years ago

The black dots on the first plot show the actual y values that you gave as the input data. They are the same as if you made a plot of df['ds'] vs. df['y'].

The "yearly" vs. "Day of year" plot shows one cycle of the yearly seasonality: How much the yearly cycle goes above or below the baseline trend at each point in the year. For example, the peak in October says that every October the time series values are ~25% higher just due to the effect of the yearly seasonality. (Note, however that you have monthly data and so should carefully look at the section here about monthly data: https://facebook.github.io/prophet/docs/non-daily_data.html ).

dhanashreearole commented 6 years ago

Thanks for your reply Ben.

Is the light blue line reflecting yhat_upper? Is the dark blue line reflecting yhat_lower?

After the first iteration of prediction, would you advise eliminating outliers, for example by capping the y values in the input dataset to yhat_upper or yhat_lower depending on where they are plotted on the chart? The records with y values in the top area of the chart would be replaced by yhat_upper.

I think it is not accurate to eliminate the outliers entirely, because that would make the dataset discrete and choppy.

bletham commented 6 years ago

The dark blue line is yhat. The light blue at the top is yhat_upper, and the light blue at the bottom is yhat_lower. You can remove outliers if they are affecting the forecast, but the outliers here seem safe to ignore, so I wouldn't worry about them. See https://facebook.github.io/prophet/docs/outliers.html for examples of how outliers can mess up the forecast; there's none of that here.

To be clear, I would not consider points that lie outside of yhat_lower and yhat_upper to be outliers: that is an 80% interval so we expect 20% of the data to lie outside, and those points are not outliers. Outliers would be points that are well outside the prediction interval, like the two points below 1.
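The distinction drawn here, between points that merely fall outside the 80% interval and genuine outliers, can be sketched in a few lines. This is only an illustration and not Prophet's own logic: the function names, the `margin` threshold, and the sample numbers are all hypothetical, and the plain lists stand in for the y, yhat_lower, and yhat_upper columns.

```python
def fraction_outside(y, lower, upper):
    """Share of observations that fall outside [lower, upper]."""
    outside = sum(1 for v, lo, hi in zip(y, lower, upper) if v < lo or v > hi)
    return outside / len(y)


def far_outliers(y, lower, upper, margin=1.0):
    """Indices of points lying more than `margin` interval widths beyond a bound.

    `margin` is an arbitrary illustrative threshold, not a Prophet parameter.
    """
    flagged = []
    for i, (v, lo, hi) in enumerate(zip(y, lower, upper)):
        width = hi - lo
        if v > hi + margin * width or v < lo - margin * width:
            flagged.append(i)
    return flagged


# Made-up values standing in for the y, yhat_lower and yhat_upper columns.
y = [10, 12, 11, 30, 10, 10.5, 9.2, 11, 12, 10]
lower = [9.5] * len(y)
upper = [12.5] * len(y)

print(fraction_outside(y, lower, upper))  # share of points outside the interval
print(far_outliers(y, lower, upper))      # indices of points far beyond the interval
```

With an 80% interval, a share of roughly 0.2 outside the band is expected; only the points flagged by the second function would be candidates for removal.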

dhanashreearole commented 5 years ago

Sounds great. Is there a way to make sure that, when monthly frequency is used for predictions, the forecast dates start at the beginning of the month instead of the end of the month, to match the pattern of the historical data?

As you can see, the input dataset has start-of-month dates and ends at index 223. From 224 onwards, Prophet starts predicting yhat; however, it predicts it for August 31, September 30, and so on. The change compared to the daily prediction is that freq is set to M:

future = m.make_future_dataframe(periods=29, freq = 'M')  # Create a data frame for the future dates

# This crossvalidation- can be useful for tuning parameters
input_file_cv = cross_validation(m, horizon = '29 days')

Prophet nicely predicts through the end of 2020 with excellent accuracy; however, I wish horizon would accept a string expressed in months. I have stepped into forecaster.py and diagnostics.py to see if this can be adjusted, but no luck. Does the horizon parameter not accept a months string?

Can you please suggest a better way to handle it?

dhanashreearole commented 5 years ago

From: Ben Letham notifications@github.com Sent: Wednesday, October 17, 2018 8:07 PM To: facebook/prophet prophet@noreply.github.com Cc: Arole, Dhanashree (Kolter) DArole@national.aaa.com; State change state_change@noreply.github.com Subject: Re: [facebook/prophet] Interpretation of plot function visual (#691)


bletham commented 5 years ago

The make_future_dataframe uses pandas date_range to generate the dates, which supports these frequencies: https://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases

As you can see there, M is month-end frequency. If you want month-start, it is MS.

For cross validation, it uses pandas Timedelta which only supports days or smaller - see #650 for some more discussion on that. There is an open issue at #586 to better support monthly cross-validation, but in the meantime a horizon of 31 days would do the trick.
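The difference between month-end and month-start frequencies can be checked directly in pandas. The sketch below is an assumption-laden illustration: `future_dates` is a hypothetical helper that only mimics the idea of generating dates after the last observation (it is not Prophet's internal code), and the last history date is made up. Offset objects are used for month-end because newer pandas releases rename that alias.

```python
import pandas as pd

# Hypothetical last date of a month-start history.
last_date = pd.Timestamp("2018-07-01")
periods = 3


def future_dates(last_date, periods, freq):
    """Generate `periods` dates strictly after `last_date` at the given frequency."""
    dates = pd.date_range(start=last_date, periods=periods + 1, freq=freq)
    return dates[dates > last_date][:periods]


# Month-end offset (the "M" alias discussed in the thread) vs. month-start ("MS").
month_end = future_dates(last_date, periods, pd.offsets.MonthEnd())
month_start = future_dates(last_date, periods, "MS")

print(list(month_end.strftime("%Y-%m-%d")))
print(list(month_start.strftime("%Y-%m-%d")))

# A fixed-length horizon expressed in days, long enough to span one monthly step.
horizon = pd.Timedelta("31 days")
print(horizon.days)
```

With month-start history, only the `"MS"` variant continues the existing date pattern; the month-end variant lands on the last day of each month instead.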

dhanashreearole commented 5 years ago

That is perfect and very helpful. Thanks Ben so much!!!

While experimenting with monthly frequency and rolling aggregates, I realized that Prophet changes the forecast output.

Rolling1

The mean absolute percent error is excellent:

Rolling2

Mean Absolute Percent Error

Rolling 3

Would you happen to know what makes Prophet change the membership counts for the month of July, as shown in Forecast_Rolling1, Forecast_Rolling2, and Forecast_Rolling3?

Is it in any way an indicator of accuracy (possibly not)?

This is my very first attempt to precisely predict monthly memberships with MS frequency. I have accounted for the holiday effect using only the start and end dates of each month.

Thanks in advance for your valuable time, effort and energy!

bletham commented 5 years ago

I don't fully understand what these numbers are in the spreadsheet. This is what I think was done, but please correct if I misunderstand:

The model was fit to three different time intervals: (1) Aug 84 through Jul 18, (2) Sep 84 through Aug 18, and (3) Oct 84 through Sep 18. For each of these, forecasts were made for 3 years.
How did you estimate MAPE? Was it for (1) the MAPE of the forecasts in Aug 18, Sep 18, and Oct 18 (which were not included in the fit data), for (2) that of Sep 18 and Oct 18, and for (3) just Oct 18? Or are you taking the MAPE of all of the dates in the history too? If you include the months in the history (Jul 18 and earlier), then it isn't really a fair evaluation (it will make Prophet look a lot better than it actually is!) because the model already knows what the values in the history were and has an easy time getting the right answer.
Finally, what values are being shown in the spreadsheet here? Is this the Prophet forecast for the dates in the history? If so, then the reason it changes is because Prophet is a regression model with a noise term, which means it does not pass through the historical data perfectly. Just like a linear regression will not pass through the points perfectly (https://en.wikipedia.org/wiki/Linear_regression#/media/File:Linear_regression.svg), Prophet will allow there to be small fluctuations due to noise. Each rolling forecast has a different set of training data, and so a slightly different model and so the forecast at each historical point may differ slightly (like a slightly different slope in a linear regression). If you want to visualize this, use m.plot(forecast) and you will see that the historical data (black dots) do not perfectly match the prophet prediction (blue line).
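Both points above (scoring MAPE only on dates that were held out, and fitted values shifting when the training window rolls) can be illustrated with an ordinary least-squares line standing in for Prophet. Everything in this sketch is hypothetical: the data are made up, and `mape`, `fit_line`, and `predict` are illustrative helpers rather than Prophet functions.

```python
def mape(actual, predicted):
    """Mean absolute percent error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Evaluate a (slope, intercept) pair at x."""
    slope, intercept = model
    return slope * x + intercept


# Made-up monthly values with some noise around an upward trend.
xs = list(range(12))
ys = [100, 104, 103, 109, 112, 111, 118, 119, 125, 124, 130, 133]

# Fit on the first nine points and score only the three held-out points.
model_a = fit_line(xs[:9], ys[:9])
held_out_mape = mape(ys[9:], [predict(model_a, x) for x in xs[9:]])

# A second fit on a window shifted by one point gives a slightly different
# line, so the fitted value at the same historical x changes.
model_b = fit_line(xs[1:10], ys[1:10])
fit_a = predict(model_a, 5)
fit_b = predict(model_b, 5)

print(held_out_mape)
print(fit_a, fit_b)
```

Neither fitted value equals the observed value at that point, which mirrors the black dots not sitting exactly on the blue line in `m.plot(forecast)`.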

dhanashreearole commented 5 years ago

Yes, we concur on a few issues; thanks for your valuable insights, Ben!

I am in the middle of understanding Tukey's ladder of powers, where the transformation is chosen depending on the skewness. A simple question: will Prophet work best if the data is as close as possible to being normally distributed?

bletham commented 5 years ago

It will work best if the variance around the main estimate (yhat) is normally distributed, since that is assumed by the model. But if you just make a histogram of all of your data it could be very different from normal due to trends and seasonality.
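A small simulation shows why a histogram of the raw data can look far from normal even when the noise around the fit is normal. This is a sketch on made-up data: the linear trend, the noise level, and the use of excess kurtosis as a rough normality check are all assumptions for illustration, and the known trend stands in for yhat.

```python
import random


def excess_kurtosis(values):
    """Sample excess kurtosis: about 0 for normal data, about -1.2 for uniform data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 2000
trend = [0.05 * t for t in range(n)]            # made-up linear trend
y = [tr + rng.gauss(0.0, 1.0) for tr in trend]  # trend plus normal noise

# The true trend is known here, so it stands in for yhat.
residuals = [obs - tr for obs, tr in zip(y, trend)]

print(excess_kurtosis(y))          # raw values, flattened by the trend
print(excess_kurtosis(residuals))  # noise around the fit
```

The raw values are spread almost uniformly along the trend, while the residuals are the part the model assumes to be normal, so any normality check is better applied to y minus yhat than to y itself.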

facebook / prophet

Interpretation of plot function visual #691
