facebook / prophet

Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.
https://facebook.github.io/prophet
MIT License

Interpretation of plot function visual #691

Closed dhanashreearole closed 5 years ago

dhanashreearole commented 6 years ago

Hello, I am using the Prophet model to forecast call volume for 30 days based on 3 years of data. Would anyone happen to know how to interpret the black dots on the y vs. ds plot? More importantly, the yearly vs. Day of year plot is way too advanced for me to understand. What does it depict? Please chime in and I will be glad to provide more details.

Thanks, D.A.

Attachment: Prophet Predictions Data Discovery Monthly.docx

Here is the script:

# -*- coding: utf-8 -*-
"""
Spyder Editor

This is a temporary script file.
"""

def greetings():
    """Print "Hello World" and return None"""
    ''' E-Business '''
    print("Prophet Data Model")

# main program starts here
greetings()

import configparser
import pandas as pd
import numpy as np
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation

cfg = configparser.RawConfigParser()
cfgp = r'C:/Users/darole/.spyder-py3/scripts/config.txt'
cfg.read(cfgp)
config_changepoints = cfg.get('master-config', 'config_changepoints')
config_changescale = cfg.get('master-config', 'config_changescale')
x_dataframe = cfg.get('master-config', 'x_dataframe').split(',')
our_dataframe = cfg.get('master-config', 'our_dataframe').split(',')

result_path = str(cfg.get('master-config', 'result_path'))
result_name = str(cfg.get('master-config', 'result_name'))

input_file_m = pd.read_csv(result_name)
input_file_master = pd.read_csv(result_name)

input_file_direct = pd.read_csv('C:/Users/darole/.spyder-py3/scripts/pdx1_battery_may_14.csv')

input_file_master['y'] = np.log(input_file_master['y'])  # natural logarithm (log base e)

input_file_master.head()

hd = pd.DataFrame({
    'holiday': 'hd',
    'ds': pd.to_datetime(our_dataframe),
    'lower_window': 0,
    'upper_window': 1,
})

# initialize Prophet
m = Prophet(holidays=hd,
            n_changepoints=int(config_changepoints),
            changepoint_prior_scale=float(config_changescale))
input_file_master['ds'] = pd.DatetimeIndex(input_file_master['ds'])  # Index data
m.fit(input_file_master)  # Fit the model

future = m.make_future_dataframe(periods=30)  # Create a data frame for the future dates
future.tail()  # spot check
forecast = m.predict(future)  # make a prediction

# This cross-validation can be useful for tuning parameters
input_file_cv = cross_validation(m, horizon='60 days')
input_file_cv.head()

# holidays
forecast[(forecast['hd']).abs() > 0][['ds', 'hd']][-10:]
forecast['y'] = pd.Series(input_file_master['y'])
forecast['callvolo'] = pd.Series(input_file_m['y'])
forecast['callvolf'] = pd.Series(np.exp(forecast['yhat']))

trend = m.plot(forecast)  # plots trend of yhat w.r.t. year
yearly = m.plot_components(forecast)  # plots percentage w.r.t. month

forecast.to_csv('C:/Users/darole/.spyder-py3/scripts/pdx1_battery_may_14_545_120_ourdataframe.csv')

bletham commented 6 years ago

The black dots on the first plot show the actual y values that you gave as the input data. They are the same as if you made a plot of df['ds'] vs. df['y'].

The "yearly" vs. "Day of year" plot shows one cycle of the yearly seasonality: how much the yearly cycle goes above or below the baseline trend at each point in the year. For example, the peak in October says that every October the time series values are ~25% higher just due to the effect of the yearly seasonality. (Note, however, that you have monthly data and so should carefully look at the section here about monthly data: https://facebook.github.io/prophet/docs/non-daily_data.html)
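One caveat when reading that component for the script above: the model was fit on np.log(y). If the seasonality is additive (assumed here), the plotted values are on the log scale, and a seasonal value s corresponds to multiplying the original series by exp(s). For small values the two readings nearly coincide. A minimal sketch of the conversion follows. The helper name and the sample values are illustrative only and are not taken from the data in this thread.

```python
import math

def log_effect_to_percent(effect):
    """Convert an additive effect on log(y) into a percent change in y."""
    return math.expm1(effect) * 100.0

# Hypothetical seasonal values read off a yearly component plot (log scale).
for s in (0.05, 0.25, -0.10):
    pct = log_effect_to_percent(s)
    print(f"log-scale effect {s:+.2f} -> {pct:+.1f}% on the original scale")
```

So a log-scale peak of 0.25 would mean values somewhat more than 25% above trend on the original scale.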

dhanashreearole commented 6 years ago

Thanks for your reply Ben.

Is the light blue line reflecting yhat_upper? Is the dark blue line reflecting yhat_lower?

image

After the first iteration of prediction, would you advise eliminating outliers, for example by capping the y values in the input dataset at yhat_upper or yhat_lower, depending on where they are plotted on the chart? The records with y values in the top area of the chart would be replaced by yhat_upper.

I think it is not accurate to entirely eliminate the outliers because it will make the dataset discrete and choppy.

bletham commented 6 years ago

The dark blue line is yhat. The light blue at the top is yhat_upper, and the light blue at the bottom is yhat_lower. You can remove outliers if they are affecting the forecast, but the outliers here seem safe to ignore so I wouldn't worry about them. See https://facebook.github.io/prophet/docs/outliers.html for examples of how outliers can mess up the forecast, and there's none of that here.
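For cases where outliers do hurt the forecast, the outliers page linked above suggests removing them by setting their y values to missing rather than capping them, since Prophet handles missing history. Here is a rough sketch of that idea, assuming a frame with y, yhat_lower and yhat_upper columns. The function name and the `margin` threshold are hypothetical, not part of Prophet.

```python
import pandas as pd

def mask_far_outliers(df, margin=1.0):
    """Return a copy of df with y set to missing where it lies far outside the interval.

    A point counts as "far outside" when it is more than `margin` interval
    widths beyond yhat_lower or yhat_upper. `margin` is a hypothetical knob.
    """
    width = df["yhat_upper"] - df["yhat_lower"]
    too_low = df["y"] < df["yhat_lower"] - margin * width
    too_high = df["y"] > df["yhat_upper"] + margin * width
    out = df.copy()
    out.loc[too_low | too_high, "y"] = float("nan")
    return out

# Toy values: the third point sits just outside the band, the fourth far below it.
toy = pd.DataFrame({
    "ds": pd.date_range("2018-01-01", periods=4, freq="MS"),
    "y": [5.0, 5.2, 5.7, 0.8],
    "yhat_lower": [4.8, 4.9, 5.0, 5.0],
    "yhat_upper": [5.4, 5.5, 5.6, 5.6],
})
cleaned = mask_far_outliers(toy)
print(cleaned[["ds", "y"]])
```

Only the fourth point is masked. The third is outside the band but not far enough to be treated as an outlier.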

To be clear, I would not consider points that lie outside of yhat_lower and yhat_upper to be outliers: that is an 80% interval so we expect 20% of the data to lie outside, and those points are not outliers. Outliers would be points that are well outside the prediction interval, like the two points below 1.
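The 80%/20% split can be checked with a small simulation. This is only a sketch: it assumes the residuals (y minus yhat) are standard normal, and it uses synthetic numbers rather than the data from this thread.

```python
import random
from statistics import NormalDist

random.seed(0)

# Half-width of a central 80% interval for standard normal noise (about 1.28).
half_width = NormalDist().inv_cdf(0.90)

# Simulated residuals y - yhat, assuming the noise really is standard normal.
residuals = [random.gauss(0.0, 1.0) for _ in range(10_000)]
outside = sum(abs(r) > half_width for r in residuals)
fraction_outside = outside / len(residuals)

print(f"half-width: {half_width:.3f}")
print(f"fraction outside the 80% interval: {fraction_outside:.3f}")
```

Roughly one point in five lands outside the band even though every simulated point comes from the same well-behaved distribution, which is why such points alone are not evidence of outliers.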

dhanashreearole commented 5 years ago

Sounds great. When a monthly frequency is used for predictions, is there a way to make the predicted dates fall at the start of each month, matching the pattern of the historical data, instead of at the end of the month?

image

As you can see, the input dataset has month-start dates, ending at index 223. From index 224 onwards, Prophet predicts yhat, but for month-end dates: August 31, September 30, and so on. The change compared with the daily prediction is that freq is set to M:

future = m.make_future_dataframe(periods=29, freq='M')  # Create a data frame for the future dates

# This cross-validation can be useful for tuning parameters
input_file_cv = cross_validation(m, horizon='29 days')

Prophet predicts nicely through the end of 2020 with excellent accuracy; however, I wish horizon accepted a string in months. I have stepped into forecaster.py and diagnostics.py to see if it can be adjusted, but no luck. It seems the horizon parameter doesn't accept a months string?

Can you please suggest a better way to handle it?

bletham commented 5 years ago

The make_future_dataframe uses pandas date_range to generate the dates, which supports these frequencies: https://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases

As you can see there, M is month-end frequency. If you want month-start, it is MS.
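A quick way to see the difference is to generate a few dates each way with pandas. This sketch uses the MS alias directly and the MonthEnd offset object for month-end dates (the counterpart of the M alias discussed here). The start date is arbitrary.

```python
import pandas as pd

start = "2019-08-01"

# "MS" is the month-start alias.
month_starts = pd.date_range(start=start, periods=3, freq="MS")

# Month-end dates, generated from the offset object rather than an alias string.
month_ends = pd.date_range(start=start, periods=3, freq=pd.offsets.MonthEnd())

print(list(month_starts.strftime("%Y-%m-%d")))
print(list(month_ends.strftime("%Y-%m-%d")))
```

In the script from this thread, that would mean passing freq='MS' to make_future_dataframe so the future dates line up with the month-start history.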

For cross validation, it uses pandas Timedelta which only supports days or smaller - see #650 for some more discussion on that. There is an open issue at #586 to better support monthly cross-validation, but in the meantime a horizon of 31 days would do the trick.
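One way to see why 31 days is enough: consecutive month-start dates are never more than 31 days apart, so a 31-day horizon from a month-start cutoff always reaches the next monthly observation. The sketch below checks that with the standard library. The helper names are illustrative, and it assumes the cutoff itself falls on a month-start date.

```python
from datetime import date

def month_starts(year):
    """First day of each month in `year`, plus January 1 of the next year."""
    return [date(year, m, 1) for m in range(1, 13)] + [date(year + 1, 1, 1)]

def max_gap_days(year):
    """Largest number of days between consecutive month starts."""
    starts = month_starts(year)
    return max((b - a).days for a, b in zip(starts, starts[1:]))

# A common year and a leap year give the same worst case.
print(max_gap_days(2018), max_gap_days(2020))
```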

dhanashreearole commented 5 years ago

That is perfect and very helpful. Thanks Ben so much!!!

While experimenting with monthly frequency and rolling aggregates, I realized that Prophet changes the forecast output.

Rolling1

image

The mean absolute percent error is excellent:

image

Rolling2

image

Mean absolute percent error:

image

Rolling3

image

image

Would you happen to know what makes prophet change the membership counts for the month of July as shown in Forecast_Rolling1, Forecast_Rolling2, Forecast_Rolling3:

image

Is it in any way an indicator of accuracy (possibly not)?

This is my very first attempt to precisely predict monthly memberships with MS frequency. I have taken into consideration holiday effect with start and end dates for each month only.

Thanks in advance for your valuable time, effort and energy!

bletham commented 5 years ago

I don't fully understand what these numbers are in the spreadsheet. This is what I think was done, but please correct if I misunderstand:

dhanashreearole commented 5 years ago

Yes, we are concurring on a few issues, thanks for your valuable insights Ben!

I am in the middle of understanding Tukey's Ladder of Powers: depending on the skewness, you change the transformation. A simple question: will Prophet work best if the data is as close as possible to normally distributed?

bletham commented 5 years ago

It will work best if the variance around the main estimate (yhat) is normally distributed, since that is assumed by the model. But if you just make a histogram of all of your data it could be very different from normal due to trends and seasonality.
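The distinction can be illustrated with a synthetic series. This is a sketch under stated assumptions: the trend, seasonality and noise below are made up, and the residuals are computed from the known components rather than from a fitted Prophet model. Excess kurtosis is used as a rough normality check (about 0 for normal data).

```python
import math
import random
from statistics import fmean

random.seed(1)

def excess_kurtosis(values):
    """Sample excess kurtosis: about 0 for normal data, about -1.2 for uniform."""
    mu = fmean(values)
    m2 = fmean((v - mu) ** 2 for v in values)
    m4 = fmean((v - mu) ** 4 for v in values)
    return m4 / m2 ** 2 - 3.0

n = 5000
# Known (not fitted) components of a synthetic daily series.
trend = [100.0 * t / n for t in range(n)]
seasonal = [5.0 * math.sin(2 * math.pi * t / 365.25) for t in range(n)]
noise = [random.gauss(0.0, 1.0) for _ in range(n)]

y = [a + b + c for a, b, c in zip(trend, seasonal, noise)]
residuals = [obs - a - b for obs, a, b in zip(y, trend, seasonal)]

raw_kurtosis = excess_kurtosis(y)
residual_kurtosis = excess_kurtosis(residuals)
print(f"raw y: {raw_kurtosis:.2f}, residuals: {residual_kurtosis:.2f}")
```

The raw values look far from normal because the trend spreads them out almost uniformly, while the residuals around the trend and seasonality are close to normal. That is the quantity worth checking before choosing a transformation.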