Trending Data Issue and Small Anomalous Period affecting Moving Average

pcfierro commented 4 years ago

[Figures: forecast_area__prediction_intervals_output, forecast_angle__prediction_intervals_output]

I am forecasting fire ellipses, using some angles and area. I find your system PyAF very interesting and well done in that it is compartmentalized and very modular. I'm new to Python but not to math, statistics and forecasting.

It seems to me that in the angle case the prediction intervals follow no trend, yet one anomalous period at the very beginning caused larger confidence intervals than needed.

Secondly, the area does follow a trend, but only the upper confidence bound is close to the expected trend line.

I have used AR, ARIMA, Holt, and Holt-Winters from scikit-learn with somewhat better results. Is there a way to filter anomalous data, or to use exponential smoothing with weights that emphasise the more recent data? It seems your approach may require some adjusting or tuning on my part to get the forecast I am looking for.

Thanks
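As a rough illustration of the "weights that emphasise the more recent data" idea in the question above, here is a minimal sketch using the pandas exponentially weighted mean. It is a preprocessing step applied before any forecasting tool, not a PyAF option; the toy series and `alpha=0.5` are assumptions chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Toy signal: flat, then a jump at the last observation.
signal = pd.Series([10.0, 10.0, 10.0, 20.0])

# adjust=False gives the recursive form:
#   y[0] = x[0]
#   y[t] = alpha * x[t] + (1 - alpha) * y[t-1]
# so each older observation is discounted by a further factor (1 - alpha).
smoothed = signal.ewm(alpha=0.5, adjust=False).mean()
print(smoothed.tolist())
```

A larger `alpha` tracks recent values more closely, while a smaller one smooths more heavily and reacts more slowly to a trend.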

While issue reports are always welcome, and you are free to use any form to submit these, the following points are to be considered for an easier processing and more productivity:

1. The issue must be a bug or a feature request.
2. A description is needed as source code and/or a link to a dataset for which the problem arises (please simplify the code, anonymize the dataset etc).
3. Information on the different software versions used (pyaf, numpy, pandas, scikit-learn etc). The output of the following script should be enough: https://github.com/antoinecarme/pyaf/blob/master/tests/basic_checks/platform_info.py

pcfierro commented 4 years ago

sent data too:

ffe_area_angle2.zip

antoinecarme commented 4 years ago

@pcfierro

Thanks for using PyAF.

PyAF is an automatic/modular process that can be used to get some mechanical form of forecast.

It cannot be adjusted to match what someone (even myself) might prefer according to some non-computational quality measure. It is simply neither expected nor designed to do that.

I had a look at the zip file, which includes two datasets. I don't have access to the details of the training process (horizon etc.). Can you please provide the Python script that you use to perform the training process?

Of course, you can always perform some kind of preprocessing on the signal (remove the two lines with outliers etc) before using PyAF. The more regular the signal, the better.
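One possible form of the preprocessing suggested above is sketched below: it drops points that sit far from a centred rolling median before the data are passed to training. The helper name `drop_spikes`, the 5-sample window and the threshold of 5 median absolute deviations are illustrative assumptions, not PyAF features.

```python
import pandas as pd

def drop_spikes(df, signal, window=5, n_mads=5.0):
    """Drop rows whose value is far from a centred rolling median (hypothetical helper)."""
    med = df[signal].rolling(window, center=True, min_periods=1).median()
    abs_dev = (df[signal] - med).abs()
    # Median absolute deviation over the whole series, used as a robust scale.
    mad = abs_dev.median()
    if mad == 0:
        # Degenerate case: at least half the points sit exactly on the rolling median.
        return df.copy()
    return df[abs_dev <= n_mads * mad].reset_index(drop=True)

# Toy data shaped like the angle signal: a roughly flat level with one spike.
demo = pd.DataFrame({"angle": [70.0, 72.0, 71.0, 172.0, 69.0, 73.0, 70.0, 71.0]})
clean = drop_spikes(demo, "angle")
print(clean["angle"].tolist())
```

A filter like this only works on isolated spikes. If the same outlier row is repeated several times in a row, the rolling median follows it, so duplicates would need to be removed first.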

pcfierro commented 4 years ago

Sure, I can paste the code in this reply. The area is most concerning when it is in TREND, while the angle may have some wind-based seasonality from minute to minute but is basically holding at a moving average.

ANGLE measure: AR with seasonality (ExponentialSmoothing function in statsmodels tsa).

AREA measure: trend without seasonality.

In both cases I need a confidence interval.

import pandas as pd
import pyaf.ForecastEngine as autof

# Raw strings keep the Windows backslashes from being read as escape sequences.
csvAreaFile2 = r'C:\Users\Owner\OneDrive\PROJECTS\Paradise\ShapesDetection\SofVideo\Forecast\ffe_area2.csv'
csvAngleFile2 = r'C:\Users\Owner\OneDrive\PROJECTS\Paradise\ShapesDetection\SofVideo\Forecast\ffe_angle2.csv'

fcstWin = 60 # 3 Hours

# AREA signal
df = pd.read_csv(csvAreaFile2, sep=r',', engine='python', skiprows=0)
df.columns = ['fDateO', 'area']
df['fDate'] = range(df.shape[0])  # integer row counter used as the time axis
print(df.head())
lDateVar = 'fDate'
lSignalVar = 'area'
lEngine = autof.cForecastEngine()
lEngine.train(iInputDS=df, iTime='fDate', iSignal='area', iHorizon=fcstWin)
lEngine.getModelInfo()  # => relative error 7% (MAPE)
df_forecast = lEngine.forecast(iInputDS=df, iHorizon=fcstWin)
print(df_forecast.columns)
print(df_forecast['fDate'].tail(7).values)
print(df_forecast['area_Forecast'].tail(7).values)
print(lEngine.mSignalDecomposition.mTrPerfDetails.head())
lEngine.mSignalDecomposition.mBestModel.mTimeInfo.mResolution
lEngine.standardPlots("forecastarea")

# ANGLE signal
df = pd.read_csv(csvAngleFile2, sep=r',', engine='python', skiprows=0)
df.columns = ['fDateO', 'angle']
df['fDate'] = range(df.shape[0])  # integer row counter used as the time axis
print(df.head())
lDateVar = 'fDate'
lSignalVar = 'angle'
lEngine = autof.cForecastEngine()
lEngine.train(iInputDS=df, iTime='fDate', iSignal='angle', iHorizon=fcstWin)
lEngine.getModelInfo()  # => relative error 7% (MAPE)
df_forecast = lEngine.forecast(iInputDS=df, iHorizon=fcstWin)
print(df_forecast.columns)
print(df_forecast['fDate'].tail(7).values)
print(df_forecast['angle_Forecast'].tail(7).values)
print(lEngine.mSignalDecomposition.mTrPerfDetails.head())
lEngine.mSignalDecomposition.mBestModel.mTimeInfo.mResolution
lEngine.standardPlots("forecastangle")

antoinecarme commented 4 years ago

Thanks a lot for the scripts. I am playing with those. Some remarks :

1. The data are strange. They start at 06:15:00 and end at 09:31:00, a total of about three hours in each of the two signals. There is no way to predict the next 3 hours (the horizon) with so little data! In this case PyAF generates a Lag1 trend model, which says: my best prediction is the last observed value (=> a straight horizontal line on the forecast).


INFO:pyaf.std:BEST_DECOMPOSITION  '_area_Lag1Trend_residue_zeroCycle_residue_NoAR' [Lag1Trend + NoCycle + NoAR]
INFO:pyaf.std:TREND_DETAIL '_area_Lag1Trend' [Lag1Trend]
INFO:pyaf.std:CYCLE_DETAIL '_area_Lag1Trend_residue_zeroCycle' [NoCycle]
INFO:pyaf.std:AUTOREG_DETAIL '_area_Lag1Trend_residue_zeroCycle_residue_NoAR' [NoAR]
INFO:pyaf.std:MODEL_MAPE MAPE_Fit=0.0107 MAPE_Forecast=0.0057 MAPE_Test=0.0033
INFO:pyaf.std:MODEL_SMAPE SMAPE_Fit=0.012 SMAPE_Forecast=0.0057 SMAPE_Test=0.0031
INFO:pyaf.std:MODEL_MASE MASE_Fit=0.9986 MASE_Forecast=0.9946 MASE_Test=0.9833
INFO:pyaf.std:MODEL_L1 L1_Fit=129.77122995594152 L1_Forecast=228.7123323566562 L1_Test=151.35940099621064
INFO:pyaf.std:MODEL_L2 L2_Fit=1026.0115073384095 L2_Forecast=723.6191137588673 L2_Test=721.5447573136909


2. The data are not really a time series. There are a lot of consecutive identical lines like these for the same timestamp. This probably needs some cleanup (remove duplicates).

09:27:00,62.293643951416016
09:27:00,62.293643951416016
09:27:00,62.293643951416016
09:27:00,62.293643951416016
09:27:00,62.293643951416016

3. An outlier occurs at 07:46:00. It can be removed.

07:45:00,72.05471801757812
07:46:00,172.352783203125
07:46:00,172.352783203125
07:46:00,172.352783203125
07:46:00,172.352783203125
07:46:00,172.352783203125
07:47:00,50.6080436706543

4. Some feedback on the data? Where do these data come from?

Can you please clean up the data, set the horizon to a reasonable value (minutes, not hours) and give me your feedback (new data and scripts welcome ;) ?
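To make point 1 concrete, here is a minimal sketch of a "last observed value" forecast, which is how the Lag1 trend is described above. It is not PyAF's implementation; the function names and the toy numbers are assumptions. It shows why such a forecast is a horizontal line and why its error grows with the horizon when the signal trends.

```python
def last_value_forecast(history, horizon):
    """Repeat the last observed value `horizon` times (naive forecast)."""
    if not history:
        raise ValueError("history must contain at least one observation")
    return [history[-1]] * horizon

def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.05 means 5%)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Toy trending series: the naive forecast stays flat while the signal keeps rising.
history = [100.0, 104.0, 108.0, 112.0]
future = [116.0, 120.0, 124.0]

forecast = last_value_forecast(history, horizon=len(future))
errors = [abs(a - p) for a, p in zip(future, forecast)]
print(forecast)
print(errors)
print(mape(future, forecast))
```

On this toy series the absolute error increases at every step of the horizon, which is the same effect a long horizon has on a short, trending signal.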

pcfierro commented 4 years ago

Thanks, that is better. With those suggestions, it now seems the pandas DataFrame is not treating the ASCII text as numeric time. I want to forecast at the minute level. I aggregated the data, which were approximately 12 seconds apart, to per-minute averages to simplify, and that seemed to do better. Exponential smoothing and/or anomaly removal should help, but the area of the fire ellipse is trying to balance a moving average with a very important TREND. I have other methods, but this has been a great exercise; any other ideas are welcome. I think I saw your example of converting date strings to dates properly. I'm assuming I can do something similar with a datetime function to convert properly to time.

The new zip files are aggregated, but the time column may still not be converted to time in my code yet. ffe_area2.zip
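A minimal sketch of the conversion and per-minute aggregation described above, using pandas. The inline sample rows are made up, and because the files only carry clock times, the calendar date attached here (2000-01-01) is an arbitrary placeholder; the column names follow the script posted earlier in the thread.

```python
import io
import pandas as pd

# Made-up rows in the same "time,value" layout as the CSV files.
raw = io.StringIO(
    "fDateO,area\n"
    "07:45:00,10.0\n"
    "07:45:12,12.0\n"
    "07:46:00,20.0\n"
    "07:46:30,22.0\n"
    "07:46:48,24.0\n"
)
df = pd.read_csv(raw)

# Attach a placeholder date so the clock times become real timestamps.
df["fDate"] = pd.to_datetime("2000-01-01 " + df["fDateO"], format="%Y-%m-%d %H:%M:%S")

# Average all readings that fall within the same minute.
per_minute = (
    df.set_index("fDate")["area"]
      .resample("1min")
      .mean()
      .reset_index()
)
print(per_minute)
```

The resulting frame has a datetime `fDate` column and one row per minute, so it could be passed to `lEngine.train(iInputDS=per_minute, iTime='fDate', iSignal='area', iHorizon=...)` in place of the integer row counter. Minutes with no readings would appear as NaN and would need to be filled or dropped first.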

antoinecarme commented 4 years ago

If this can help, I used something like this to remove duplicates and outliers in Python code without modifying the CSV file:

df = pd.read_csv(csvAreaFile2, sep=r',', engine='python', skiprows=0)
# name the columns first so the filter below does not depend on the CSV header
df.columns = ['fDateO', 'area']
# remove duplicates
df = df.drop_duplicates()
# remove outliers
df = df[df['fDateO'] != '07:46:00']

antoinecarme commented 4 years ago

You really need more data (days). PyAF is a machine learning procedure: the model is estimated on one part of the dataset (the first 2h) and validated on the remaining, most recent part. You cannot expect a reliable confidence interval with this.
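The estimation/validation idea above can be sketched as a simple time-ordered split. This is an illustration only: the helper name and the 0.6 fraction are assumptions, not PyAF's actual split, and the 197 samples stand for one reading per minute over the 06:15 to 09:31 span mentioned earlier.

```python
def chronological_split(values, train_fraction=0.8):
    """Split a time-ordered sequence into (estimation, validation) parts without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(values) * train_fraction)
    return values[:cut], values[cut:]

# One sample per minute from 06:15 to 09:31 inclusive.
series = list(range(197))
train, valid = chronological_split(series, train_fraction=0.6)
print(len(train), len(valid))
```

With a horizon of 60 steps, a validation part of this size covers barely more than one forecast window, which gives very little evidence for estimating a confidence interval.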

antoinecarme commented 4 years ago

Closing issue after no response for 30 days. Not blocking. Please reopen if needed.

antoinecarme / pyaf

Trending Data Issue and Small Anomalous Period affecting Moving Average #118