antoinecarme / pyaf

PyAF is an Open Source Python library for Automatic Time Series Forecasting built on top of popular pydata modules.
BSD 3-Clause "New" or "Revised" License

Dataset containing high precision (nanoseconds) dates fails to train #175

Closed artrune closed 2 years ago

artrune commented 2 years ago

I was testing on some data and kept getting exceptions saying that training failed. After looking around, I realized it was because of the high-precision dates.


INFO:pyaf.std:START_TRAINING 'value'
Traceback (most recent call last):
  File "C:\PYTHON3\lib\site-packages\pyaf\ForecastEngine.py", line 25, in train
    self.mSignalDecomposition.train(iInputDS, iTime, iSignal, iHorizon, iExogenousData);
  File "C:\PYTHON3\lib\site-packages\pyaf\TS\SignalDecomposition.py", line 631, in train
    self.checkData(iInputDS, iTime, iSignal, iHorizon, iExogenousData);
  File "C:\PYTHON3\lib\site-packages\pyaf\TS\SignalDecomposition.py", line 604, in checkData
    type1 = np.dtype(iInputDS[iTime])
TypeError: Cannot interpret '0     2021-08-01 00:14:36.879515613+00:00
1     2021-08-01 00:13:22.755664335+00:00
2     2021-08-01 00:12:08.483382948+00:00
3     2021-08-01 00:10:54.242433585+00:00
4     2021-08-01 00:09:40.135882425+00:00
                      ...                
115   2021-07-31 21:51:43.580248426+00:00
116   2021-07-31 21:50:29.020741582+00:00
117   2021-07-31 21:49:15.175994058+00:00
118   2021-07-31 21:48:00.528170592+00:00
119   2021-07-31 21:46:46.214305238+00:00
Name: date, Length: 120, dtype: datetime64[ns, UTC]' as a data type

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\PYTHON3\lib\site-packages\pyaf\ForecastEngine.py", line 30, in train
    raise tsutil.PyAF_Error("TRAIN_FAILED");
pyaf.TS.Utils.PyAF_Error: TRAIN_FAILED

I changed the precision by casting my dates to seconds, and then train worked fine: df['date'] = df['date'].values.astype('<M8[s]'). It seems that the underlying problem is in some numpy function, but I'm not too sure.
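
The cast above can be sketched on a few rows. This is only an illustration under stated assumptions: the three timestamps are copied from the traceback above, and the names `cast` and `naive_seconds` are made up for the example. Note that `.values` on a timezone-aware pandas column returns a timezone-naive NumPy array in UTC, so the cast drops the timezone as well as the sub-second digits.

```python
import pandas as pd

# Three timestamps copied from the traceback above (nanosecond precision, UTC).
df = pd.DataFrame({
    "date": pd.to_datetime(
        [
            "2021-08-01 00:14:36.879515613+00:00",
            "2021-08-01 00:13:22.755664335+00:00",
            "2021-08-01 00:12:08.483382948+00:00",
        ],
        utc=True,
    )
})

# The cast from the report: .values gives a tz-naive NumPy array (in UTC),
# and astype('<M8[s]') truncates it to whole seconds.
cast = df["date"].values.astype("<M8[s]")

# A similar result spelled with pandas accessors: drop the timezone,
# then floor to the second.
naive_seconds = df["date"].dt.tz_localize(None).dt.floor("s")

df["date"] = naive_seconds
print(df["date"].dtype, df["date"].iloc[0])
```

Either form leaves a timezone-naive column with whole-second values, which is the shape of data that trained successfully in this report.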

antoinecarme commented 2 years ago

Hi @artrune

Interesting issue. I need to know the version of numpy / pyaf you are using. An anonymized version of your data and/or a script that fails are also welcome.

The basic requirements are here :

https://github.com/antoinecarme/pyaf/blob/master/ISSUE_TEMPLATE.md

artrune commented 2 years ago

Here's a subset of the data I was using:

{
    "results": [
        {
            "statement_id": 0,
            "series": [
                {
                    "name": "temperature",
                    "columns": [
                        "time",
                        "value"
                    ],
                    "values": [
                        [
                            "2021-08-01T15:19:40.791108016Z",
                            43.5901031494
                        ],
                        [
                            "2021-08-01T15:18:26.407537578Z",
                            43.5352668762
                        ],
                        [
                            "2021-08-01T15:17:12.252326253Z",
                            43.6632194519
                        ],
                        [
                            "2021-08-01T15:15:57.981777501Z",
                            43.4987106323
                        ],
                        [
                            "2021-08-01T15:14:43.74242866Z",
                            43.5535430908
                        ],
                        [
                            "2021-08-01T15:13:29.624277764Z",
                            43.5718231201
                        ],
                        [
                            "2021-08-01T15:12:15.401747322Z",
                            43.7363357544
                        ],
                        [
                            "2021-08-01T15:11:01.165682994Z",
                            43.480430603
                        ],
                        [
                            "2021-08-01T15:09:46.742506101Z",
                            43.4621505737
                        ],
                        [
                            "2021-08-01T15:08:32.613850591Z",
                            43.5535430908
                        ]
                    ]
                }
            ]
        }
    ]
}

Here are the versions of the packages:

numpy: 1.19.2
pandas: 1.2.4
pyaf: 2.0.1

I was loading the data into a dataframe using:

df_result = pd.DataFrame(json['results'][0]['series'][0]['values'], columns=['time', 'value'])
df_result['date'] = pd.to_datetime(df_result['time'], utc=True)

Training was failing, so I added the cast:

df_result['date'] = df_result['date'].values.astype('<M8[s]')

antoinecarme commented 2 years ago

@artrune

Thanks for the data. I will see what I can do with that.

  1. A new version of PyAF (3.0) is now available. I am not sure it will help here, but it is always better to test with the latest version. Please upgrade numpy and pandas to their latest versions too.
  2. Numpy and Pandas do not like nanoseconds. I will say more about that later in a separate comment (I need some time). Nanoseconds are not good for business data (IRL, it does not always make sense to wait for 12 nanoseconds ;)
  3. Nice to see that it works when nanoseconds are removed. What about the "utc" flag?
  4. Even if there is no "real" fix, I will try to make PyAF robust to this kind of input (ignore the nanoseconds and emit a warning message).

artrune commented 2 years ago

Thanks. I have no real interest in dealing with nanoseconds; the underlying database I was polling (InfluxDB 1.8) just stores dates with that precision.

I remember trying to remove the utc flag; I don't think it helped.
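
A small diagnostic sketch may help separate the two factors discussed here (the utc flag and the nanosecond precision). It is not PyAF's own check, and the helper name `numpy_can_interpret` is made up for the example. It parses two timestamps from the sample data with and without `utc=True`, then asks NumPy whether it accepts each column's dtype. Because the strings end in "Z", pandas returns a timezone-aware column in both cases, which would be consistent with removing the flag not helping.

```python
import numpy as np
import pandas as pd

# Two timestamps taken from the sample data posted in this thread.
times = pd.Series([
    "2021-08-01T15:19:40.791108016Z",
    "2021-08-01T15:18:26.407537578Z",
])

with_flag = pd.to_datetime(times, utc=True)
without_flag = pd.to_datetime(times)  # the trailing "Z" still marks the values as UTC


def numpy_can_interpret(series):
    """Return True if NumPy accepts the column's dtype as a NumPy dtype."""
    try:
        np.dtype(series.dtype)
        return True
    except TypeError:
        return False


for label, col in [
    ("utc=True", with_flag),
    ("no utc flag", without_flag),
    ("tz removed", with_flag.dt.tz_localize(None)),
]:
    print(label, col.dtype, numpy_can_interpret(col))
```

In this sketch, only the column with the timezone removed is accepted by `np.dtype`, even though it still carries nanosecond precision. That suggests, without proving anything about PyAF internals, that the timezone-aware dtype may play a part in the TypeError shown in the traceback.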

antoinecarme commented 2 years ago

OK on the nanoseconds. That will help with prioritizing this issue. It seems to be a numpy issue, as you noted in your first comment.

Since we have a workaround, this bug will be fixed in PyAF 4.0 (release date: July 2022 ;).

antoinecarme commented 2 years ago

Thanks for this feedback. I was interested some time ago (a long time ago) in IoT and time series usage with PyAF:

https://github.com/antoinecarme/pyaf/issues/3

Do you have any pointers, demos, or notebook links showing your usage of PyAF with InfluxDB?

artrune commented 2 years ago

To be honest, I wasn't using any demo data. I just happen to have been collecting my Raspberry Pi's temperature for some months now. If you want that data, I can share it without any problem.

antoinecarme commented 2 years ago

@artrune

LOL. So, just by playing with your Raspberry Pi and a Time Series database, you are helping fix a problem for PyAF and probably numpy users. Don't underestimate the power of usage !!!

Thank you very much for helping make PyAF better and don't hesitate to come back and report similar issues.

Your Raspberry Pi rocks !!!

Antoine

artrune commented 2 years ago

Absolutely! I'll let my Raspberry Pi know he's a good boy!

Thanks and stay safe.