cta-sst-1m / digicampipe

DigiCam pipeline based on ctapipe
GNU General Public License v3.0

[python 3.5] pd.to_datetime(data['time']), which triggers a timezone error #252

Closed: calispac closed this issue 6 years ago

calispac commented 6 years ago

The Travis tests are currently blocked in all the recent PRs: #251 #250 #249 #248 #247 #246 #245

Link to the Travis test: https://travis-ci.org/cta-sst-1m/digicampipe/jobs/437119169

We identified a bug with pandas and python 3.5. However we don't know the exact source of the issue. Should we try to solve this? Or just drop python 3.5?

Like if you want to drop py3.5. Dislike if you prefer to solve the issue

yrenier commented 6 years ago

I tried to solve the issue, but without any success or real understanding of it. I would prefer to have it solved, but as it's been a week already, I'm OK with dropping 3.5.

dneise commented 6 years ago

I agree with Yves: we should solve the py3.5 issue and not drop it just because we cannot figure out what is going on ... maybe we can decide in the evening, when everybody has had a chance to try their luck and fix this problem.

dneise commented 6 years ago

The problem with py3.5 does not only occur in new PRs .. but also in the master.

dneise commented 6 years ago

I deselected test_dataquality.test_data_quality() ... and found all other tests pass but this one. So maybe we only need to "fix" this test ... looking into it.

calispac commented 6 years ago

Yes this is where it originated. I am scratching my head now :smile:

yrenier commented 6 years ago

From what I understood, the problem comes from pd.to_datetime(data['time']), which triggers a timezone error when the result is plotted.

dneise commented 6 years ago

The problem seemed to be plotting data with a datetime index that is not timezone-aware. I wanted to find the minimal code which reproduces the error, so I tried this:

def test_foo():
    # os and tempfile are needed for the temporary output path below
    import os
    import tempfile

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    with tempfile.TemporaryDirectory() as tmpdirname:
        N = 100
        time = (
            np.random.randint(low=-1000, high=1000, size=N) +
            1509415494508984736
        )
        trigger_rate = np.random.normal(loc=77, size=N)

        data = pd.DataFrame(
            data={
                'time': time,
                'trigger_rate': trigger_rate,
            }
        )

        data['time'] = pd.to_datetime(data['time'])
        data = data.set_index('time')

        plt.figure()
        plt.plot(data['trigger_rate'] * 1E9)
        plt.ylabel('rate [Hz]')
        out_path = os.path.join(tmpdirname, 'foo.png')
        plt.savefig(out_path)

[updated 14:48]

this includes now the to_datetime() call ... but it runs nicely

dneise commented 6 years ago

Sorry .. false alarm .. I am still unable to find a small example to reproduce this error

dneise commented 6 years ago

Ah .. okay, I am slowly getting closer ..

So in the above example, I use this to fake the time:

time = (
    np.random.randint(low=-1000, high=1000, size=N) +
    1509415494508984736
)

The number I took from the real test_data_quality test by putting a print(data) here:

data = Table.read(fits_filename, format='fits')
data = data.to_pandas()
print(data)   # <----- this is to look at data before to_datetime
data['time'] = pd.to_datetime(data['time'])

Okay so far so good .. so I took the number .. and randomly added some seconds to make a couple of different times.

dneise commented 6 years ago

Now I realized the number 1509415494508984736 is not the time in seconds ... the time in seconds is more like 1539694585 .. So the number is more like .. the time in nanoseconds... so I was only adding nanoseconds ..
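
A quick way to check the units is to split the raw integer into whole seconds and leftover nanoseconds. This is a minimal sketch using only the standard library; the helper name `split_ns` is made up for illustration and is not part of digicampipe.

```python
from datetime import datetime, timezone

# Value printed from the test data, as quoted in this thread.
RAW = 1509415494508984736


def split_ns(raw_ns):
    """Split an integer nanosecond timestamp into (UTC datetime, leftover ns).

    Hypothetical helper, only here to check the units of the raw value.
    """
    seconds, nanoseconds = divmod(raw_ns, 10**9)
    return datetime.fromtimestamp(seconds, tz=timezone.utc), nanoseconds


when, frac_ns = split_ns(RAW)
print(when.isoformat(), frac_ns)
```

Read as nanoseconds, the value lands on 2017-10-31 02:04:54 UTC, which matches the timestamps pandas prints for this data. Adding a random integer in [-1000, 1000) therefore shifts each time by at most a microsecond.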

So I multiplied the random numbers with 1000 .. and guess what .. it fails.

time = (
    np.random.randint(low=-1000, high=1000, size=N) * 1000 +  #  <--- * 1000 makes it fail
    1509415494508984736
)

dneise commented 6 years ago

So far so good so now I have a minimal example which nicely fails

dneise commented 6 years ago

Ah and when I multiply the random ints with 1e9 .. so they are really random seconds .. it does not fail anymore. There is a sweet spot between 1000 and 100000 where it fails .. very stupid
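
To see which total time range each multiplier produces on the x-axis, the fake timestamps can be generated for several scales and their span printed. This is only a sketch: the function `time_span` and its parameters `scale`, `n` and `seed` are illustrative and not part of the original script, and it does not by itself explain why some ranges fail.

```python
import numpy as np
import pandas as pd

BASE_NS = 1509415494508984736  # nanosecond timestamp quoted in this thread


def time_span(scale, n=100, seed=0):
    """Return max - min of the fake timestamps for a given offset scale.

    `scale`, `n` and `seed` are illustrative parameters, not taken from
    the original script.
    """
    rng = np.random.RandomState(seed)
    offsets = rng.randint(low=-1000, high=1000, size=n).astype(np.int64)
    # Integers passed to pd.to_datetime are interpreted as nanoseconds.
    times = pd.to_datetime(offsets * np.int64(scale) + BASE_NS)
    return times.max() - times.min()


for scale in (1, 1000, 100000, 10**9):
    print(scale, time_span(scale))
```

According to the comments above, scales 1 and 1e9 plot fine while 1000 and 100000 fail, so the failing cases are the ones whose span is in the range of a few milliseconds up to a few hundred milliseconds.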

yrenier commented 6 years ago

nice ! However I'm still not sure if it is a pandas, pytz or matplotlib regression.

calispac commented 6 years ago

> Ah and when I multiply the random ints with 1e9 .. so they are really random seconds .. it does not fail anymore.

Could you print the complete error please ?

dneise commented 6 years ago

Sure ... I am also posting the minimal code that reproduces the error here, so that everybody who wants to study this problem can reproduce it and play with it. Here is the error:

15:10 $ python force_error.py 
                               trigger_rate
time                                       
2017-10-31 02:04:54.585584736     76.807937
2017-10-31 02:04:54.586784736     77.399938
2017-10-31 02:04:54.590584736     78.088984
2017-10-31 02:04:54.515484736     74.955927
Traceback (most recent call last):
  File "/home/dneise/anaconda3/envs/digicampipe/lib/python3.5/site-packages/matplotlib/backends/backend_qt5.py", line 519, in _draw_idle
    self.draw()
  File "/home/dneise/anaconda3/envs/digicampipe/lib/python3.5/site-packages/matplotlib/backends/backend_agg.py", line 402, in draw
    self.figure.draw(self.renderer)
  File "/home/dneise/anaconda3/envs/digicampipe/lib/python3.5/site-packages/matplotlib/artist.py", line 50, in draw_wrapper
    return draw(artist, renderer, *args, **kwargs)
  File "/home/dneise/anaconda3/envs/digicampipe/lib/python3.5/site-packages/matplotlib/figure.py", line 1652, in draw
    renderer, self, artists, self.suppressComposite)
  File "/home/dneise/anaconda3/envs/digicampipe/lib/python3.5/site-packages/matplotlib/image.py", line 138, in _draw_list_compositing_images
    a.draw(renderer)
  File "/home/dneise/anaconda3/envs/digicampipe/lib/python3.5/site-packages/matplotlib/artist.py", line 50, in draw_wrapper
    return draw(artist, renderer, *args, **kwargs)
  File "/home/dneise/anaconda3/envs/digicampipe/lib/python3.5/site-packages/matplotlib/axes/_base.py", line 2604, in draw
    mimage._draw_list_compositing_images(renderer, self, artists)
  File "/home/dneise/anaconda3/envs/digicampipe/lib/python3.5/site-packages/matplotlib/image.py", line 138, in _draw_list_compositing_images
    a.draw(renderer)
  File "/home/dneise/anaconda3/envs/digicampipe/lib/python3.5/site-packages/matplotlib/artist.py", line 50, in draw_wrapper
    return draw(artist, renderer, *args, **kwargs)
  File "/home/dneise/anaconda3/envs/digicampipe/lib/python3.5/site-packages/matplotlib/axis.py", line 1185, in draw
    ticks_to_draw = self._update_ticks(renderer)
  File "/home/dneise/anaconda3/envs/digicampipe/lib/python3.5/site-packages/matplotlib/axis.py", line 1023, in _update_ticks
    tick_tups = list(self.iter_ticks())  # iter_ticks calls the locator
  File "/home/dneise/anaconda3/envs/digicampipe/lib/python3.5/site-packages/matplotlib/axis.py", line 967, in iter_ticks
    majorLocs = self.major.locator()
  File "/home/dneise/anaconda3/envs/digicampipe/lib/python3.5/site-packages/matplotlib/dates.py", line 1230, in __call__
    return self._locator()
  File "/home/dneise/anaconda3/envs/digicampipe/lib/python3.5/site-packages/pandas/plotting/_converter.py", line 473, in __call__
    freq=freq, tz=tz).astype(object)
  File "/home/dneise/anaconda3/envs/digicampipe/lib/python3.5/site-packages/pandas/core/indexes/datetimes.py", line 2749, in date_range
    closed=closed, **kwargs)
  File "/home/dneise/anaconda3/envs/digicampipe/lib/python3.5/site-packages/pandas/core/indexes/datetimes.py", line 381, in __new__
    ambiguous=ambiguous)
  File "/home/dneise/anaconda3/envs/digicampipe/lib/python3.5/site-packages/pandas/core/indexes/datetimes.py", line 506, in _generate
    tz = timezones.maybe_get_tz(tz)
  File "pandas/_libs/tslibs/timezones.pyx", line 87, in pandas._libs.tslibs.timezones.maybe_get_tz
  File "pandas/_libs/tslibs/timezones.pyx", line 102, in pandas._libs.tslibs.timezones.maybe_get_tz
  File "/home/dneise/anaconda3/envs/digicampipe/lib/python3.5/site-packages/pytz/__init__.py", line 177, in timezone
    raise UnknownTimeZoneError(zone)
pytz.exceptions.UnknownTimeZoneError: 'UTC+00:00'

and here is my force_error.py:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

N = 4
time = (
    np.random.randint(low=-1000, high=1000, size=N) * 100000 +
    1509415494508984736
)
trigger_rate = np.random.normal(loc=77, size=N)

data = pd.DataFrame(
    data={
        'time': time,
        'trigger_rate': trigger_rate,
    }
)

data['time'] = pd.to_datetime(data['time'])
data = data.set_index('time')

print(data)

plt.figure()
plt.plot(data['trigger_rate'] * 1E9)
plt.ylabel('rate [Hz]')
plt.show()

Notes:

The reason seems to be somewhere in the part where pandas/matplotlib want to make a date_range with equal spacing for the given dataset, so that they can plot the x-axis ...
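
The last frame of the traceback can be looked at in isolation: pytz is asked for a zone named 'UTC+00:00' and does not know it, while plain 'UTC' is accepted. The sketch below only demonstrates that lookup; the helper `is_known_zone` is made up for illustration, and where the string 'UTC+00:00' is produced in the pandas/matplotlib call chain is not established here.

```python
import pytz


def is_known_zone(name):
    """Return True if pytz can resolve the given zone name.

    Hypothetical helper for illustration only.
    """
    try:
        pytz.timezone(name)
    except pytz.UnknownTimeZoneError:
        return False
    return True


for name in ("UTC", "UTC+00:00"):
    print(name, is_known_zone(name))
```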

dneise commented 6 years ago

Maybe the problem is here: https://github.com/pandas-dev/pandas/blob/8af2bea07f7864e1df8ee1c43546cad59043fa7a/pandas/plotting/_converter.py#L465-L469

tz is set independently of the timezone of the given dataset

dneise commented 6 years ago

Ah no .. that's not the problem .. I am just not fit to understand the code here.

dneise commented 6 years ago

Ah, and another remark: I tried to fix this issue by making the data's DatetimeIndex timezone-aware using tz_localize .. it did not solve the issue.
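
For reference, this is roughly what making the index timezone-aware looks like. It is a sketch of the attempted workaround, not a fix: as noted above, it did not make the error go away. The two timestamps are values from this thread.

```python
import pandas as pd

# Two nanosecond timestamps taken from this thread.
index = pd.to_datetime([1509415494508984736, 1509415494585584736])
print(index.tz)  # the index is timezone-naive at this point

# Attach UTC to the naive index; the wall-clock values stay the same.
aware = index.tz_localize("UTC")
print(aware.tz)
```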

yrenier commented 6 years ago

Fixed by 5ff62c2. Closing the issue?