ArtesiaWater / hydropandas

Module for loading observation data into custom DataFrames
https://hydropandas.readthedocs.io
MIT License

Knmi data contains a duplicate index at 1-1-1970 #47

Closed rubencalje closed 3 years ago

rubencalje commented 3 years ago

Right now KNMI data is downloaded in ObsCollection.from_knmi(). KNMI data from stations is combined to generate an equidistant series.

For some reason there is a duplicate index at 1-1-1970. I think KNMI simply sends this date twice. We should drop the duplicate index so that the resulting series is equidistant.
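
A minimal sketch of what dropping the duplicate could look like, using a hand-made series instead of real KNMI data. The values and the names `obs`, `obs_unique` and `obs_daily` are made up for illustration and are not part of hydropandas:

```python
import pandas as pd

# Hypothetical daily series in which 1970-01-01 appears twice (not real KNMI data)
index = pd.DatetimeIndex(["1969-12-31", "1970-01-01", "1970-01-01", "1970-01-02"])
obs = pd.Series([0.1, 0.2, 0.2, 0.3], index=index)

# Keep the first occurrence of every timestamp and drop the later ones
obs_unique = obs[~obs.index.duplicated(keep="first")]

# Convert to a daily frequency; missing days would show up as NaN
obs_daily = obs_unique.asfreq("D")
print(obs_daily)
```

Keeping the first occurrence is an arbitrary choice here; if the two rows could differ, it would be worth checking which one is correct before dropping anything.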

OnnoEbbens commented 3 years ago

What code do you use to download the KNMI data? I don't get a duplicate when using this code:

# assumes `oc` is the hydropandas obs_collection module
from hydropandas import obs_collection as oc

stns = [344, 260]  # Rotterdam and De Bilt
oc_knmi = oc.ObsCollection.from_knmi(stns=stns,
                                     meteo_vars=["EV24", "RH"],
                                     start=['1970', '1970'],
                                     end=['2015', '2015'],
                                     verbose=True)

rubencalje commented 3 years ago

I think you have to start before 1970 to get the double index:

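# assumes `oc` is the hydropandas obs_collection module
from hydropandas import obs_collection as oc

stns = [344, 260]  # Rotterdam and De Bilt
oc_knmi = oc.ObsCollection.from_knmi(stns=stns,
                                     meteo_vars=["EV24", "RH"],
                                     start=['1969', '1969'],
                                     end=['1971', '1971'],
                                     verbose=True)
# count the duplicated timestamps in the first observation of the collection
ndup = oc_knmi.iloc[0]['obs'].index.duplicated().sum()
print()
print(f'{ndup} duplicated indexes')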

This gives:

Download EV24 from 344 Rotterdam
transform EV24, Potential evapotranspiration (Makkink) (in 0.1 mm); from 0.1 to 1
transform EV24, Potential evapotranspiration (Makkink) (in mm); from mm to m
station 344 has 734 missing measurements
Download EV24 from 260 De Bilt
transform EV24, Potential evapotranspiration (Makkink) (in 0.1 mm); from 0.1 to 1
transform EV24, Potential evapotranspiration (Makkink) (in mm); from mm to m
station 260 has 0 missing measurements
Download RH from 344 Rotterdam
transform RH, Daily precipitation amount (in 0.1 mm) (-1 for <0.05 mm); from 0.1 to 1
transform RH, Daily precipitation amount (in mm) (-1 for <0.05 mm); from mm to m
station 344 has 734 missing measurements
Download RH from 260 De Bilt
transform RH, Daily precipitation amount (in 0.1 mm) (-1 for <0.05 mm); from 0.1 to 1
transform RH, Daily precipitation amount (in mm) (-1 for <0.05 mm); from mm to m
station 260 has 0 missing measurements

1 duplicated indexes

rubencalje commented 3 years ago

I see now that I also get the double index in the Observation when starting at 1970.

OnnoEbbens commented 3 years ago

I do not get a double index in either case. What package versions are you using? I use:

dbrakenhoff commented 3 years ago

I also get the duplicate index with pandas 1.1.2, but after upgrading to 1.1.4 the duplicate is gone.

It would be good to figure out what in pandas is causing this difference.

rubencalje commented 3 years ago

Ah, it has to do with the normalize method. For some reason pandas normalizes '1969-12-31 01:00:00' to '1970-01-01' instead of '1969-12-31'. See this example:

import pandas as pd
from hydropandas.observation import KnmiObs

knmi = KnmiObs.from_knmi(260, 'EV24', startdate=pd.Timestamp('1970-1-1'),
                         enddate=pd.Timestamp('1970-1-1'), verbose=True)
print(knmi.index)  # raw timestamps, at 01:00:00
print(knmi.index.normalize())  # the same timestamps set to midnight

Which gives:

transform EV24, Potential evapotranspiration (Makkink) (in 0.1 mm); from 0.1 to 1
transform EV24, Potential evapotranspiration (Makkink) (in mm); from mm to m
Download EV24 from 260 De Bilt
station 260 has 0 missing measurements
DatetimeIndex(['1969-12-31 01:00:00', '1970-01-01 01:00:00',
               '1970-01-02 01:00:00'],
              dtype='datetime64[ns]', freq='D')
DatetimeIndex(['1970-01-01', '1970-01-01', '1970-01-02'], dtype='datetime64[ns]', freq=None)

So it is a bug in pandas that has been fixed in the latest version. This issue can be closed.
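
To check a given pandas installation without downloading anything, the index from the output above can be rebuilt by hand. This is only a sketch: the timestamps are copied from the printed index, and the expectation of zero duplicates holds for a pandas version without the bug.

```python
import pandas as pd

# Same timestamps as in the printed index above, built by hand instead of downloaded
index = pd.DatetimeIndex(
    ["1969-12-31 01:00:00", "1970-01-01 01:00:00", "1970-01-02 01:00:00"]
)

# Set every timestamp to midnight of its own day
normalized = index.normalize()
print(pd.__version__)
print(normalized)

# Count duplicated dates after normalizing; an unaffected pandas version gives 0
ndup = normalized.duplicated().sum()
print(f"{ndup} duplicated indexes")
```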

rubencalje commented 3 years ago

It seems that all dates before 1970 are normalized to the next day in older versions of pandas. This probably has to do with the fact that pandas stores dates behind the scenes as a number counted from 1970-01-01, so 1970 is 0 and every earlier date is a negative number. (Is "behind the scenes" English or just a Dutch saying?)
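
The arithmetic behind this guess can be sketched as follows. This is one plausible mechanism, not something confirmed against the pandas source: rounding a negative day count towards zero instead of downwards lands on the next day. The names `NS_PER_DAY`, `day_floor` and `day_trunc` are made up for this illustration.

```python
import pandas as pd

NS_PER_DAY = 24 * 60 * 60 * 10**9  # nanoseconds in one day

ts = pd.Timestamp("1969-12-31 01:00:00")
ns = ts.value  # nanoseconds since 1970-01-01; negative for dates before 1970

# Floor division rounds towards minus infinity: day -1, i.e. 1969-12-31
day_floor = ns // NS_PER_DAY
# Truncation rounds towards zero: day 0, i.e. 1970-01-01 (the next day)
day_trunc = int(ns / NS_PER_DAY)

epoch = pd.Timestamp("1970-01-01")
print(ns)
print(epoch + pd.Timedelta(days=day_floor))
print(epoch + pd.Timedelta(days=day_trunc))
```

For dates on or after 1970-01-01 the two roundings agree, which would explain why only the pre-1970 part of the series is affected.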