NORCatUofC / rain

An open-source data science project about rainfall in Chicago
MIT License

Data missing from ohare (plenario) dataset #9

Closed kbrose closed 8 years ago

kbrose commented 8 years ago

There is no data in the O'Hare hourly precipitation dataset from early 1997 through the end of 1999.

kbrose commented 8 years ago

I think I found all the places where more than 24 consecutive hours were missing. There were 7 gaps in total: 6 were shorter than a week, and one was the gap described above:

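import numpy as np
# Index as Unix seconds; the rows appear to be sorted newest-first, so a gap shows up as a large negative diff.
tm = rain_df.index.astype(np.int64) // 10**9
large_gaps = np.where(np.diff(tm) < -3600 * 24)[0]
# Show the rows on both sides of each gap.
rain_df.iloc[sorted(np.hstack((large_gaps, large_gaps + 1)))]['Unnamed: 0']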
datetime        integer index (?)
--------------------------
2016-08-01 00:51:00    440
2016-07-26 22:51:00    441
2016-07-01 00:51:00    192
2016-06-24 23:51:00    193
2016-06-01 00:51:00    852
2016-05-28 22:51:00    853
2016-05-01 00:42:00    715
2016-04-27 23:51:00    716
2016-04-01 00:14:00    579
2016-03-28 23:51:00    580
2016-03-01 00:51:00    481
2016-02-27 00:51:00    482
2000-01-01 00:56:00    210
1997-02-28 23:56:00    211
Name: Unnamed: 0, dtype: int64

pjsier commented 8 years ago

@kbrose Thanks for catching this! I just checked plenario again, and it definitely looks like that chunk is missing from the API. I'm now downloading the full datasets for O'Hare and Midway (both go back to around the 1940s) from NOAA's Local Climatological Data, and we can see if that fills in the gaps.

pjsier commented 8 years ago

@kbrose I just uploaded the data to the Drive (links below). I haven't had the chance to look through it fully for missing chunks, but at first glance the data for the main gap you mentioned is in there.

pjsier commented 8 years ago

I ran the same numpy code on the updated dataset to check for gaps (large_gaps = np.where(np.diff(tm) < -3600*24)[0]), and it didn't return any rows for O'Hare or Midway. I also tightened the threshold to large_gaps = np.where(np.diff(tm) < -3601)[0] to see if there were any gaps longer than an hour, and it didn't find any for either station.
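
For anyone repeating this check, here is a rough pandas sketch of the same idea that does not depend on the row order and also reports how long each gap is. The helper name find_gaps and the synthetic timestamps are made up for illustration; this is not the code that was run on the real O'Hare or Midway files.

```python
import pandas as pd

def find_gaps(index, threshold):
    """Return one row per gap longer than `threshold` between consecutive timestamps.

    `find_gaps` is a hypothetical helper, not part of the rain repo.
    """
    # Sort ascending so the result does not depend on the original row order.
    ts = pd.Series(pd.DatetimeIndex(index).sort_values())
    delta = ts.diff()  # time elapsed since the previous observation
    mask = delta > threshold
    return pd.DataFrame({
        "gap_start": ts.shift(1)[mask],
        "gap_end": ts[mask],
        "duration": delta[mask],
    }).reset_index(drop=True)

# Synthetic hourly timestamps with three days removed (not the real data).
full = pd.date_range("2016-03-01 00:51", periods=10 * 24, freq=pd.Timedelta(hours=1))
observed = full[(full < "2016-03-04") | (full >= "2016-03-07")]

gaps = find_gaps(observed, pd.Timedelta(hours=24))
print(gaps)

# A complete hourly series has no gap strictly longer than one hour.
print(find_gaps(full, pd.Timedelta(hours=1)).empty)
```

On real data the observation minute is not always the same (00:51, 00:42, 00:14 all appear in the output above), so a threshold of exactly one hour may flag ordinary jitter; a slightly larger threshold may be a better fit.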

Looks like these are much more complete so I'll close this, and here are some links to versions cleaned up a bit and reduced just to HOURLYPrecip, location, and datetime: