earthlab / matplotcheck

A Python package for checking and testing matplotlib plots. We use this for autograding student assignments, but there are many other potential use cases, including package testing (for packages that produce plots)!
https://matplotcheck.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Timeseries data check failing - coercion issue #169

Closed lwasser closed 4 years ago

lwasser commented 4 years ago
import matplotcheck.notebook as nb
import matplotcheck.timeseries as mts
import matplotcheck.raster as ra
# BEGIN SOLUTION
import os
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.dates import DateFormatter
import seaborn as sns
import numpy as np
import pandas as pd
import earthpy as et
import hydrofunctions as hf
import urllib

from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

# Get data (moved after the imports so that et and os are defined before use)
data = et.data.get_data("colorado-flood")
os.chdir(os.path.join(et.io.HOME, 'earth-analytics'))

# prettier plotting with seaborn
sns.set(font_scale=1.5)
sns.set_style("whitegrid")

# import file
f = "data/colorado-flood/discharge/06730200-discharge-daily-1986-2013.txt"
discharge = pd.read_csv(f,
                        skiprows=23,
                        header=[1, 2],
                        sep='\t',
                        parse_dates=[2])
# drop one level of index
discharge.columns = discharge.columns.droplevel(1)
# set the date column as the index
discharge = discharge.set_index(["datetime"])

monthly_max_all = discharge.resample("M").max()
monthly_max = monthly_max_all['1990':'2014']

fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(x=monthly_max.index,
           y=monthly_max["17663_00060_00003"],
           color="purple")
ax.set_title(
    "HW Plot 1: Stream Discharge - Monthly Max Value\n be sure to add x and y labels (not shown here)")

ax.set(xlabel="Date")

# END SOLUTION

### DO NOT REMOVE LINE BELOW ###
plot_1_ts = nb.convert_axes(plt, which_axes="current")

mpc_plot_1_ts = mts.TimeSeriesTester(plot_1_ts)
results = []

TEST

mpc_plot_1_ts.assert_xydata(xy_expected=precip.reset_index(),
                            xtime=True,
                            xcol="DATE",
                            ycol="HPCP")

Output error (I suspect this has something to do with NaN values, but I am not sure):

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-1b4d8c2d48dc> in <module>
      4                            xtime=True,
      5                            xcol="DATE",
----> 6                            ycol="HPCP")
      7 
      8 #precip

~/Documents/github/0-python/matplotcheck/matplotcheck/base.py in assert_xydata(self, xy_expected, xcol, ycol, points_only, xtime, xlabels, tolerence, message)
    938             try:
    939                 np.testing.assert_array_max_ulp(
--> 940                     np.array(xy_data["x"]), np.array(xy_expected[xcol])
    941                 )
    942             except AssertionError:

~/miniconda3/envs/earth-analytics-python/lib/python3.7/site-packages/numpy/testing/_private/utils.py in assert_array_max_ulp(a, b, maxulp, dtype)
   1617     __tracebackhide__ = True  # Hide traceback for py.test
   1618     import numpy as np
-> 1619     ret = nulp_diff(a, b, dtype)
   1620     if not np.all(ret <= maxulp):
   1621         raise AssertionError("Arrays are not almost equal up to %g ULP" %

~/miniconda3/envs/earth-analytics-python/lib/python3.7/site-packages/numpy/testing/_private/utils.py in nulp_diff(x, y, dtype)
   1658         y = np.array(y)
   1659 
-> 1660     t = np.common_type(x, y)
   1661     if np.iscomplexobj(x) or np.iscomplexobj(y):
   1662         raise NotImplementedError("_nulp not implemented for complex array")

<__array_function__ internals> in common_type(*args, **kwargs)

~/miniconda3/envs/earth-analytics-python/lib/python3.7/site-packages/numpy/lib/type_check.py in common_type(*arrays)
    719             p = array_precision.get(t, None)
    720             if p is None:
--> 721                 raise TypeError("can't get common type for non-numeric array")
    722         precision = max(precision, p)
    723     if is_complex:

TypeError: can't get common type for non-numeric array
  1. I suspect this has to do with NA values; we need to clear those out.
nkorinek commented 4 years ago

Update: This bug is very complicated.

The initial suspicion that this was due to NA values in the dataset turned out to be incorrect. I've tested the dataset against itself: there are no NA values in the dataset to start with, and the bug still comes up when the datasets contain no NA values and are identical.

The next guess was that the issue is related to how the data is transformed during the assert_xydata check. We noticed that the datetime objects were not timezone aware, i.e. the datatype was datetime64[ns], which turned into datetime64[ns, UTC] during the check. Seeing as the error was TypeError: can't get common type for non-numeric array, this seemed like it may be the culprit. Right after importing the data, I added the line dataframe.index = dataframe.index.tz_localize('UTC'). This made the index timezone aware, and in the same timezone as the data being checked.
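For anyone following along, the tz_localize step described above can be sketched in isolation (a minimal example with made-up discharge values; the dataframe name and data are illustrative, not from the actual homework dataset):

```python
import pandas as pd

# Build a small time series with a timezone-naive index.
dates = pd.date_range("2013-09-01", periods=3, freq="D")
df = pd.DataFrame({"discharge": [100.0, 250.0, 175.0]}, index=dates)
print(df.index.dtype)  # datetime64[ns] -- timezone naive

# Localizing the index makes it timezone aware, matching the dtype
# that the data acquires during the assert_xydata check.
df.index = df.index.tz_localize("UTC")
print(df.index.dtype)  # datetime64[ns, UTC]
```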

Side note: I tried to put the data in as MST, seeing as that's the timezone it was collected in, however, this did not do what was intended. It changed the time to be 7 hours earlier than UTC time when I tried to import it like that. Additionally, even when the data was timezone aware, it still got changed to the UTC timezone during the check.

Even with all data as timezone aware in the same timezone, both the warning and TypeError are still present.

I'll update this further with what else I try to get around this error.

nkorinek commented 4 years ago

Alright we found the issue. Turns out the bug and the warning are not as related as we previously suspected. However, they are fixed by the same change to the code.

The warning was being caused by trying to convert a Pandas DataFrame column with a datetime64[ns] datatype into a numpy array. So the line in the assert_xydata function in matplotcheck that converts the datetime columns into numpy arrays is what was causing the warning, i.e. line 940: np.testing.assert_array_max_ulp(np.array(xy_data["x"]), np.array(xy_expected[xcol])).

The error we were getting, TypeError: can't get common type for non-numeric array, was caused by the datatype as well. The function np.testing.assert_array_max_ulp checks the datatype of the arrays passed in; if the datatype is not float or integer, it fails automatically. We were under the impression that if xtime=True is set, then assert_xydata would convert the dates to numbers. Upon reviewing the code, this did not seem to be the case.
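The TypeError can be reproduced without matplotcheck at all: assert_array_max_ulp eventually calls np.common_type, which only knows about numeric dtypes, so feeding it datetime64 data raises the same error seen in the traceback above. A minimal reproduction:

```python
import numpy as np

# np.common_type has no precision entry for datetime64, so it raises
# the same TypeError that surfaced through assert_array_max_ulp.
dates = np.array(["2013-09-01", "2013-09-02"], dtype="datetime64[ns]")
try:
    np.common_type(dates, dates)
except TypeError as err:
    print(err)  # can't get common type for non-numeric array
```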

So the fix is relatively straight forward. We will be adding in the following chunk of code:

# Convert datetime columns to matplotlib date numbers (floats) so that
# np.testing.assert_array_max_ulp can compare them numerically.
if xtime and 'datetime' in str(xy_data.dtypes['x']):
    xy_data["x"] = mdates.date2num(xy_data["x"])
if xtime and 'datetime' in str(xy_expected.dtypes[xcol]):
    xy_expected[xcol] = mdates.date2num(xy_expected[xcol])

This converts the dates to floats, which can then be properly checked by np.testing.assert_array_max_ulp. And since the columns are now numeric, converting them to a numpy array no longer throws a warning. The code also checks that the datatypes are indeed datetime objects before making the transformation.
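To illustrate what date2num does here (a standalone sketch; the exact float values depend on which epoch your matplotlib version uses, so only the difference between values is shown):

```python
import numpy as np
import matplotlib.dates as mdates

# date2num turns datetime-like values into floats: days since
# matplotlib's epoch, with sub-day times as fractional days.
stamps = np.array(["2013-09-01", "2013-09-02"], dtype="datetime64[ns]")
nums = mdates.date2num(stamps)
print(nums.dtype)          # float64
print(nums[1] - nums[0])   # 1.0 -- the two dates are one day apart
```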

I'll make a pr with these changes and a few tests to make sure they behave properly. @lwasser

lwasser commented 4 years ago

awesome @nkorinek thank you!!!

nkorinek commented 4 years ago

gist explaining my current dilemma: https://gist.github.com/nkorinek/accc2b905145763dc2035300a895204c

lwasser commented 4 years ago

It looks like we have a fix for this in #185!! yay! @ryla5068 will test the fix against Homework 1, and if it works, he will update the tests there that currently fail... and then we can merge this PR!! Next step will be writing tests for the time series module!!

lwasser commented 4 years ago

notes about this issue

"""" matplotlib stores datetime data in a very... unique way. It stores everything as the number of days since some epoch. If you plot a dataframe containing datetime.datetime objects or pandas.Timestamp objects, it will convert it days since epoch. Sometimes matplotlib chooses Jan 1, 1970 as the epoch. Other times it chooses Jan 1, 0001. If your data contains time data (i.e. higher precision than just dates), matplotlib will store it as fractional days since epoch, down to millisecond precision (or whatever precision your data is in). For datetime data between these epochs, sometimes it will choose to store it as negative days since 1970, other times it will store it as positive days since the year 0001.

matplotlib DOES provide functions for converting data from this weird format back to datetime.datetime or pandas.Timestamp. However, these functions always assume that the 1970 epoch was used.
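As a concrete sketch of those conversion functions: the round trip through date2num/num2date is consistent as long as the same epoch is in effect for both calls within one process (on matplotlib >= 3.3, mdates.get_epoch() reports which epoch is being used):

```python
import datetime
import matplotlib.dates as mdates

# Round-trip a timestamp through matplotlib's date format. The float
# value of `num` depends on the configured epoch, but converting back
# with the same epoch in effect recovers the original date.
d = datetime.datetime(2013, 9, 12, 6, 30)
num = mdates.date2num(d)
back = mdates.num2date(num)  # returns a timezone-aware datetime (UTC)
print(back.year, back.month, back.day)  # 2013 9 12
```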

matplotlib's documentation claims that negative values for datetime data are not supported, and therefore data representing dates before 1970 are not supported. However, matplotlib will happily plot data before 1970 and its conversion functions will happily accept negative numbers and try to convert them.

As you might imagine, this presents a number of issues for comparing datetime data. Most obviously, it gets unreliable when we have to guess which epoch matplotlib chose to use. We have tried a few different methods here: different ways of converting the data, converting using both epochs and comparing both, etc. All of them were pretty messy.

Additionally, there is the issue of floating point truncation error. matplotlib stores this data with numpy.float64, which has 52 mantissa bits, or about 15 base-10 digits of accuracy. Since the number of days since epoch is often in the tens-of-thousands, this means that matplotlib may not be able to accurately represent data with millisecond precision. (Basically, the datatype isn't able to store such a huge number with such small precision.) The actual available precision will depend on the dates being used and the epoch matplotlib chooses.
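The magnitude dependence described above can be inspected directly with np.spacing, which gives the gap between a float64 value and the next representable one (the two magnitudes below are illustrative: roughly "days since 1970" versus "days since year 0001" for a 2013-era date):

```python
import numpy as np

# The gap between adjacent float64 values grows with magnitude, so the
# absolute time resolution of "days since epoch" depends on which epoch
# matplotlib picked for a given plot.
for days in (16000.0, 737000.0):
    gap_seconds = np.spacing(days) * 86400  # one ULP, in seconds
    print(f"at {days} days, one ULP is about {gap_seconds:.2e} seconds")
```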

So to solve these problems, we have done two things:

First, we don't bother to try to convert from matplotlib's data ourselves. Instead, we require that instructors provide the expected data in matplotlib's format when using assert_xydata(). The easiest way for instructors to do this is for them to plot the data themselves, create a matplotcheck object from it, and then extract the data using get_xy(). One weird quirk is that matplotlib seems to consistently choose the same epoch when plotting the same dataset. (However, we are unable to predict which epoch this will be for a given dataset, and matplotlib's conversion functions don't always choose the same epoch as when the data is plotted.) This solves the problem of being able to convert the data.

Second, we use numpy.testing.assert_array_max_ulp() for comparing datetime data (or any other type of numeric data). This method of comparison ensures that floating-point roundoff error does not cause the assertion to erroneously fail. However, this cannot prevent truncation error, and therefore cannot prevent a loss of precision. Practically, what this means is that assert_xydata() cannot tell the difference between times with differences of tens of milliseconds. If it can't tell the difference, it will err on the side of passing.
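The roundoff tolerance described above can be seen in a two-line example: 0.1 + 0.2 and 0.3 are different float64 values (exact comparison fails), but they differ by a single unit in the last place, so a ULP-based comparison treats them as equal:

```python
import numpy as np

# assert_array_max_ulp passes when values differ by at most maxulp units
# in the last place, so ordinary floating-point roundoff does not cause
# a spurious assertion failure.
a = np.array([0.1 + 0.2])
b = np.array([0.3])
print(a == b)  # [False] -- exact comparison fails
np.testing.assert_array_max_ulp(a, b, maxulp=1)  # passes: off by one ULP
```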

For more info about the issues we've faced with this, take a look at PR #185 """

lwasser commented 4 years ago

this will be closed via #219 which is almost done!!

marty-larocque commented 4 years ago

@lwasser now that #219 is ready to be merged, this can be closed!

nkorinek commented 4 years ago

This can be closed @lwasser

lwasser commented 4 years ago

closing!!