Closed lwasser closed 4 years ago
Update: This bug is very complicated.
The initial suspicion that this was due to NA values in the dataset turned out to be incorrect. I tested the dataset against itself, and there are no NA values in it to start with. The bug still comes up when the two datasets are identical and contain no NA values.
The next guess we looked into was related to how the data is transformed during the `assert_xydata` check. We noticed that the datetime objects were not timezone aware, i.e. the datatype was `datetime64[ns]`, which turned into `datetime64[ns, UTC]` during the check. Seeing as the error was `TypeError: can't get common type for non-numeric array`, this seemed like it may be the culprit. When importing the data, I added the line `dataframe.index = dataframe.index.tz_localize('UTC')` right after the import was done. This made the index timezone aware and put it in the same timezone as the data being checked.
Side note: I tried to localize the data as MST, seeing as that's the timezone it was collected in; however, this did not do what was intended. It changed the times to be 7 hours earlier than UTC. Additionally, even when the data was timezone aware, it still got changed to the UTC timezone during the check.
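For reference, a minimal sketch of the localization step described above (the column name `temp` and the sample timestamps are made up for illustration):

```python
import pandas as pd

# A timezone-naive datetime index, as the data looked after import.
idx = pd.to_datetime(["2020-01-01 12:00", "2020-01-01 13:00"])
df = pd.DataFrame({"temp": [1.0, 2.0]}, index=idx)
print(df.index.dtype)  # datetime64[ns] -- timezone naive

# tz_localize makes the index timezone aware without shifting the times.
df.index = df.index.tz_localize("UTC")
print(df.index.dtype)  # datetime64[ns, UTC]
```

Localizing to `'MST'` instead would mark the same wall-clock times as UTC-7, which is why the values appeared shifted relative to UTC.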
Even with all data as timezone aware in the same timezone, both the warning and TypeError are still present.
I'll update this further with what else I try to get around this error.
Alright, we found the issue. It turns out the bug and the warning are not as related as we previously suspected; however, they are fixed by the same change to the code.
The warning was being caused by trying to convert a Pandas DataFrame column with a `datetime64[ns]` datatype into a numpy array. So the line in the `assert_xy` function in matplotcheck that converts the datetime columns into numpy arrays is what was causing the warning, i.e. line 940: `np.testing.assert_array_max_ulp(np.array(xy_data["x"]), np.array(xy_expected[xcol]))`.
The error we were getting, `TypeError: can't get common type for non-numeric array`, was caused by the datatype as well. The function `np.testing.assert_array_max_ulp` checks the datatype of the arrays passed in; if the datatype is not float or integer, the check automatically fails. We were under the impression that if `xtime=True` is set, then `assert_xy` would convert the dates to integers automatically. Upon reviewing the code, this did not seem to be the case.
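A small sketch of the failure mode, assuming the numpy behavior we observed: the ulp comparison asks numpy for a common numeric type for the two arrays, which is undefined for datetime data.

```python
import numpy as np
import pandas as pd

# A datetime64 array, like the unconverted x column in assert_xy.
dates = pd.to_datetime(["2020-01-01", "2020-01-02"]).to_numpy()

try:
    np.common_type(dates, dates)
    err_message = None
except TypeError as err:
    # In our runs this is "can't get common type for non-numeric array".
    err_message = str(err)
print(err_message)
```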
So the fix is relatively straightforward. We will be adding the following chunk of code:
```python
if xtime and 'datetime' in str(xy_data.dtypes['x']):
    xy_data["x"] = mdates.date2num(xy_data["x"])
if xtime and 'datetime' in str(xy_expected.dtypes[xcol]):
    xy_expected[xcol] = mdates.date2num(xy_expected[xcol])
```
This converts the dates to numbers which can then be properly checked by `np.testing.assert_array_max_ulp`. And since the columns are now numeric, converting them to a numpy array no longer throws the warning. It also checks that the datatypes are indeed datetime objects before making the transformation.
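A standalone sketch of what the conversion does (the sample DataFrame is made up; `mdates.date2num` is matplotlib's converter to float day counts):

```python
import pandas as pd
import matplotlib.dates as mdates

# A hypothetical xy_data frame with a datetime x column.
xy_data = pd.DataFrame({"x": pd.to_datetime(["2020-01-01", "2020-01-02"])})

# Same guard as the fix: only convert columns that are actually datetimes.
if "datetime" in str(xy_data.dtypes["x"]):
    xy_data["x"] = mdates.date2num(xy_data["x"])

print(xy_data["x"].dtype)  # float64 -- safe for assert_array_max_ulp
```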
I'll make a PR with these changes and a few tests to make sure they behave properly. @lwasser
awesome @nkorinek thank you!!!
gist explaining my current dilemma: https://gist.github.com/nkorinek/accc2b905145763dc2035300a895204c
it looks like we have a fix for this in #185 !! yay! @ryla5068 will test this fix against homework 1, and if it works, he will update the tests there that don't work now... and then we can merge this PR!! Next step will be writing tests for the time series module!!
notes about this issue
"""" matplotlib stores datetime data in a very... unique way. It stores everything as the number of days since some epoch. If you plot a dataframe containing datetime.datetime objects or pandas.Timestamp objects, it will convert it days since epoch. Sometimes matplotlib chooses Jan 1, 1970 as the epoch. Other times it chooses Jan 1, 0001. If your data contains time data (i.e. higher precision than just dates), matplotlib will store it as fractional days since epoch, down to millisecond precision (or whatever precision your data is in). For datetime data between these epochs, sometimes it will choose to store it as negative days since 1970, other times it will store it as positive days since the year 0001.
matplotlib DOES provide functions for converting data from this weird format back to datetime.datetime or pandas.Timestamp. However, these functions always assume that the 1970 epoch was used.
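A quick sketch of those conversion functions; the round trip only works when both directions agree on the epoch (recent matplotlib versions default to 1970-01-01):

```python
import datetime
import matplotlib.dates as mdates

d = datetime.datetime(2020, 1, 1)
num = mdates.date2num(d)     # float days since the configured epoch
back = mdates.num2date(num)  # assumes the same epoch when converting back
print(back)                  # a timezone-aware UTC datetime
```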
matplotlib's documentation claims that negative values for datetime data are not supported, and therefore data representing dates before 1970 are not supported. However, matplotlib will happily plot data before 1970 and its conversion functions will happily accept negative numbers and try to convert them.
As you might imagine, this presents a number of issues for comparing datetime data. Most obviously, it gets unreliable when we have to guess which epoch matplotlib chose to use. We have tried a few different methods here: different ways of converting the data, converting using both epochs and comparing both, etc. All of them were pretty messy.
Additionally, there is the issue of floating point truncation error. matplotlib stores this data with numpy.float64, which has 52 mantissa bits, or about 15 base-10 digits of accuracy. Since the number of days since epoch is often in the tens-of-thousands, this means that matplotlib may not be able to accurately represent data with millisecond precision. (Basically, the datatype isn't able to store such a huge number with such small precision.) The actual available precision will depend on the dates being used and the epoch matplotlib chooses.
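A rough illustration of this precision limit: the smallest representable step (one ulp) of a float64 day count grows with the size of the count, so the usable time resolution depends on which epoch was chosen. The day counts below are approximately Jan 1, 2020 measured from the 1970 epoch and from the 0001 epoch.

```python
import numpy as np

for days in (18262.0, 737425.0):  # ~2020-01-01 from 1970 vs from 0001
    # np.spacing gives the gap to the next representable float64.
    ulp_seconds = np.spacing(days) * 86400
    print(f"{days}: one ulp ~= {ulp_seconds:.2e} seconds")
```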
So to solve these problems, we have done two things:
First, we don't bother to try to convert from matplotlib's data ourselves. Instead, we require that instructors provide the expected data in matplotlib's format when using assert_xydata(). The easiest way for instructors to do this is for them to plot the data themselves, create a matplotcheck object from it, and then extract the data using get_xy(). One weird quirk is that matplotlib seems to consistently choose the same epoch when plotting the same dataset. (However, we are unable to predict which epoch this will be for a given dataset, and matplotlib's conversion functions don't always choose the same epoch as when the data is plotted.) This solves the problem of being able to convert the data.
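matplotcheck's `get_xy()` handles the extraction for instructors; as a rough approximation of the idea using plain matplotlib, one can read the already-converted float day counts straight back off the plotted line:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt
import pandas as pd

dates = pd.to_datetime(["2020-01-01", "2020-01-02"])
fig, ax = plt.subplots()
ax.plot(dates, [1, 2])

# orig=False returns the data after matplotlib's unit conversion,
# i.e. float days since whatever epoch matplotlib chose for this plot.
x_float = ax.lines[0].get_xdata(orig=False)
print(x_float)
```

Whichever epoch matplotlib picked, consecutive days differ by exactly 1.0 in this representation.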
Second, we use numpy.testing.assert_array_max_ulp() for comparing datetime data (or any other type of numeric data). This method of comparison ensures that floating-point roundoff error does not cause the assertion to erroneously fail. However, this cannot prevent truncation error, and therefore cannot prevent a loss of precision. Practically, what this means is that assert_xydata() cannot tell the difference between times with differences of tens of milliseconds. If it can't tell the difference, it will err on the side of passing.
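A minimal sketch of ulp-based comparison (the day count 18262.0 is just an example value): two floats one ulp apart compare equal, while a larger difference fails.

```python
import numpy as np

a = np.array([18262.0])
b = a + np.spacing(a)  # exactly one ulp away from a

# Within the default tolerance of one ulp: no exception is raised.
np.testing.assert_array_max_ulp(a, b, maxulp=1)

try:
    # A difference of 1e-6 days is many ulps at this magnitude.
    np.testing.assert_array_max_ulp(a, a + 1e-6, maxulp=1)
    failed = False
except AssertionError:
    failed = True
print("larger difference fails:", failed)
```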
For more info about the issues we've faced with this, take a look at PR #185 """
this will be closed via #219 which is almost done!!
@lwasser now that #219 is ready to be merged, this can be closed!
This can be closed @lwasser
closing!!
Output error: I suspect this has something to do with NaN values, but I am not sure.