UBOdin / mimir

Data-ish exploration through SQL+Uncertainty
http://mimirdb.info
Apache License 2.0

The shape detector lens is not producing sensible caveats #363

Closed okennedy closed 4 years ago

okennedy commented 4 years ago

Ok... so this was a case of reality (and me) being stupider than I'd originally expected.

The specific error being seen was based on the NYC Cause of Death test files (now in the repo under /test/NYC_CoD). These two files come from the NYS open data portal and include data for 2008-2014 and 2008-2016 respectively. When given to the shape detector:

:man_facepalming:

These errors result from a combination of several issues that should now be resolved (as of 8a376308248b9b5c75080f687808ca89b8222db9).

CAST Inconsistency

Spark's CAST operation evaluates CAST('.' AS bigint) to 0, while Mimir's evaluates it to NULL. Curiously, CAST('.' AS double) evaluates to NULL in Spark as well...

The inconsistency between Mimir and Spark needs to be fixed (#364), but it was particularly pronounced here because evaluation of a single query could be split between Mimir and Spark. This was due to an old optimization: Mimir would take over evaluation of the final stages of a query, since these usually had Scala UDFs and SQLite had rather poor performance due to repeated crossing of the JVM boundary.

I simplified the compiler pipeline, removing this unnecessary optimization. The behavior of the Null-ish facets should now be more stable and, in particular, users shouldn't see query results that differ based on query complexity.
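To illustrate why split evaluation made the facets unstable, here is a minimal Python sketch. It is not Mimir's or Spark's code: `cast_bigint_lenient`, `cast_bigint_strict`, and `count_nulls` are hypothetical names, and the column values are made up. The two cast functions only mimic the behaviors reported above ('.' becoming 0 in one engine and NULL in the other).

```python
from typing import Callable, Iterable, Optional


def cast_bigint_lenient(text: str) -> Optional[int]:
    """Hypothetical engine A: a lone '.' becomes 0 (the behavior reported for Spark)."""
    if text == ".":
        return 0
    try:
        return int(text)
    except ValueError:
        return None  # None stands in for SQL NULL


def cast_bigint_strict(text: str) -> Optional[int]:
    """Hypothetical engine B: anything that is not an integer becomes NULL."""
    try:
        return int(text)
    except ValueError:
        return None


def count_nulls(values: Iterable[str], cast: Callable[[str], Optional[int]]) -> int:
    """Count how many values turn into NULL under the given cast."""
    return sum(1 for v in values if cast(v) is None)


# Made-up column values, for illustration only.
deaths = ["12", ".", "7", "."]

nulls_engine_a = count_nulls(deaths, cast_bigint_lenient)
nulls_engine_b = count_nulls(deaths, cast_bigint_strict)
```

The same column reports a different number of NULLs depending on which engine ends up running the cast, which is how a null-related facet could change with the shape of the query rather than with the data.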

Legitimate Data Errors

So... it turns out that in the 2015/2016 data dump, NYC added blank cells as missing values. These are universally interpreted as NULL by Mimir and Spark, so the second dataset had a mix of '.'s and ''s in the DEATHS column, and legitimately had nulls where the first dataset did not.
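A quick way to check for this kind of drift is to tally the raw missing-value markers in a column before any casting happens. The following is a small Python sketch with a hypothetical helper, `missing_markers`, and made-up sample values (not rows from the actual files).

```python
from collections import Counter
from typing import Iterable


def missing_markers(column: Iterable[str]) -> Counter:
    """Tally the raw markers ('.' or an empty cell) that stand in for a missing value."""
    return Counter(v for v in column if v in (".", ""))


# Made-up samples: the older file only uses '.', the newer one mixes '.' and blanks.
older_deaths = ["5", ".", "9"]
newer_deaths = ["5", ".", "", "9", ""]

older_counts = missing_markers(older_deaths)
newer_counts = missing_markers(newer_deaths)
```

Comparing the two tallies shows whether a new marker (here the empty string) appeared in the later file.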

Senility

I could have sworn that there were domain facets already implemented. I could have also sworn we had an oxfordComma method in StringUtils. Apparently, I was wrong on both counts... at least until now. There are two new facet types: DrawnFromRange locks in the min/max values of a sequential-typed column, and DrawnFromDomain locks in the set of distinct values of any string-typed column with fewer than 20 distinct values. Together, these now correctly handle the two test data files.
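As a rough illustration of the two facet types, here is a Python sketch. It is not the Scala implementation in Mimir: the class shapes, the `violations` and `detect_domain` helpers, and the sample values are assumptions made for the example. Only the facet names, the fewer-than-20 threshold, and the idea of an Oxford-comma helper come from the description above.

```python
from typing import Iterable, List, Optional


def oxford_comma(items: Iterable[str]) -> str:
    """Join items as 'a', 'a and b', or 'a, b, and c'."""
    items = list(items)
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    return ", ".join(items[:-1]) + ", and " + items[-1]


class DrawnFromRange:
    """Locks in the min/max seen in a column; later values outside it are flagged."""

    def __init__(self, column: str, values: Iterable[int]):
        values = list(values)
        self.column = column
        self.low, self.high = min(values), max(values)

    def violations(self, values: Iterable[int]) -> List[int]:
        return [v for v in values if v < self.low or v > self.high]


class DrawnFromDomain:
    """Locks in the set of distinct strings seen in a column."""

    def __init__(self, column: str, values: Iterable[str]):
        self.column = column
        self.domain = set(values)

    def violations(self, values: Iterable[str]) -> List[str]:
        return sorted(set(values) - self.domain)


def detect_domain(column: str, values: Iterable[str],
                  max_distinct: int = 20) -> Optional[DrawnFromDomain]:
    """Only build a domain facet when the column has fewer than max_distinct values."""
    values = list(values)
    if len(set(values)) < max_distinct:
        return DrawnFromDomain(column, values)
    return None


# Made-up example: train on the first file's years, check the second file's years.
year_facet = DrawnFromRange("YEAR", range(2008, 2015))
out_of_range = year_facet.violations([2008, 2015, 2016])

sex_facet = detect_domain("SEX", ["F", "M", "F"])
unexpected = sex_facet.violations(["F", "M", "Female", "Male"])
message = f"Unexpected values in SEX: {oxford_comma(unexpected)}"
```

In this sketch a facet is learned from one dataset and then checked against another, so values such as the extra years in a longer date range show up as violations rather than passing silently.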