couchbaselabs / gojsonsm

Go implementation of my JSONSM algorithm.
9 stars 7 forks source link

MB-36088 - implement NaN float comparisons #119

Closed nelio2k closed 4 years ago

nelio2k commented 5 years ago

This PR started off as to address NaN comparisons and then became larger and grander in scale. It now addresses how gojsonsm works with different data types, how implicit conversion (if there is a need) works, and also better defined collation as well. The documentation explaining the though process and details can be found here: https://docs.google.com/document/d/1aoN0eAS0S88-N3KYwHjVA5dW6IXXwW_KKNidQS5IG6I/edit?pli=1#heading=h.yqp9lhieqak3

nelio2k commented 4 years ago

Based on some discussion about what is the right thing to do about NaN, as well as the complicated job of documenting gojsonsm behavior for the public when mixed type comparisons are used, I thought that it may be time to revisit the "validity" comparison argument.

The last commit push is to help make the matcher a bit more deterministic in its result. The idea is to mirror math.NaN comparisons that "if it doesn't make sense, don't return a deterministic true or false, but just fail the match". To do this, we'll just simply set binTreeState to Resolved, instead of true or false (which is subjected to NOT/NEOR ops).

With the proposed changes, the following unit tests need to be changed to align with the proposed behavior.

Test 1:

func TestMatcherMissingNotEquals(t *testing.T) {
    // This tests a specific case where a missing value should actually
    // result in a truthy result in the expression.  Due to the nature
    // of our bintree implementation, this needs special handling.
    runJSONExprMatchTest(t, `
        ["not",
          ["equals",
          ["field", "sometimesValue"],
          ["value", "true"]
        ]
      ]
    `, []string{
        "5b47eb0950e9076fc0aecd52",
        "5b47eb093771f06ced629663",
        "5b47eb09ffac5a6ce37042e7",
        "5b47eb095c3ad73b9925f7f8",
        "5b47eb0962222a37d066e231",
        "5b47eb09996a4154c35b2f98",
        "5b47eb091f57571d3c3b1aa1",
        "5b47eb098eee4b4c4330ec64",
    })
}

Document 5b47eb0950e9076fc0aecd52 originally was marked as expected to return “true” in the final output. However, the person document below actually does have “sometimesValue” as a key, and has a boolean as a type:

{
      "_id": "5b47eb0950e9076fc0aecd52",
…
      "sometimesValue": true,
}

Previously, the matcher was returning “true” for this match because:

  1. It first takes the original expression of “true”
  2. It then tries to convert the boolean from the document, true, into a Json string for comparison.
  3. Fastval’s ToJsonString() returns an error, but compareStrings() did not check for such error.
  4. It then does a string comparison between “true” with a nil string, which is NOT equal, and thus this document match returned a “true” result.

Given the proposed change, this matcher is now returning a “false” for this match because:

  1. It first takes the original expression of “true”
  2. It then tries to convert the boolean from the document, true, into a Json string for comparison.
  3. This comparison is invalid because we cannot coerce a booleantlype into a string festival.
  4. The invalid op flag pops all the way up, and thus the match result is false.

Test 2:

func TestMatcherDisparateTypeEquals(t *testing.T) {
    // TODO(brett19): Should probably discuss whether type-cast equals
    // actually makes sense... This validates that these something like:
    //  (true == "thisShouldBeABoolean") === true
    // which may not actually make a whole lot of sense...
    runJSONExprMatchTest(t, `
        ["equals",
        ["field", "sometimesValue"],
        ["value", "thisShouldBeABoolean"]
      ]
    `, []string{
        "5b47eb0936ff92a567a0307e",
        "5b47eb096b1d911c0b9492fb",
    })
}

Similar match logic as above. Comparing a string “thisShouldBeABoolean” to an actual boolean (true) in the documents doesn’t really make sense (as noted in the inline comment)

The proposed behavior is definitive: Gojsonsm will try to parse the data as a string to do string comparison… but it’s not possible to convert a boolean to a string. So the match behavior is undefined and will ensure that this match def does not pass.

As such, the above test is removed as a part of the PR.

nelio2k commented 4 years ago

@brett19 wondering if you had a chance to take a look?

brett19 commented 4 years ago

Hey @nelio2k , Having the matcher always count a failed comparison as false means that in some cases where customers are looking for a negated equality with the assumption that it not being equal (no matter the type) that it continues, they are going to get some confusing behaviour. I think we need to come up with a specific set of comparison semantics that are relatively simple to understand but that can cover all the cases. There are many examples of these in other DBs. Cheers, Brett

nelio2k commented 4 years ago

Oracle seems to have the data type comparison listed here: https://docs.oracle.com/en/database/oracle/oracle-database/18/sqlrf/Data-Type-Comparison-Rules.html#GUID-4C49C87F-F170-43CC-9EDC-2403576610DF

Some interesting things are:

Oracle Database automatically converts a value from one data type to another when such a conversion makes sense.

And:

Implicit data type conversion can have a negative impact on performance, especially if the data type of a column value is converted to that of a constant rather than the other way around.

Understandably, testing for negated equality between different data types may be consisted of a valid use case. I think it can be a special case on top of what's currently in this PR. With that said, this PR may not be the right place to address all possible implicit type conversion so that they can be compared. But I do want to take this opportunity and make it consistent, with respect to adding the valid flags and all that.

In any case, I think we can document mismatching data type comparisons like so (and I will make some changes in this PR to ensure the following behavior):

First, implicitly convert RHS datatype to the LHS datatype to do comparison. If the comparison is invalid, and equality (or match) was requested, return not equal/match. Otherwise, convert the LHS datatype to RHS and perform the comparison again.

Implicit conversion between types are listed below:

Numerical to String - attempt to convert string to source numerical value for comparison. If not possible, invalid operation.
String to Numerical - Attempt to convert string to target numerical value for comparison. If not possible, invalid operation.
Boolean to String - Boolean values will convert to string literal of "true" and "false".
String to Boolean - Valid if string is of the constant of case-insensitive “true” or “false”, with true > false if value comparison operation is requested. Invalid otherwise.
Boolean to Int/Uint - Valid, with "true" converting to 1 and "false" converting to 0;  integer comparison then takes place.
Int/Uint to Boolean - Valid with (anything != 0) becoming true and > 0 or false becoming 0; true > false.
Boolean to Float - Valid, with "true" converting to 1.0 and "false" converting to 0.0; float comparison then takes place.
Float to Boolean - Valid with (anything != 0.0) becoming true and false becoming 0; true > false.
Null-Value comparison - When a null value is involved with comparison with non-null value, null comparisons are performed, with non-null > null.

We’ll keep the data type comparisons for the rest. https://github.com/couchbaselabs/gojsonsm/blob/8a68e74c0f26db05e9611caf5256e9c9b1310d9a/fastval.go#L474-L480 One thing to be smarter about for the data type comparison is we can test for equality not just on data type equality, but the output String() of the datatype and do string comparison.

For invalid operation: Equality will return false, so negate of equality will return true. Any other non-equality comparison will return false. Negate of any non-equality comparison will also return false. (i.e. NaN)

I think the above should be understandable and documentable, and doesn’t really impact our performance.

nelio2k commented 4 years ago

Since toy-builds need gerrit change-IDs, most of the performance can be measured using the existing benchmark tests. In a nutshell, performance degradation should be negligible.

neil.huang@NeilsMacbookPro:~/source/couchbase/godeps/src/github.com/couchbase/gojsonsm$ git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
neil.huang@NeilsMacbookPro:~/source/couchbase/godeps/src/github.com/couchbase/gojsonsm$ go test -tags=perf -bench=.
goos: darwin
goarch: amd64
pkg: github.com/couchbase/gojsonsm
BenchmarkBinTree-8              20000000            66.4 ns/op
BenchmarkMatcher-8                  2000        806812 ns/op    1216.35 MB/s
BenchmarkSlowMatcher-8               100      18635258 ns/op      52.66 MB/s
BenchmarkParseString-8          100000000           10.4 ns/op
BenchmarkParseBigString-8       100000000           10.4 ns/op
BenchmarkParseEscString-8       30000000            49.8 ns/op
BenchmarkParseBigEscString-8    10000000           165 ns/op
BenchmarkParseInteger-8         100000000           11.6 ns/op
BenchmarkParseNumber-8          30000000            46.4 ns/op
BenchmarkParseNull-8            200000000            6.98 ns/op
BenchmarkParseTrue-8            200000000            6.93 ns/op
BenchmarkParseFalse-8           200000000            8.11 ns/op
BenchmarkTokenize-8                30000         43966 ns/op     355.41 MB/s

And

neil.huang@NeilsMacbookPro:~/source/couchbase/godeps/src/github.com/couchbase/gojsonsm$ git checkout MB-36088
Switched to branch 'MB-36088'
neil.huang@NeilsMacbookPro:~/source/couchbase/godeps/src/github.com/couchbase/gojsonsm$ go test -tags=perf -bench=.
goos: darwin
goarch: amd64
pkg: github.com/couchbase/gojsonsm
BenchmarkBinTree-8              20000000            67.1 ns/op
BenchmarkMatcher-8                  2000        833491 ns/op    1177.41 MB/s
BenchmarkSlowMatcher-8               100      19491300 ns/op      50.35 MB/s
BenchmarkParseString-8          100000000           11.1 ns/op
BenchmarkParseBigString-8       100000000           11.7 ns/op
BenchmarkParseEscString-8       30000000            51.0 ns/op
BenchmarkParseBigEscString-8    10000000           168 ns/op
BenchmarkParseInteger-8         100000000           12.5 ns/op
BenchmarkParseNumber-8          30000000            47.5 ns/op
BenchmarkParseNull-8            200000000            8.30 ns/op
BenchmarkParseTrue-8            200000000            8.22 ns/op
BenchmarkParseFalse-8           200000000            8.24 ns/op
BenchmarkTokenize-8                30000         44242 ns/op     353.19 MB/s