Closed nelio2k closed 4 years ago
Based on some discussion about what is the right thing to do about NaN, as well as the complicated job of documenting gojsonsm behavior for the public when mixed type comparisons are used, I thought that it may be time to revisit the "validity" comparison argument.
The last commit push is to help make the matcher a bit more deterministic in its result. The idea is to mirror math.NaN comparisons that "if it doesn't make sense, don't return a deterministic true or false, but just fail the match". To do this, we'll just simply set binTreeState to Resolved, instead of true or false (which is subjected to NOT/NEOR ops).
With the proposed changes, the following unit tests need to be changed to align with the proposed behavior.
Test 1:
func TestMatcherMissingNotEquals(t *testing.T) {
// This tests a specific case where a missing value should actually
// result in a truthy result in the expression. Due to the nature
// of our bintree implementation, this needs special handling.
runJSONExprMatchTest(t, `
["not",
["equals",
["field", "sometimesValue"],
["value", "true"]
]
]
`, []string{
"5b47eb0950e9076fc0aecd52",
"5b47eb093771f06ced629663",
"5b47eb09ffac5a6ce37042e7",
"5b47eb095c3ad73b9925f7f8",
"5b47eb0962222a37d066e231",
"5b47eb09996a4154c35b2f98",
"5b47eb091f57571d3c3b1aa1",
"5b47eb098eee4b4c4330ec64",
})
}
Document 5b47eb0950e9076fc0aecd52 originally was marked as expected to return “true” in the final output. However, the person document below actually does have “sometimesValue” as a key, and has a boolean as a type:
{
"_id": "5b47eb0950e9076fc0aecd52",
…
"sometimesValue": true,
}
Previously, the matcher was returning “true” for this match because:
Given the proposed change, this matcher is now returning a “false” for this match because:
Test 2:
func TestMatcherDisparateTypeEquals(t *testing.T) {
// TODO(brett19): Should probably discuss whether type-cast equals
// actually makes sense... This validates that these something like:
// (true == "thisShouldBeABoolean") === true
// which may not actually make a whole lot of sense...
runJSONExprMatchTest(t, `
["equals",
["field", "sometimesValue"],
["value", "thisShouldBeABoolean"]
]
`, []string{
"5b47eb0936ff92a567a0307e",
"5b47eb096b1d911c0b9492fb",
})
}
Similar match logic as above. Comparing a string “thisShouldBeABoolean” to an actual boolean (true) in the documents doesn’t really make sense (as noted in the inline comment)
The proposed behavior is definitive: Gojsonsm will try to parse the data as a string to do string comparison… but it’s not possible to convert a boolean to a string. So the match behavior is undefined and will ensure that this match def does not pass.
As such, the above test is removed as a part of the PR.
@brett19 wondering if you had a chance to take a look?
Hey @nelio2k , Having the matcher always count a failed comparison as false means that in some cases where customers are looking for a negated equality with the assumption that it not being equal (no matter the type) that it continues, they are going to get some confusing behaviour. I think we need to come up with a specific set of comparison semantics that are relatively simple to understand but that can cover all the cases. There are many examples of these in other DBs. Cheers, Brett
Oracle seems to have the data type comparison listed here: https://docs.oracle.com/en/database/oracle/oracle-database/18/sqlrf/Data-Type-Comparison-Rules.html#GUID-4C49C87F-F170-43CC-9EDC-2403576610DF
Some interesting things are:
Oracle Database automatically converts a value from one data type to another when such a conversion makes sense.
And:
Implicit data type conversion can have a negative impact on performance, especially if the data type of a column value is converted to that of a constant rather than the other way around.
Understandably, testing for negated equality between different data types may be consisted of a valid use case. I think it can be a special case on top of what's currently in this PR. With that said, this PR may not be the right place to address all possible implicit type conversion so that they can be compared. But I do want to take this opportunity and make it consistent, with respect to adding the valid flags and all that.
In any case, I think we can document mismatching data type comparisons like so (and I will make some changes in this PR to ensure the following behavior):
First, implicitly convert RHS datatype to the LHS datatype to do comparison. If the comparison is invalid, and equality (or match) was requested, return not equal/match. Otherwise, convert the LHS datatype to RHS and perform the comparison again.
Implicit conversion between types are listed below:
Numerical to String - attempt to convert string to source numerical value for comparison. If not possible, invalid operation.
String to Numerical - Attempt to convert string to target numerical value for comparison. If not possible, invalid operation.
Boolean to String - Boolean values will convert to string literal of "true" and "false".
String to Boolean - Valid if string is of the constant of case-insensitive “true” or “false”, with true > false if value comparison operation is requested. Invalid otherwise.
Boolean to Int/Uint - Valid, with "true" converting to 1 and "false" converting to 0; integer comparison then takes place.
Int/Uint to Boolean - Valid with (anything != 0) becoming true and > 0 or false becoming 0; true > false.
Boolean to Float - Valid, with "true" converting to 1.0 and "false" converting to 0.0; float comparison then takes place.
Float to Boolean - Valid with (anything != 0.0) becoming true and false becoming 0; true > false.
Null-Value comparison - When a null value is involved with comparison with non-null value, null comparisons are performed, with non-null > null.
We’ll keep the data type comparisons for the rest. https://github.com/couchbaselabs/gojsonsm/blob/8a68e74c0f26db05e9611caf5256e9c9b1310d9a/fastval.go#L474-L480 One thing to be smarter about for the data type comparison is we can test for equality not just on data type equality, but the output String() of the datatype and do string comparison.
For invalid operation: Equality will return false, so negate of equality will return true. Any other non-equality comparison will return false. Negate of any non-equality comparison will also return false. (i.e. NaN)
I think the above should be understandable and documentable, and doesn’t really impact our performance.
Since toy-builds need gerrit change-IDs, most of the performance can be measured using the existing benchmark tests. In a nutshell, performance degradation should be negligible.
neil.huang@NeilsMacbookPro:~/source/couchbase/godeps/src/github.com/couchbase/gojsonsm$ git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
neil.huang@NeilsMacbookPro:~/source/couchbase/godeps/src/github.com/couchbase/gojsonsm$ go test -tags=perf -bench=.
goos: darwin
goarch: amd64
pkg: github.com/couchbase/gojsonsm
BenchmarkBinTree-8 20000000 66.4 ns/op
BenchmarkMatcher-8 2000 806812 ns/op 1216.35 MB/s
BenchmarkSlowMatcher-8 100 18635258 ns/op 52.66 MB/s
BenchmarkParseString-8 100000000 10.4 ns/op
BenchmarkParseBigString-8 100000000 10.4 ns/op
BenchmarkParseEscString-8 30000000 49.8 ns/op
BenchmarkParseBigEscString-8 10000000 165 ns/op
BenchmarkParseInteger-8 100000000 11.6 ns/op
BenchmarkParseNumber-8 30000000 46.4 ns/op
BenchmarkParseNull-8 200000000 6.98 ns/op
BenchmarkParseTrue-8 200000000 6.93 ns/op
BenchmarkParseFalse-8 200000000 8.11 ns/op
BenchmarkTokenize-8 30000 43966 ns/op 355.41 MB/s
And
neil.huang@NeilsMacbookPro:~/source/couchbase/godeps/src/github.com/couchbase/gojsonsm$ git checkout MB-36088
Switched to branch 'MB-36088'
neil.huang@NeilsMacbookPro:~/source/couchbase/godeps/src/github.com/couchbase/gojsonsm$ go test -tags=perf -bench=.
goos: darwin
goarch: amd64
pkg: github.com/couchbase/gojsonsm
BenchmarkBinTree-8 20000000 67.1 ns/op
BenchmarkMatcher-8 2000 833491 ns/op 1177.41 MB/s
BenchmarkSlowMatcher-8 100 19491300 ns/op 50.35 MB/s
BenchmarkParseString-8 100000000 11.1 ns/op
BenchmarkParseBigString-8 100000000 11.7 ns/op
BenchmarkParseEscString-8 30000000 51.0 ns/op
BenchmarkParseBigEscString-8 10000000 168 ns/op
BenchmarkParseInteger-8 100000000 12.5 ns/op
BenchmarkParseNumber-8 30000000 47.5 ns/op
BenchmarkParseNull-8 200000000 8.30 ns/op
BenchmarkParseTrue-8 200000000 8.22 ns/op
BenchmarkParseFalse-8 200000000 8.24 ns/op
BenchmarkTokenize-8 30000 44242 ns/op 353.19 MB/s
This PR started off as to address NaN comparisons and then became larger and grander in scale. It now addresses how gojsonsm works with different data types, how implicit conversion (if there is a need) works, and also better defined collation as well. The documentation explaining the though process and details can be found here: https://docs.google.com/document/d/1aoN0eAS0S88-N3KYwHjVA5dW6IXXwW_KKNidQS5IG6I/edit?pli=1#heading=h.yqp9lhieqak3