timbray closed this issue 4 months ago
Oh, that's a neat find. I think we should not match either, unless the user has told us how to handle higher-precision digits. I will ask a few folks to confirm we haven't missed anything obvious. Hoping I can send a PR tomorrow or next week (away in between for the long weekend).
My approach was just to count the digit characters (stopping when I saw an 'e' or 'E') and, if there were more than 9, refuse to generate. I probably need to do some fuzzing to see if there are any numerical patterns that cause incorrect results. Switching to ordinary string matching will produce the correct result almost always, but there is a corner case: 1234567890.0 won't match 1234567890, because you lose the number semantics once you get outside the ComparableNumber precision. Hmm, I remember when we rolled out Ruler 2 and you guys did a ton of testing, and the only changes were that rules like {"x": [ 35 ] } started to match {"x": 35.0}, where they had failed before.
I'm reading https://en.wikipedia.org/wiki/Double-precision_floating-point_format (so many years since I learned this stuff) and apparently double floats offer 15 to 17 decimal digits of precision because there are 53 bits of mantissa. So what's limiting the precision isn't the use of doubles but the restriction in ComparableNumber to 14 hex digits.
So in principle I guess that ComparableNumber could be changed to be a lot more precise. 53 bits is about 7 bytes, i.e. 14 hex digits. Of course this could increase memory usage while adding little real-world value. If you read the comments in ComparableNumber.java there's the idea of using tricks like 32/64 radix to compress.
Not sure what the right thing to do is… I note that the numbers in CityLots have 17 digits of precision. Hmm…
Just started looking at the code. I'm naively thinking we should throw errors, similar to when a number is beyond our precision range, while it's still a double, here: https://github.com/aws/event-ruler/blob/main/src/main/software/amazon/event/ruler/ComparableNumber.java#L49
Currently testing a bunch of edge cases to see if it would work.
FWIW, here's the current Go code (not yet pushed) from Quamina:
func comparableFromBytes(s []byte) (comparableNumber, error) {
    // first of all, refuse to try to convert if there are too many digits
    digits := 0
    for _, utf8Byte := range s {
        if utf8Byte >= ASCII0 && utf8Byte <= ASCII9 {
            digits++
        }
        if utf8Byte == 'e' || utf8Byte == 'E' {
            break
        }
    }
    if digits > MaxComparableDigits {
        return nil, errors.New("number too long")
    }
    numeric, err := strconv.ParseFloat(string(s), 64)
    if err != nil {
        return nil, errors.New("not a float")
    }
    return comparableFromFloat(numeric)
}
Note that it still needs work to not count leading zeroes…
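For comparison, here's how that guard might look on the Java side, extended to skip leading zeroes. This is a hypothetical sketch, not Ruler's actual API (significantDigits is my name); trailing zeroes still count, as in the Go version above.

```java
public class DigitCount {
    // Count decimal digits in a number's textual form, skipping leading
    // zeroes and stopping at an exponent marker, as the Go snippet does.
    static int significantDigits(String s) {
        int digits = 0;
        boolean seenNonZero = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == 'e' || c == 'E') {
                break; // stop at the exponent, as the Go version does
            }
            if (c >= '0' && c <= '9') {
                if (c != '0') {
                    seenNonZero = true;
                }
                if (seenNonZero) {
                    digits++; // leading zeroes are not counted
                }
            }
        }
        return digits;
    }

    public static void main(String[] args) {
        System.out.println(significantDigits("0.00000033"));   // 2
        System.out.println(significantDigits("1234567890.0")); // 11
        System.out.println(significantDigits("-4e-8"));        // 1
    }
}
```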
So what's limiting the precision isn't the use of doubles but the restriction in ComparableNumber to 14 hex digits. ... in principle I guess that ComparableNumber could be changed to be a lot more precise
I wondered about this as well but didn't look into it because the benefits seemed low.
Yeah, there's a lot to like about the treat-as-string-if-too-precise because ComparableNumber is really designed to help match things like 35.0 vs 35 vs 3.5e1, and that kind of variation isn't seen so much in really precise numbers like the kind in CityLots.
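That kind of variation is easy to demonstrate: "35", "35.0", and "3.5e1" are three different byte strings for the same number, and parsing to double canonicalizes them, which is the property a ComparableNumber-style representation builds on. A minimal standalone illustration:

```java
public class SameNumber {
    public static void main(String[] args) {
        // Three textual spellings of the same number parse to the identical
        // double, so any encoding derived from the double matches all three.
        double a = Double.parseDouble("35");
        double b = Double.parseDouble("35.0");
        double c = Double.parseDouble("3.5e1");
        System.out.println(a == b && b == c); // true
    }
}
```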
But how about { "numeric": [ ">", 37.807807921694092, "<=", 37.807807921695 ] } ?
I haven't looked at the code but I guess that kind of pattern should be rejected, unless we're willing to increase the precision of ComparableNumber?
I'm finding a bunch of numbers across Ruler with precision above 6 decimals. Checking the old commit history to see whether these were intentional:
@Test
public void testNumericExpressions() {
    String[] goods = {
        "[\"=\", 3.8]", "[\"=\", 0.00000033]", "[\"=\", -4e-8]", "[\"=\", 55555]",
        "[\"<\", 3.8]", "[\"<\", 0.00000033]", "[\"<\", -4e-8]", "[\"<\", 55555]",
        "[\">\", 3.8]", "[\">\", 0.00000033]", "[\">\", -4e-8]", "[\">\", 55555]",
        "[\"<=\", 3.8]", "[\"<=\", 0.00000033]", "[\"<=\", -4e-8]", "[\"<=\", 55555]",
        "[\">=\", 3.8]", "[\">=\", 0.00000033]", "[\">=\", -4e-8]", "[\">=\", 55555]",
        "[\">\", 0, \"<\", 1]", "[\">=\", 0, \"<\", 1]",
        "[\">\", 0, \"<=\", 1]", "[\">=\", 0, \"<=\", 1]"
    };
...
and
@Test
public void WHEN_WildlyVaryingNumberFormsAreProvided_THEN_TheGeneratedStringsAreSortable() {
    double[] data = {
        -Constants.FIVE_BILLION, -4_999_999_999.99999, -4_999_999_999.99998, -4_999_999_999.99997,
        -999999999.99999, -999999999.99, -10000, -122.413496524705309, -0.000002,
        0, 0.000001, 3.8, 3.9, 11, 12, 122.415028278886751, 2.5e4, 999999999.999998, 999999999.999999,
        4_999_999_999.99997, 4_999_999_999.99998, 4_999_999_999.99999, Constants.FIVE_BILLION
    };
The second one seems unintentional for sure.
Yeah, pretty sure I wrote that second one and totally wasn't thinking about number-of-digits issues. The first one does look like someone was trying to explore a precision issue?
I spent quite a bit of time investigating this. I believe the code does what it says: works correctly when the range is between ±5e9, with at most 6 digits right of the decimal. This seems to work quite well in practical terms.
So I guess for Quamina I will apply those criteria. Since we have a textual version of the number, counting the digits right of the decimal point is easy.
Yup. It looks like it's been there since the beginning. The tests passed until now because Ruler ignored precision. Once I added a change to fail rule-matching when numbers with more than 6 decimals were passed, the test started to fail.
I'm thinking the change needs to be more refined: ignore precision for ordering comparisons (less-than, greater-than) for backward compatibility, but for equality checks, fail if a number has more than 6 decimal places.
That ComparableNumber code was written in ~2017-18 so it took me a while to remember the thinking, which should have been written in a comment. But I think I have it now.
What the current scheme does is represent a block in the center of the X-axis, -5e9…+5e9. Those numbers have 10 decimal digits, plus we allow 6 fractional digits, so that is in good harmony with the ~16-decimal-digit precision of 64-bit floats. Thus we have a block of ~2**53 possible values that can be represented, preserving order, in our 14-hex-digit (7-byte) representation.
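If I'm reading it right, the scheme amounts to the following sketch: shift the number into a non-negative integer offset and print it as fixed-width hex, so lexical order of the strings matches numeric order of the inputs. Names here are hypothetical and the real generate() differs in details (e.g. truncation vs. rounding); the 1e6 multiplier is the 6-fractional-digit setting, which reproduces the 2386F26FC10000 endpoint seen in the thread.

```java
public class CenterBlock {
    static final double RANGE = 5_000_000_000.0; // the +/- 5e9 block
    static final double SCALE = 1_000_000.0;     // 6 fractional digits

    // Shift into [0, 1e16] and hex-encode at fixed width so that lexical
    // comparison of the strings is equivalent to numeric comparison.
    static String comparable(double d) {
        long offset = Math.round((d + RANGE) * SCALE);
        return String.format("%014X", offset);
    }

    public static void main(String[] args) {
        System.out.println(comparable(-RANGE)); // "00000000000000", the smallest
        System.out.println(comparable(35.0));
        System.out.println(comparable(RANGE));  // "2386F26FC10000", the largest
    }
}
```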
You can see the problem with the numbers in CityLots: They not only have too many digits of precision (17) but, as my example above shows, if there are more than 6 fractional digits, they don't fit into the ComparableNumber constraints. Let's call these "out-of-bounds" numbers.
For out-of-bounds numbers, I think the only sensible thing to do is to treat them as strings. This will work well, because for those kinds of “unusual” numbers, e.g. crypto signatures or AWS account numbers or CityLots geometry, they are typically given precisely.
That leaves the backward-compatibility issue. In Quamina, for the out-of-bounds numbers, I think I will refuse to attempt < and > comparisons because the chances of being correct are not very good. Unless we can think of a way to represent those out-of-bounds numbers in a state machine…
I would ask: In the real world, are there instances of people doing < or > on out-of-bounds numbers? Because they are probably getting wrong answers. So either (a) they are getting wrong answers and don't notice or (b) they are reporting problems (are they?) or (c) this just isn't happening.
Here's another unit test and its output. This shows that this statement in README.md is wrong: "Numeric matching is limited to value between -5.0e9 and +5.0e9 inclusive, with 15 digits of precision, that is to say 6 digits to the right of the decimal point." +/-5.0e9 is correct, 15 digits is correct, but since 5.0e9 has 10 digits, there can only be 5 digits to the right of the decimal point. This is consistent with the data in WHEN_WildlyVaryingNumberFormsAreProvided_THEN_TheGeneratedStringsAreSortable
which all have 5 digits to the right of the decimal.
@Test
public void WHEN_SixFractionalDigits_THEN_OrderingIsLost() {
    double[] lows = {-5_000_000_000.0, -4_999_999_999.999999, -4_999_999_999.999998};
    double[] highs = {4_999_999_999.999998, 4_999_999_999.999999, 5_000_000_000.0};
    for (double low : lows) {
        String c = ComparableNumber.generate(low);
        System.out.printf("%f => %s\n", low, c);
    }
    for (double high : highs) {
        String c = ComparableNumber.generate(high);
        System.out.printf("%f => %s\n", high, c);
    }
}
Output:
-5000000000.000000 => 00000000000000
-4999999999.999999 => 00000000000000
-4999999999.999998 => 00000000000001
4999999999.999998 => 2386F26FC0FFFE
4999999999.999999 => 2386F26FC10000
5000000000.000000 => 2386F26FC10000
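For what it's worth, here's my reading of why the sixth fractional digit collapses at the edge of the range: near 5e9 adjacent doubles are about 9.5e-7 apart (one ulp is 2^-20), so the literal 4_999_999_999.999999 parses to a double one ulp below 5e9 rather than the intended 1e-6 below it, and a truncating conversion then maps both to the same offset. A small standalone demonstration, not Ruler code:

```java
public class UlpDemo {
    public static void main(String[] args) {
        // Gap between adjacent doubles near 5e9: 2^-20, just under 1e-6, so
        // a sixth fractional digit is at the edge of what a double can hold.
        double gap = Math.ulp(5.0e9);
        System.out.println(gap); // 9.5367431640625E-7

        // This literal parses to the nearest double, which is one ulp
        // (~9.5e-7) below 5e9, not the intended 1e-6 below it:
        double d = 4_999_999_999.999999;
        double diff = 5_000_000_000.0 - d;
        System.out.println(diff);                // 9.5367431640625E-7
        System.out.println((long) (diff * 1e6)); // truncates to 0: collision
    }
}
```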
Will update the doc when I send my PR.
For out-of-bounds numbers, I think the only sensible thing to do is to treat them as strings
I'm giving myself another day to give this a go. If I can't make sufficient progress, I'll roll out a version of ruler where we throw errors to grab real-world data in shadow mode. Until then the next release will be marked as experimental.
btw @timbray have you considered generating a longer string from the ComparableNumber.generate() equivalent in Quamina for these higher-precision numbers? I'm still learning this part of the code, so maybe I've got this all wrong, but if these numbers are rare, there should be minimal real-world impact if we use more memory to support them.
Yes. Even within the current 14-digit limit, notice that the biggest ComparableNumber we generate is 2386F26FC10000 - so we could increase the range by a factor of up to ~6, say to +/-20B, and still retain 5 digits right of the decimal. I don't think we could add another past-the-decimal digit without reducing the range, down to something like +/-2.5B. But I have a nice little test-bench set up now, and if you want more precise data about any proposed changes, I can get it quickly.
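To put rough numbers on that headroom (my arithmetic, not from the code): 14 hex digits can hold 16^14 = 2^56 distinct strings, while the biggest value currently generated, 2386F26FC10000, is exactly 10^16, so the encoding itself has roughly 7x unused capacity — consistent with the factor-of-~6 estimate above once you keep a little margin.

```java
public class Headroom {
    public static void main(String[] args) {
        // 14 hex digits can represent 16^14 = 2^56 distinct values.
        long capacity = 1L << 56;
        // The largest string the current scheme emits, as noted above.
        long currentTop = 0x2386F26FC10000L;
        System.out.println(currentTop);                      // 10000000000000000 (= 10^16)
        System.out.println((double) capacity / currentTop);  // ~7.2x unused headroom
    }
}
```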
My honest opinion is that the current setup strikes a good balance. But you'd know better than me. I know you can't disclose internal details, but when I left Amazon, Ruler was being used by several services and the number-one gripe was memory usage; I can't remember anyone having pain with the numeric range. But if you say you see demand for more numeric range or precision, I'm prepared to believe it's based on experience.
But at the moment I'd vote to leave it at +/-5B with 5 fractional digits, improve the documentation, and treat out-of-bounds numbers as strings.
As I think about this, I'm capturing what I understand in a long comment at the top of Q's numbers.go. Here's the current state.
// You can't easily build automata to compare numbers based on either the decimal notation found
// in text data or the internal floating-point bits. Therefore, for a restricted subset of numbers,
// we define a 7-byte (14 hex digit) representation that facilitates building automata to support
// equality and ordering comparison.
//
// The representation supports 10**15 numbers. The first three are:
// decimal: -5_000_000_000, -4_999_999_999.99999, -4_999_999_999.99998, ...
// 14-hex: 00000000000000, 00000000000009, 00000000000014
// and the last three are
// decimal: .., 4_999_999_999.99998, 4_999_999_999.99999, 5_000_000_000
// 14-hex: 2386F26FC0FFEC, 2386F26FC0FFF6, 2386F26FC10000
//
// In English: all numbers that are between negative and positive 5 billion inclusive, with up to five
// digits after the decimal point.
// These numbers have fifteen decimal digits of precision, which is what double floats can offer.
// They include most numbers that are used in practice, including prices, occurrence counts, size
// measurements, and so on.
// Examples of numbers that do NOT meet these criteria include AWS account numbers, some telephone
// numbers, and cryptographic keys/signatures. For these, treatment as strings seems to produce
// satisfactory results for equality testing.
After changes in ComparableNumber.generate(...), the following two tests pass:
@Test
public void WHEN_SixFractionalDigits_THEN_OrderingIsNotLost() {
    double[] lows = {-5_000_000_000.0, -4_999_999_999.999999, -4_999_999_999.999998};
    double[] highs = {4_999_999_999.999998, 4_999_999_999.999999, 5_000_000_000.0};
    for (int i = 1; i < lows.length; i++) {
        String s0 = ComparableNumber.generate(lows[i - 1]);
        String s1 = ComparableNumber.generate(lows[i]);
        System.out.println("i=" + i + " s0:" + s0 + " s1:" + s1);
        assertTrue(s0.compareTo(s1) < 0);
    }
    for (int i = 1; i < highs.length; i++) {
        String s0 = ComparableNumber.generate(highs[i - 1]);
        String s1 = ComparableNumber.generate(highs[i]);
        System.out.println("i=" + i + " s0:" + s0 + " s1:" + s1);
        assertTrue(s0.compareTo(s1) < 0);
    }
}
@Test
public void WHEN_PrecisionIsIgnored_THEN_BugsEnsue() {
    String badRule = "{\"x\": [ 37.807807921694092 ] }";
    String[] varying = {
        "37.807807921694092",
        "37.80780792169409",
        "37.8078079216940",
        "37.807807921694",
        "37.80780792169",
        "37.8078079216",
        "37.807807921",
        "37.80780792",
        "37.8078079",
        "37.807807",
        "37.80780",
    };
    for (String number : varying) {
        String event = String.format("{\"x\": %s}", number);
        Machine mm = new Machine();
        try {
            mm.addRule("R", badRule);
            List<String> matches = mm.rulesForJSONEvent(event);
            System.out.printf("For %s, %d matches\n", number, matches.size());
        } catch (Exception e) {
            System.out.println("Ouch! " + e.toString() + " = " + event);
        }
    }
    String goodRule = "{\"x\": [ 37.80780 ] }";
    for (String number : varying) {
        String event = String.format("{\"x\": %s}", number);
        Machine mm = new Machine();
        try {
            mm.addRule("R", goodRule);
            List<String> matches = mm.rulesForJSONEvent(event);
            System.out.printf("For %s, %d matches\n", number, matches.size());
        } catch (Exception e) {
            System.out.println("Ouch! " + e.toString() + " = " + event);
        }
    }
}
now outputs
For 37.807807921694092, 1 matches
For 37.80780792169409, 0 matches
For 37.8078079216940, 0 matches
For 37.807807921694, 0 matches
For 37.80780792169, 0 matches
For 37.8078079216, 0 matches
For 37.807807921, 0 matches
For 37.80780792, 0 matches
For 37.8078079, 0 matches
For 37.807807, 0 matches
For 37.80780, 0 matches
For 37.807807921694092, 0 matches
For 37.80780792169409, 0 matches
For 37.8078079216940, 0 matches
For 37.807807921694, 0 matches
For 37.80780792169, 0 matches
For 37.8078079216, 0 matches
For 37.807807921, 0 matches
For 37.80780792, 0 matches
For 37.8078079, 0 matches
For 37.807807, 0 matches
For 37.80780, 1 matches
Process finished with exit code 0
Testing across more edge-cases: large numbers, very small numbers, and so on.
Will be interested to see the changes to generate()
Minor changes for now, but I'm surprised there aren't floating-point accuracy discrepancies at the 6th decimal when we multiply or divide.
static String generate(final double d) {
    if (d < -Constants.FIVE_BILLION || d > Constants.FIVE_BILLION) {
        throw new IllegalArgumentException("Value must be between " + -Constants.FIVE_BILLION +
                " and " + Constants.FIVE_BILLION + ", inclusive");
    }
    if (d != Math.round(d * TEN_E_SIX) / TEN_E_SIX) {
        throw new IllegalArgumentException("Only values up to 6 decimals are supported.");
    }
    return toHexStringSkippingFirstByte(Math.round(TEN_E_SIX * Constants.FIVE_BILLION) + Math.round(TEN_E_SIX * d));
}
Yup. Fails at -4_999_999_999.999994 and -4_999_999_999.999993. Let me try parsing the string before the number gets turned into a double (though I'm not sure how straightforward it'll be for octals and hex literals).
I was expecting something like that, as you were. Ummm… I would be a bit dubious about investing much more time into this, the current design feels like a good compromise to me.
The idea that does interest me was the use of something other than hex digits in ComparableNumber (Quamina too). There are several “baseX” encodings for various values of X that are way more space-efficient.
Oh, I'm using this as an excuse to understand this code and then explore moving to base64. Not being familiar with this part of the code, I didn't feel I had learned enough to go there yet.
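One wrinkle worth noting for that follow-up: the standard Base64 alphabet is not in ascending byte order, so standard-encoded strings wouldn't compare lexically the way the hex strings do; an order-preserving variant needs an alphabet sorted by byte value. A hypothetical sketch (the alphabet choice is illustrative, not from either codebase); 10 such characters hold 64^10 = 2^60 values, versus 2^56 for 14 hex digits, so it's both shorter and roomier.

```java
public class OrderedBase64 {
    // 64 printable ASCII characters in strictly ascending byte order, so
    // lexical comparison of encoded strings matches numeric comparison.
    // (Standard Base64's alphabet is NOT byte-ordered, so it wouldn't work.)
    static final char[] ALPHABET =
        "0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmno".toCharArray();

    // Encode a non-negative offset as a fixed-width base-64 string.
    static String encode(long offset, int width) {
        char[] out = new char[width];
        for (int i = width - 1; i >= 0; i--) {
            out[i] = ALPHABET[(int) (offset & 63)]; // low 6 bits per character
            offset >>>= 6;
        }
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(encode(0L, 10));                       // "0000000000"
        System.out.println(encode(10_000_000_000_000_000L, 10));  // the current 1e16 top, in 10 chars
        System.out.println(encode(5L, 10).compareTo(encode(6L, 10)) < 0); // true: order preserved
    }
}
```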
Another thing I learned along the way: most of the time we deserialize to a double from a string. By moving the string parsing inside ComparableNumber, we don't see a performance hit in our benchmarks even if I use BigDecimals: https://github.com/aws/event-ruler/blob/imprecision/src/main/software/amazon/event/ruler/ComparableNumber.java#L57
Reading citylots2
Read 213068 events
EXACT events/sec: 210333.7
WILDCARD events/sec: 154173.7
PREFIX events/sec: 204283.8
PREFIX_EQUALS_IGNORE_CASE_RULES events/sec: 213709.1
SUFFIX events/sec: 196376.0
SUFFIX_EQUALS_IGNORE_CASE_RULES events/sec: 211167.5
EQUALS_IGNORE_CASE events/sec: 177704.8
NUMERIC events/sec: 126524.9
ANYTHING-BUT events/sec: 136319.9
ANYTHING-BUT-IGNORE-CASE events/sec: 131767.5
ANYTHING-BUT-PREFIX events/sec: 137729.8
ANYTHING-BUT-SUFFIX events/sec: 136319.9
ANYTHING-BUT-WILDCARD events/sec: 139625.2
COMPLEX_ARRAYS events/sec: 30166.8
PARTIAL_COMBO events/sec: 37400.0
COMBO events/sec: 15778.1
Reading citylots2
Read 213068 events
EXACT events/sec: 183205.5
WILDCARD events/sec: 149731.6
PREFIX events/sec: 195834.6
PREFIX_EQUALS_IGNORE_CASE_RULES events/sec: 212008.0
SUFFIX events/sec: 198572.2
SUFFIX_EQUALS_IGNORE_CASE_RULES events/sec: 196376.0
EQUALS_IGNORE_CASE events/sec: 184954.9
NUMERIC events/sec: 124674.1
ANYTHING-BUT events/sec: 125481.7
ANYTHING-BUT-IGNORE-CASE events/sec: 122947.5
ANYTHING-BUT-PREFIX events/sec: 139900.2
ANYTHING-BUT-SUFFIX events/sec: 138266.1
ANYTHING-BUT-WILDCARD events/sec: 142711.3
COMPLEX_ARRAYS events/sec: 33100.5
PARTIAL_COMBO events/sec: 50754.6
COMBO events/sec: 20495.2
Wow, good performance numbers too!
I was skeptical of the performance numbers as a quirk of running on my laptop, but it seems to run well in the sandboxed environment as well.
Running software.amazon.event.ruler.Benchmarks
Reading citylots2
Read 213068 events
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 11812
Events/sec: 18038.3
Rules/sec: 126267.9
Reading citylots2
Read 213068 events
EXACT events/sec: 266668.3
WILDCARD events/sec: 183679.3
PREFIX events/sec: 259522.5
PREFIX_EQUALS_IGNORE_CASE_RULES events/sec: 257328.5
SUFFIX events/sec: 260793.1
SUFFIX_EQUALS_IGNORE_CASE_RULES events/sec: 267002.5
EQUALS_IGNORE_CASE events/sec: 235955.7
NUMERIC events/sec: 145837.1
ANYTHING-BUT events/sec: 141950.7
ANYTHING-BUT-IGNORE-CASE events/sec: 143383.6
ANYTHING-BUT-PREFIX events/sec: 153839.7
ANYTHING-BUT-SUFFIX events/sec: 152409.2
ANYTHING-BUT-WILDCARD events/sec: 162275.7
COMPLEX_ARRAYS events/sec: 32529.5
PARTIAL_COMBO events/sec: 50204.5
COMBO events/sec: 20232.5
I've made one controversial choice within https://github.com/aws/event-ruler/pull/166: to fall back to the old way for hexadecimal numbers, primarily because it used to work that way before and it's tied to the way we convert these numbers to double. It's an extreme edge case, so I think we'll be okay, but once the library change is merged, it'll be easier to confirm under shadow mode.
Thanks for creating this one @timbray. Was the perfect excuse to learn enough so that I can now play with higher radix numbers in a follow-up change. Closing this one for now.
In fact, the precision of ComparableNumber is about 9 decimal digits. Ignoring this produces results that I think are wrong. In Quamina, I changed the equivalent of ComparableNumber.generate() to refuse to generate if there are too many digits. This has the effect that super-long numbers like those in CityLots all get matched as strings, which produces correct results.