FasterXML / jackson-core

Core part of Jackson that defines Streaming API as well as basic shared abstractions
Apache License 2.0
2.26k stars 792 forks source link

BigInteger parsing in NumberInput can only support '1e10' format when fast parser mode is not used #986

Closed pjfanning closed 1 year ago

pjfanning commented 1 year ago

NumberInput.parseBigInteger uses BigDecimalParser when the string is large (over 1250 chars). BigDecimalParser can parse nums in '1e10' format and you can convert the BigDecimal to a BigInteger.

When the string size is less than 1250 chars, new BigInteger is used. This will fail for strings in '1e10' format. BigInteger does not support e notation directly.

If you enable FastDoubleParser support, we use its JavaBigIntegerParser regardless of string size. This rejects '1e10' format in order to match the behaviour of BigInteger class.

We should decide whether we want to ban '1e10' format for BigInteger in all cases (which is probably the case in jackson v2.14 and earlier) or if we want to press on and support '1e10' format in all cases. Might be better to ban the format - meaning we could remove the new StreamReadConstraints support for big int scale control.

pjfanning commented 1 year ago

@cowtowncoder apologies to be a pain but on further investigation, maybe we should rework the BigInteger support and not allow the parsing of any string that contains an 'e'/'E'.

cowtowncoder commented 1 year ago

I am not sure I see why supporting e-format directly for BigIntegers should be necessary: handling of JSON Integers and Floating-point numbers is quite clearly isolated. Former does not accept decimal-point or exponents. This does not mean BigInteger could not be bound but that is separate coercion.

Conversely I do agree that we should prevent parsing in case of NumberInput.parseBigInteger. I don't think it would ever be used for JSON Number case -- since that pre-validates basically numbers; one with engineering notation takes JsonToken.VALUE_NUMBER_FLOAT route and never tries to parse as BigInteger. But there may be case(s) when coercing (at databind level) from JSON String value into BigInteger? If so, that case perhaps should use a Regex to.... hmmh. I was about to say to fail, but from end user POV it should probably rather decode as BigDecimal, then coerce. At the same time... I am not sure how much effort I want to add for supporting misguided usage of "numbers as String"? (... counterpoint: due to Javascript stupidity, this is not a fringe case)

pjfanning commented 1 year ago

The issue is that we inadvertently started supporting '1e10' format for BigInteger when I introduced a performance enhancement that allows numbers with strings of more than 1250 chars to be first parsed as BigDecimalParser (which has enhancements to speed up parsing of large strings). But BigDecimal allows '1e10' format. The recent issues we've had we toBigInteger stem from this perf change. The '1e10' format support was never intended - it was an unfortunate side effect of the perf change.

cowtowncoder commented 1 year ago

@pjfanning I am also thinking that since limit of 1250 characters is used on using BigDecimalParser -- basically, never, not just by defaults but since few would increase limit I think -- we could simplify codebase by just completely removing that use case? Not a big win but still.

cowtowncoder commented 1 year ago

@pjfanning Understood. So how about just eliminating this code path now, before release?

pjfanning commented 1 year ago

The change that led to this is https://github.com/FasterXML/jackson-core/pull/826 - my preference is not so much to remove this but that if we have more than 1250 chars, that we also then check for 'e' or 'E' in the string and fail if they appear - before parsing as BigDecimal. Is this ok?

I have definitely found that using BigDecimal parsing for large strings is better than using new BigInteger. I know that 1250 > 1000 (the default limit for chars in numbers). It could be argued that the 1250 optimisation is no longer that useful.

So I'm not dead set against removing this code path altogether - just I have a preference to keep it but add the extra check to disallow 'e' notation for BigIntegers.

cowtowncoder commented 1 year ago

@pjfanning I am ok with validation too if you think this optimization is still valid and potentially useful. We could also lower the threshold if there's evidence to support speed up occurs at shorter lengths too, but I am guessing you did some testing to reach this value (or maybe saw a reference suggesting cut-off point)

pjfanning commented 1 year ago

If we go as far as removing the 1250 code path, do we also remove the validateBigIntegerScale code too? It seems less useful if we do remove the 150 code path.

cowtowncoder commented 1 year ago

@pjfanning No I think that check is needed for pretty much everything, where 1e99999 is decoded as FP number but then coerced into BigInteger due to target value in POJO is BigInteger. 1250 code path is not really being used by most of tests and reproduction did not rely on it.

pjfanning commented 1 year ago

With https://github.com/FasterXML/jackson-core/pull/987, scientific notation ('1e10' format) is no longer supported.

pjfanning commented 1 year ago

Users who need faster Biginteger parsing in jackson 21.5, you should consider https://github.com/FasterXML/jackson-core/pull/851