Closed devinrsmith closed 6 months ago
Might make sense, esp. if performance benchmark shows meaningful improvement from avoided allocations (or rather, gc time). Challenges:
String
(which is probably why there's no char[]
taking alternative), so might end up being no-win for default caseThen again, if and when we are moving to default to FastDoubleParser, it could be a win. Would def want to know something about improvement tho.
Anyway, +1 for experimentation.
Yeah. At least in the case of int/long, the JDK has CharSequence
parsing support:
static int parseStringAsInt(JsonParser parser) throws IOException {
if (parser.hasTextCharacters()) {
final int len = parser.getTextLength();
final CharSequence cs = CharBuffer.wrap(parser.getTextCharacters(), parser.getTextOffset(), len);
return Integer.parseInt(cs, 0, len, 10);
} else {
return Integer.parseInt(parser.getText());
}
}
A bit surprising the JDK is missing similar methods for parseFloat
/ parseDouble
.
Would a PR adding these new methods to NumberInput
be welcome, or would you need to see performance numbers first? I'm not planning to have anything in the library call them; although, I suspect there may be opportunities to improve JsonParser#getValueAsDouble
and related in these regards.
int/long is sort of less relevant as we implement optimized decoding ourselves and do not rely on JDK (for cases that matter -- parsing int/long from String is not on critical path).
And wrapping char[]
in CharSequence
is unlikely to have much if any performance benefits: might be slower than constructing String
s.
As to adding methods... I think performance numbers would be good. Can be stand-along micro-benchmark, no need to be integrated via parser (altho obv integrated one would be the ultimate thing). It is too easy to try to add speculative improvements that have no real effect.
The reason I am slightly skeptical is that the real heavy part for double
and float
is decimal-to-binary decoding of actual numeric value; it's much more expensive than int
/long
f.ex.
Oh btw, forgot to say -- given that you do have high-performance streaming use case, this is probably quite relevant. So just to make sure: I am not against improvements like this. I just want to emphasize verifying improvements; in this case showing for your actual use case (replica thereof) would be absolutely awesome.
I actually have -- interestingly enough -- need for fast float
parsing for Vector DB use case. So this is something I could quite likely benefit from as well, now that I think about it :)
You can use https://github.com/wrandelshofer/FastDoubleParser directly - It has support for Char Array, Char Sequence and String.
We can update NumberInput too but I'd prefer if that was done with perf tests that prove that it improves Jackson performance.
I'd prefer not to delay the 2.17 release so any Jackson changes in this area are likely not to be released until 2.18 or beyond.
One thing I may have missed -- thank you @pjfanning for sort of hinting at -- NumberInput
is primarily meant to be used by Jackson itself. It is not meant as utility for code outside Jackson.
So improvements need to be something that JsonParser
implementations use.
And yeah, definitely none of changes would go in 2.17 but 2.18 or later.
I understand there is nuance in assuming "public" means "public API". Looking through some related comments though, https://github.com/rallyhealth/weePickle/issues/117#issuecomment-1350169393 @pjfanning seems to suggest that NumberInput
is part of the "public API"?
I'm very familiar with https://github.com/wrandelshofer/FastDoubleParser (we use it as part of https://github.com/deephaven/deephaven-csv), it's great :D. Good point though, I can use it directly in the meantime.
I understand there is nuance in assuming "public" means "public API". Looking through some related comments though, rallyhealth/weePickle#117 (comment) @pjfanning seems to suggest that
NumberInput
is part of the "public API"?I'm very familiar with https://github.com/wrandelshofer/FastDoubleParser (we use it as part of https://github.com/deephaven/deephaven-csv), it's great :D. Good point though, I can use it directly in the meantime.
The methods that we have in NumberInput - we can't stop you using them. We are not going to add new methods to NumberInput unless they are useful for the internal Jackson code.
Exactly, what @pjfanning said: primary user is Jackson itself, and anything that benefits it is potentially good idea. Making it more useful for other use cases is not a priority and could be considered counter-productive (due to added code to maintain, or to limit possibility of changes to API, implementation).
So, I don't consider NumberInput
to be part of Public API.
And even more importantly: embedded/shaded FastDoubleParser
is definitely not to be used by anyone directly (but acceptable via NumberInput
I guess).
Thanks for the additional context.
Ultimately, I think I'll be making a case for additional methods on JsonParser
, something like:
/**
* Assumes {@link JsonToken#VALUE_STRING}, parses the string as a <b>float</b>.
*/
public float getStringAsFloat() throws IOException;
// ...
public double getStringAsDouble() throws IOException;
// ...
public BigInteger getStringAsBigInteger() throws IOException;
// ...
public BigDecimal getStringAsBigDecimal() throws IOException;
I would assume that getValueAsFloat
/getValueAsDouble
ID_STRING
cases would delegate to respective impl.
Edit:
Probably getStringAsInt
and getStringAsLong
as well.
There are already getValueAsXxx()
for int
, long
, double
, fwtw. So naming should be followed.
Other than that I am quite ambivalent on whether supporting such use is good or not (from bigger picture perspective). But can definitely be discussed as part of proposal.
I've got some of my own quantitative numbers motivating this. I'm essentially seeing an uplift from 14mm doubles / second -> 20mm doubles / second when using ch.randelshofer.fastdoubleparser.JavaDoubleParser
from getText
to getTextCharacters
, and an order of magnitude less GC (count, total reclaimed, total GC time). This is parsing through an array of a billion strings that are Math.random() doubles. Java 21, -Xmx1g -XX:CompileCommand=inline,java/lang/String.charAt
(of course, this is on my machine with some overhead from our engine, and only representative of this particularly setup, etc):
(Sorry if you saw my previous comment which I deleted, it was incorrect.)
I think there's an opportunity to also improve the performance of JsonParser#getDoubleValue
, which seems like it always goes through String
instead of char[]
? Testing out an array of doubles (as opposed to an array of double strings), the performance matches up with my slower getText
results from above.
Essentially, I'm in a position where the JsonParser API lets me parse double strings faster than doubles, which is an odd position to be in. (Unless I'm mistaken and there is a way for me to get char[]
from VALUE_NUMBER_FLOAT
contexts.)
@devinrsmith I am 100% in favor of ensuring that getDoubleValue()
on JsonToken.VALUE_NUMBER_FLOAT
is optimal. So if and when we are missing method in NumberInput
to support that, I'd be in favor of adding a method or methods as necessary.
(for 2.18 branch once we get there)
Ok. This is awkward: I was assuming getDoubleValue()
and getFloatValue()
would invoke efficient methods in FastDoubleParser
(assuming fast processing enabled). But due to some refactoring in 2.16 or 2.15, this is no longer the case: I think deferred processing was applied too widely and as a result a String
is constructed proactively and used from thereon. I don't think that is necessary: fast method should be called from ParserBase._parseSlowFloat()
via TextBuffer _textBuffer
methods contentsAsFloat()
and contentsAsDouble()
.
(and looks like TextBuffer.contentsAsFloat()
needs to be rewritten to actually use buffered content, if any).
This should be fixed, but timeline is such that it's too late for 2.17 -- after rc1 I think there's too much risk for edge case handling to go awry (deferred number handling was added for a reason and although I hope unit test coverage is sufficient I am not confident). So I'll mark this as 2.18 thing to look into.
With #1230 now merged, we could tackle this issue.
But looking at ParserBase
, logic is rather muddy: deferral of decoding increase has essentially lead ALL FP access to go via intermediate String _numberString
which is now accurate but sub-optimal.
Conceptually, there shouldn't be a problem with more eager access: if and when JsonParser.getDoubleValue()
is called, for example, value can now be materialized as double
-- we do have other accessors that do not force evaluation, specifically JsonParser.getNumberValueDeferred()
, used for buffering.
But the trick is untangle methods the way they are done currently...
Re-created as #1284 to address performance issue; closing this one since I do not think new methods are needed, just better implementations of existing ones.
@devinrsmith Now that #1284 is complete in 2.18
, getDoubleValue()
in 2.18.0-SNAPSHOT should avoid String
allocation: and my initial testing does support this, there is a nice boost.
I also uncovered a bug in #1230 when testing and not first seeing speed up.
So... if it was possible for you to test 2.18.0-SNAPSHOT (from Sonatype snapshot repo or build locally), that'd be great.
@cowtowncoder Looks great, thanks! Quick test reading 100mm doubles out of an array, it looks like 2.18.0-SNAPSHOT parses doubles without extraneous allocation; went from ~6GiB to <100MiB.
@devinrsmith Excellent! Thank you for super fast follow-up.
I've got a high performance streaming use case where I'm handling some of the lower-level details. For example, translating a
VALUE_STRING
into adouble
. I'm relying onNumberInput
with the appropriateuseFastParser
config, and would like to avoid extraneousString
allocation when possible. TheBigDecimal
case is currently the most complete, as it works withJsonParser#hasTextCharacters
:The other similar cases aren't able to take advantage since the underlying
char[]
parsing methods aren't exposed:It would be great to add
char[]
versionsNumberInput#parseFloat
,NumberInput#parseDouble
, andNumberInput#parseBigInteger
.