Add char[] versions for NumberInput parseFloat, parseDouble, parseBigInteger

devinrsmith commented 8 months ago

I've got a high performance streaming use case where I'm handling some of the lower-level details. For example, translating a VALUE_STRING into a double. I'm relying on NumberInput with the appropriate useFastParser config, and would like to avoid extraneous String allocation when possible. The BigDecimal case is currently the most complete, as it works with JsonParser#hasTextCharacters:

static BigDecimal parseStringAsBigDecimal(JsonParser parser) throws IOException {
    final boolean useFastParser = parser.isEnabled(StreamReadFeature.USE_FAST_BIG_NUMBER_PARSER);
    return parser.hasTextCharacters()
            ? NumberInput.parseBigDecimal(parser.getTextCharacters(), parser.getTextOffset(), parser.getTextLength(), useFastParser)
            : NumberInput.parseBigDecimal(parser.getText(), useFastParser);
}

The other similar cases aren't able to take advantage since the underlying char[] parsing methods aren't exposed:

static float parseStringAsFloat(JsonParser parser) throws IOException {
    // TODO: improve when parser.hasTextCharacters()
    return NumberInput.parseFloat(parser.getText(), parser.isEnabled(StreamReadFeature.USE_FAST_DOUBLE_PARSER));
}

static double parseStringAsDouble(JsonParser parser) throws IOException {
    // TODO: improve when parser.hasTextCharacters()
    return NumberInput.parseDouble(parser.getText(), parser.isEnabled(StreamReadFeature.USE_FAST_DOUBLE_PARSER));
}

static BigInteger parseStringAsBigInteger(JsonParser parser) throws IOException {
    // TODO: improve when parser.hasTextCharacters()
    return NumberInput.parseBigInteger(parser.getText(), parser.isEnabled(StreamReadFeature.USE_FAST_BIG_NUMBER_PARSER));
}

It would be great to add char[] versions NumberInput#parseFloat, NumberInput#parseDouble, and NumberInput#parseBigInteger.

cowtowncoder commented 8 months ago

Might make sense, esp. if performance benchmark shows meaningful improvement from avoided allocations (or rather, gc time). Challenges:

JDK implementation only takes String (which is probably why there's no char[] taking alternative), so might end up being no-win for default case
Number of permutations wrt fast/JDK parser (number of methods we have with overloads)

Then again, if and when we are moving to default to FastDoubleParser, it could be a win. Would def want to know something about improvement tho.

Anyway, +1 for experimentation.

devinrsmith commented 8 months ago

Yeah. At least in the case of int/long, the JDK has CharSequence parsing support:

static int parseStringAsInt(JsonParser parser) throws IOException {
    if (parser.hasTextCharacters()) {
        final int len = parser.getTextLength();
        final CharSequence cs = CharBuffer.wrap(parser.getTextCharacters(), parser.getTextOffset(), len);
        return Integer.parseInt(cs, 0, len, 10);
    } else {
        return Integer.parseInt(parser.getText());
    }
}

A bit surprising the JDK is missing similar methods for parseFloat / parseDouble.

Would a PR adding these new methods to NumberInput be welcome, or would you need to see performance numbers first? I'm not planning to have anything in the library call them; although, I suspect there may be opportunities to improve JsonParser#getValueAsDouble and related in these regards.

cowtowncoder commented 8 months ago

int/long is sort of less relevant as we implement optimized decoding ourselves and do not rely on JDK (for cases that matter -- parsing int/long from String is not on critical path). And wrapping char[] in CharSequence is unlikely to have much if any performance benefits: might be slower than constructing Strings.

As to adding methods... I think performance numbers would be good. Can be stand-along micro-benchmark, no need to be integrated via parser (altho obv integrated one would be the ultimate thing). It is too easy to try to add speculative improvements that have no real effect.

The reason I am slightly skeptical is that the real heavy part for double and float is decimal-to-binary decoding of actual numeric value; it's much more expensive than int/long f.ex.

cowtowncoder commented 8 months ago

Oh btw, forgot to say -- given that you do have high-performance streaming use case, this is probably quite relevant. So just to make sure: I am not against improvements like this. I just want to emphasize verifying improvements; in this case showing for your actual use case (replica thereof) would be absolutely awesome.

I actually have -- interestingly enough -- need for fast float parsing for Vector DB use case. So this is something I could quite likely benefit from as well, now that I think about it :)

pjfanning commented 8 months ago

You can use https://github.com/wrandelshofer/FastDoubleParser directly - It has support for Char Array, Char Sequence and String.

We can update NumberInput too but I'd prefer if that was done with perf tests that prove that it improves Jackson performance.

I'd prefer not to delay the 2.17 release so any Jackson changes in this area are likely not to be released until 2.18 or beyond.

cowtowncoder commented 8 months ago

One thing I may have missed -- thank you @pjfanning for sort of hinting at -- NumberInput is primarily meant to be used by Jackson itself. It is not meant as utility for code outside Jackson. So improvements need to be something that JsonParser implementations use.

And yeah, definitely none of changes would go in 2.17 but 2.18 or later.

devinrsmith commented 8 months ago

I understand there is nuance in assuming "public" means "public API". Looking through some related comments though, https://github.com/rallyhealth/weePickle/issues/117#issuecomment-1350169393 @pjfanning seems to suggest that NumberInput is part of the "public API"?

I'm very familiar with https://github.com/wrandelshofer/FastDoubleParser (we use it as part of https://github.com/deephaven/deephaven-csv), it's great :D. Good point though, I can use it directly in the meantime.

pjfanning commented 8 months ago

I understand there is nuance in assuming "public" means "public API". Looking through some related comments though, rallyhealth/weePickle#117 (comment) @pjfanning seems to suggest that NumberInput is part of the "public API"?

I'm very familiar with https://github.com/wrandelshofer/FastDoubleParser (we use it as part of https://github.com/deephaven/deephaven-csv), it's great :D. Good point though, I can use it directly in the meantime.

The methods that we have in NumberInput - we can't stop you using them. We are not going to add new methods to NumberInput unless they are useful for the internal Jackson code.

cowtowncoder commented 8 months ago

Exactly, what @pjfanning said: primary user is Jackson itself, and anything that benefits it is potentially good idea. Making it more useful for other use cases is not a priority and could be considered counter-productive (due to added code to maintain, or to limit possibility of changes to API, implementation).

So, I don't consider NumberInput to be part of Public API.

And even more importantly: embedded/shaded FastDoubleParser is definitely not to be used by anyone directly (but acceptable via NumberInput I guess).

devinrsmith commented 8 months ago

Thanks for the additional context.

Ultimately, I think I'll be making a case for additional methods on JsonParser, something like:

/**
 * Assumes {@link JsonToken#VALUE_STRING}, parses the string as a <b>float</b>.
 */
public float getStringAsFloat() throws IOException;

// ...
public double getStringAsDouble() throws IOException;

// ...
public BigInteger getStringAsBigInteger() throws IOException;

// ...
public BigDecimal getStringAsBigDecimal() throws IOException;

I would assume that getValueAsFloat/getValueAsDouble ID_STRING cases would delegate to respective impl.

Edit: Probably getStringAsInt and getStringAsLong as well.

cowtowncoder commented 8 months ago

There are already getValueAsXxx() for int, long, double, fwtw. So naming should be followed. Other than that I am quite ambivalent on whether supporting such use is good or not (from bigger picture perspective). But can definitely be discussed as part of proposal.

devinrsmith commented 8 months ago

I've got some of my own quantitative numbers motivating this. I'm essentially seeing an uplift from 14mm doubles / second -> 20mm doubles / second when using ch.randelshofer.fastdoubleparser.JavaDoubleParser from getText to getTextCharacters, and an order of magnitude less GC (count, total reclaimed, total GC time). This is parsing through an array of a billion strings that are Math.random() doubles. Java 21, -Xmx1g -XX:CompileCommand=inline,java/lang/String.charAt (of course, this is on my machine with some overhead from our engine, and only representative of this particularly setup, etc):

getText

getTextCharacters

devinrsmith commented 8 months ago

(Sorry if you saw my previous comment which I deleted, it was incorrect.)

I think there's an opportunity to also improve the performance of JsonParser#getDoubleValue, which seems like it always goes through String instead of char[]? Testing out an array of doubles (as opposed to an array of double strings), the performance matches up with my slower getText results from above.

Essentially, I'm in a position where the JsonParser API lets me parse double strings faster than doubles, which is an odd position to be in. (Unless I'm mistaken and there is a way for me to get char[] from VALUE_NUMBER_FLOAT contexts.)

cowtowncoder commented 8 months ago

@devinrsmith I am 100% in favor of ensuring that getDoubleValue() on JsonToken.VALUE_NUMBER_FLOAT is optimal. So if and when we are missing method in NumberInput to support that, I'd be in favor of adding a method or methods as necessary.

(for 2.18 branch once we get there)

cowtowncoder commented 8 months ago

Ok. This is awkward: I was assuming getDoubleValue() and getFloatValue() would invoke efficient methods in FastDoubleParser (assuming fast processing enabled). But due to some refactoring in 2.16 or 2.15, this is no longer the case: I think deferred processing was applied too widely and as a result a String is constructed proactively and used from thereon. I don't think that is necessary: fast method should be called from ParserBase._parseSlowFloat() via TextBuffer _textBuffer methods contentsAsFloat() and contentsAsDouble(). (and looks like TextBuffer.contentsAsFloat() needs to be rewritten to actually use buffered content, if any).

This should be fixed, but timeline is such that it's too late for 2.17 -- after rc1 I think there's too much risk for edge case handling to go awry (deferred number handling was added for a reason and although I hope unit test coverage is sufficient I am not confident). So I'll mark this as 2.18 thing to look into.

cowtowncoder commented 6 months ago

With #1230 now merged, we could tackle this issue.

But looking at ParserBase, logic is rather muddy: deferral of decoding increase has essentially lead ALL FP access to go via intermediate String _numberString which is now accurate but sub-optimal.

Conceptually, there shouldn't be a problem with more eager access: if and when JsonParser.getDoubleValue() is called, for example, value can now be materialized as double -- we do have other accessors that do not force evaluation, specifically JsonParser.getNumberValueDeferred(), used for buffering. But the trick is untangle methods the way they are done currently...

cowtowncoder commented 6 months ago

Re-created as #1284 to address performance issue; closing this one since I do not think new methods are needed, just better implementations of existing ones.

cowtowncoder commented 4 months ago

@devinrsmith Now that #1284 is complete in 2.18, getDoubleValue() in 2.18.0-SNAPSHOT should avoid String allocation: and my initial testing does support this, there is a nice boost. I also uncovered a bug in #1230 when testing and not first seeing speed up.

So... if it was possible for you to test 2.18.0-SNAPSHOT (from Sonatype snapshot repo or build locally), that'd be great.

devinrsmith commented 4 months ago

@cowtowncoder Looks great, thanks! Quick test reading 100mm doubles out of an array, it looks like 2.18.0-SNAPSHOT parses doubles without extraneous allocation; went from ~6GiB to <100MiB. Screenshot from 2024-06-26 19-18-28

cowtowncoder commented 4 months ago

@devinrsmith Excellent! Thank you for super fast follow-up.

FasterXML / jackson-core

Add char[] versions for NumberInput parseFloat, parseDouble, parseBigInteger #1229