Faster division by 1000

xtonik commented 9 months ago

Division by 1000 was replaced with more performant custom method using multiply and add. I suppose that this tiny method will be inlined, otherwise the code must be copied to places where is called.

Another similar simple optimization was also done - multiplication of int by constant 0.80 in integer (exactly in long) domain.

The validity is verified by attached tests. They are provided as a comparison to original implementation and thus they don't need to be executed again and again, so they are marked as a ignored not to cause running tests slowly - on my PC they take around 10 s.

No behavior changes were done, performance improvement only.

pjfanning commented 9 months ago

@xtonik thanks for the changes but this code would need to be sigificantly faster in order to justify the extra complexity - at least for me.

cowtowncoder commented 9 months ago

@xtonik Is there some article or reference showing the idea behind technique? I can believe that using multiplication and shifting can be faster but.. how much? Is it worth the change?

Maybe you have links to other projects doing this as optimization? (I assume it's a well-known tehnique)

cowtowncoder commented 9 months ago

I think I am open to change, provided there's something published about technique (ideally it's a well-known technique that I wasn't familiar with). Let's drop 80% calculation as it's not worth optimizing.

It'd be good to have jmh-based micro-benchmark as well; maybe I'll modify

https://github.com/cowtowncoder/misc-micro-benchmarks

with it.

cowtowncoder commented 9 months ago

Ok. So a quick test with misc-microbenchmarks gives these numbers with JDK 17:

Benchmark                      Mode  Cnt  Score   Error  Units
DivBy1000.perfHandOptimized   thrpt    5  5.022 ± 0.037  ops/s
DivBy1000.perfSimpleDivision  thrpt    5  3.566 ± 0.029  ops/s

that is, +50% throughput for operation itself. With JDK 8 bit less; and for some reason I don't quite know, no speedup with JDK 21 (maybe the "unoptimized" case is actually optimized by JDK under the hood).

Whether this translates into measurable improvement for number serialization is open question; I guess it shouldn't hurt.

Also: I modified unit test a bit by moving String concatenation out of busy loop and test over full int range takes only about 2 seconds. I'll probably tweak it a bit to have automated, always run part, and separate manually run full test.

cowtowncoder commented 9 months ago

@xtonik I think we can make this happen, and although it's hard to know how meaningful for real use improvements are, change is still relatively small and tested (I modified unit test to run through 1/7 of full int range and leave on for CI).

But one thing before I can merge this is that if we don't have a CLA for you, we'd need one (just needs to be sent once before the first contribution). It's this one:

https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf

and usually easiest to print, fill & sign, scan/photo; then email cla at fasterxml dot com. Once I get it I can proceed with last clean up and merge.

Looking forward to getting this merged, thank you in advance!

xtonik commented 9 months ago

@xtonik I think we can make this happen, and although it's hard to know how meaningful for real use improvements are, change is still relatively small and tested (I modified unit test to run through 1/7 of full int range and leave on for CI).

But one thing before I can merge this is that if we don't have a CLA for you, we'd need one (just needs to be sent once before the first contribution). It's this one:

https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf

and usually easiest to print, fill & sign, scan/photo; then email cla at fasterxml dot com. Once I get it I can proceed with last clean up and merge.

Looking forward to getting this merged, thank you in advance!

Ok, I will subscribe it.

xtonik commented 9 months ago

Ok. So a quick test with misc-microbenchmarks gives these numbers with JDK 17:
Benchmark                      Mode  Cnt  Score   Error  Units
DivBy1000.perfHandOptimized   thrpt    5  5.022 ± 0.037  ops/s
DivBy1000.perfSimpleDivision  thrpt    5  3.566 ± 0.029  ops/s
that is, +50% throughput for operation itself. With JDK 8 bit less; and for some reason I don't quite know, no speedup with JDK 21 (maybe the "unoptimized" case is actually optimized by JDK under the hood).

Whether this translates into measurable improvement for number serialization is open question; I guess it shouldn't hurt.

Also: I modified unit test a bit by moving String concatenation out of busy loop and test over full int range takes only about 2 seconds. I'll probably tweak it a bit to have automated, always run part, and separate manually run full test.

As you said, the compiler for new Java versions does more optimizations than older ones. The same situation (not remembering from which version) is probably for String.replace(String, String), when given string is one character long, then it is replaced by String.replace(char, char). This is what is about one of my prepared PR.

xtonik commented 9 months ago

To get more familiar where the magic numbers came from, there are registered sequences A346495 and A346496 for this.

xtonik commented 8 months ago

I have just sent my CLA to given email as requested.

If someone is interested and have some time, I have prepared another changes. As they are tiny, it takes couple of minutes to check if some of them is valuable to create PRs.

FasterXML / jackson-core

Faster division by 1000 #1203