abolz / Drachennest

Different algorithms for converting binary to decimal floating-point numbers
Boost Software License 1.0
116 stars 7 forks

You might be interested in Grisu-Exact #2

Open jk-jeon opened 4 years ago

jk-jeon commented 4 years ago

Hello! I'm also the author of yet another floating-point formatting algorithm, which I call Grisu-Exact (https://github.com/jk-jeon/Grisu-Exact). It originally started as a variant of Grisu but eventually evolved into something more than that.

It seems your implementations are super fast and I'm very impressed. In particular, I didn't know there was already an algorithm with the shortest and correct-rounding guarantees that is nevertheless faster than Grisu-Exact (Schubfach! I have no idea why Ryu got much more attention when Schubfach already existed). I've run a dtoa benchmark of your implementation of Schubfach and Ulf's implementation of Ryu together with my implementation of Grisu-Exact, with the following results.

(Benchmark images: digits_benchmark_binary64_with_schubfach, uniform_benchmark_binary64_with_schubfach)

(The benchmark might not be fair in the sense that Ryu's and Grisu-Exact's string outputs are always in scientific format, while yours seems to generate the shorter of the fixed-point and scientific formats, though I'm not sure how much that matters. My string generation routine is actually a copy of that of Ryu.)

So Schubfach is indeed faster than Grisu-Exact!

I would be very happy if you take a look at my algorithm and possibly add it into your benchmark.

I'm curious about several things:

  1. Is there any key idea that makes your Grisu2 that crazy fast? Or is it the result of lots of small but cute tricks?
  2. Does your implementation of Schubfach do something more than what the paper explains (in terms of performance optimizations), or does it pretty much just follow the paper in a straightforward manner?

Thanks!

abolz commented 4 years ago

This looks great! Thanks for letting me know!

Actually, this repository started out as an implementation of the Grisu2 algorithm. I have since tried to modify Grisu2 so that it always produces the shortest and correctly rounded output, but I failed miserably... Great to know someone else succeeded! =) So now this repository is intended to collect some other implementations and provide some benchmarks.

I will definitely have a look at your implementation! And I will definitely add Grisu-Exact to the benchmarks!

That being said, as you might have noticed, the bench directory currently does not contain the actual code I used to produce the benchmarks... (And the Grisu2 implementation I used for the benchmarks is a slightly modified version of the one in the src directory...) I'll try to find some time this weekend to work on this, so that anyone else can verify the results and compare the different implementations, and to get the code up to date.

  1. My Grisu2 implementation (the one used for the benchmarks) differs from the "usual" implementations in so far that the digits are generated as an unsigned integer instead of as ASCII digits. It then uses the same formatting procedure as the Ryu and Schubfach implementations. Another difference is the digit generation loop: it tries to generate 4, 3, 2, or 1 digits per iteration. (I need to take a closer look at your implementation, but it seems that you're doing something very similar?) And it uses a slightly larger table to do so.
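For illustration, here is a minimal sketch of the table-driven idea, emitting two digits per iteration from a 200-character table. This is not the actual Drachennest code, which extends the idea to up to four digits per iteration with a larger table:

```cpp
#include <cstdint>
#include <cstring>

// Lookup table holding the two-character strings "00".."99".
static constexpr char kDigits100[] =
    "0001020304050607080910111213141516171819"
    "2021222324252627282930313233343536373839"
    "4041424344454647484950515253545556575859"
    "6061626364656667686970717273747576777879"
    "8081828384858687888990919293949596979899";

// Writes the decimal digits of n into buf (no null terminator) and returns
// a pointer just past the last digit. Two digits are produced per iteration
// via the table, halving the number of divisions compared to the naive
// one-digit-at-a-time loop.
inline char* WriteDecimal(char* buf, std::uint64_t n) {
    char tmp[20];  // a 64-bit value has at most 20 decimal digits
    char* p = tmp + sizeof(tmp);
    while (n >= 100) {
        const std::uint64_t q = n / 100;
        const unsigned r = static_cast<unsigned>(n - q * 100);
        p -= 2;
        std::memcpy(p, &kDigits100[2 * r], 2);
        n = q;
    }
    if (n >= 10) {
        p -= 2;
        std::memcpy(p, &kDigits100[2 * n], 2);
    } else {
        *--p = static_cast<char>('0' + static_cast<unsigned>(n));
    }
    const std::size_t len = static_cast<std::size_t>(tmp + sizeof(tmp) - p);
    std::memcpy(buf, p, len);
    return buf + len;
}
```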

  2. Schubfach indeed is amazing! The Schubfach implementation pretty much follows the paper. I'm using different constants, though, and slightly different approximations for the logarithms. The different constants enable some minor optimizations compared to the Java implementation in the paper. (Warning: all tests pass, but I don't have a complete proof that the different constants always work. The same is actually true for the Ryu implementation: using slightly different constants allows merging two code paths in Ryu.)

  3. The formatting procedure is actually cheating a little bit :-) I have chosen an output format specifically to print small integers as quickly as possible. So, for small integers, the digit generation procedure might output integers which are divisible by 10. The digit generation procedure then does not need to remove these trailing zeros, which the formatting procedure would otherwise need to add back. Another optimization makes use of the buffer size: using a somewhat larger buffer removes the need to call memcpy with a variable length. So all the dtoa implementations currently require an output buffer with a length >= 64, and up to 64 characters might actually be (temporarily) written to that buffer.

I have found that header-only implementations usually result in slower code, even if they are only used in a single .cpp file to generate the dtoa implementation. I'll try to update the implementations this weekend and provide the non-header-only version of Grisu2 which I used for the benchmark.

I would be very happy if you could provide a PR with your Grisu-Exact implementation, which uses the same formatting procedures as the Ryu and Schubfach implementations.

Thanks again for letting me know! Great work!

PS: I have found that always generating the output in scientific format doesn't really make much of a difference. So I think that your benchmarks are fair. But because my formatting procedure is cheating a little bit, it is quite possible that your Grisu-Exact implementation is actually somewhat faster than Schubfach =)

jk-jeon commented 4 years ago

I would be very happy if you could provide a PR with your Grisu-Exact implementation, which uses the same formatting procedures as the Ryu and Schubfach implementations.

No problem. It should not take too long to modify my implementation, except that I've never submitted a PR before and I'm not familiar with how things work, so I may need some trial and error.

Also, I'm actually planning to work on a new implementation of (maybe a variant of) Grisu2, based on lots of optimization tricks I came up with while working on Grisu-Exact. It is a low-priority thing for me right now, though. It would be interesting to compare it with your Grisu2 implementation if I somehow get to do it :)

FYI: there is another very fast Grisu2 implementation here: https://erthink.github.io/dtoa-benchmark/ I didn't compare it with mine, but according to what the author advertises, its performance seems comparable to that of Grisu-Exact. So I think your implementation should be faster than this one, but it might be worth running a benchmark against yours.

That being said, as you might have noticed, the bench directory currently does not contain the actual code I used to produce the benchmarks... (And the Grisu2 implementation I used for the benchmarks is a slightly modified version of the one in the src directory...) I'll try to find some time this weekend to work on this, so that anyone else can verify the results and compare the different implementations, and to get the code up to date.

So maybe it is better for me to wait for you to work this out before making a PR?

My Grisu2 implementation (the one used for the benchmarks) differs from the "usual" implementations in so far that the digits are generated as an unsigned integer instead of as ASCII digits. It then uses the same formatting procedure as the Ryu and Schubfach implementations. Another difference is the digit generation loop: it tries to generate 4, 3, 2, or 1 digits per iteration. (I need to take a closer look at your implementation, but it seems that you're doing something very similar?) And it uses a slightly larger table to do so.

I skimmed over your implementation of Grisu2 and wow, the main idea looks quite similar indeed. Probably the biggest difference is that I chose alpha and gamma to be very close to each other in order to confine the search range of kappa (which, of course, is unfortunately the reason why I need a huge cache table). Thanks to that, I could completely remove the loop, because the maximum number of required comparisons is small (5, for the case of double). Also, because of this, all integer divisions can be eliminated, as all the divisors are now compile-time constants.

Another big difference is that I chose alpha and gamma to be fairly large in order to work mostly with integers rather than fractions. But it is interesting that with small alpha and gamma it is still possible to implement a trick (that 4, 3, 2, 1 thing) similar to what I did. Probably I need to look at yours more closely; it might be possible to find some ideas that could be helpful to me.

Schubfach indeed is amazing! The Schubfach implementation pretty much follows the paper. I'm using different constants, though, and slightly different approximations for the logarithms. The different constants enable some minor optimizations compared to the Java implementation in the paper. (Warning: all tests pass, but I don't have a complete proof that the different constants always work. The same is actually true for the Ryu implementation: using slightly different constants allows merging two code paths in Ryu.)

Let me ask you to clarify a bit more. So, by constants, do you mean the constants used for the approximations of the logarithms, or something else?

The formatting procedure is actually cheating a little bit :-) I have chosen an output format specifically to print small integers as quickly as possible. So, for small integers, the digit generation procedure might output integers which are divisible by 10. The digit generation procedure then does not need to remove these trailing zeros, which the formatting procedure would otherwise need to add back. Another optimization makes use of the buffer size: using a somewhat larger buffer removes the need to call memcpy with a variable length. So all the dtoa implementations currently require an output buffer with a length >= 64, and up to 64 characters might actually be (temporarily) written to that buffer.

I honestly didn't clearly understand this. Could you elaborate more?

Also, I think by "formatting procedure" you mean the generation of a human-readable string, and by "digit generation procedure" you mean finding a decimal representation of the floating-point number. Am I correct? (I guess these terms perhaps come from the fact that, once upon a time, your implementation of Grisu2 actually did generate some sort of human-readable string directly, and an additional "formatting procedure" then polished it?)

I have found that the header-only implementations usually result in slower code. Even if only used in a single .cpp file to generate the dtoa implementation. I'll try to update the implementations this weekend and provide the non-header-only version of Grisu2 which I have used for the benchmark.

That sounds weird to me. I think the header-only version should usually be faster (except when it isn't), because the compiler has more information for optimization.

Thanks for having a look at my work, and thanks for your great work as well!

abolz commented 4 years ago

The formatting procedure is actually cheating a little bit :-) I have chosen an output format specifically to print small integers as quickly as possible. So, for small integers, the digit generation procedure might output integers which are divisible by 10. The digit generation procedure then does not need to remove these trailing zeros, which the formatting procedure would otherwise need to add back. Another optimization makes use of the buffer size: using a somewhat larger buffer removes the need to call memcpy with a variable length. So all the dtoa implementations currently require an output buffer with a length >= 64, and up to 64 characters might actually be (temporarily) written to that buffer.

I honestly didn't clearly understand this. Could you elaborate more?

Yeah, "digit generation procedure" is a very bad name for the core routine which converts from binary floating-point to decimal floating-point. Ryu, Schubfach, and Grisu-Exact all compute a decimal floating-point number of the form significand * 10^exponent and output the two integers significand and exponent. The "formatting procedure" then takes the pair (significand, exponent) as input and converts it to the final output string, in either fixed or scientific form. (I hope this clarifies the unfortunate terminology a bit...)

Now, the algorithms detect if the input is a small integer in the range [1, 2^53], and in this case (the "fast path") simply return the significand and an exponent = 0. The significand here might contain trailing (decimal) zeros, though. E.g., for the input "1e+3" the output without the fast path would be (significand, exponent) = (1, 3). With this optimization the output will be (1000, 0). This of course violates the "shortest output" condition, but that is fine as long as the "formatting procedure" prints all these small integers in fixed-point notation. Otherwise one would need to remove these trailing zeros in the "fast path", making the "fast path" somewhat slower.
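A minimal sketch of such a fast path (hypothetical names, not the actual code from this repository; real implementations typically detect the integer case from the exponent bits rather than via floor):

```cpp
#include <cstdint>
#include <cmath>

// Decimal representation: value == significand * 10^exponent.
struct DecimalFP {
    std::uint64_t significand;
    int exponent;
};

// Fast path: if value is an integer in [1, 2^53], return it directly as
// (value, 0). The significand may then contain trailing decimal zeros --
// e.g. 1e+3 becomes (1000, 0) rather than the shortest form (1, 3) --
// which is fine as long as such values are printed in fixed-point form.
inline bool SmallIntFastPath(double value, DecimalFP* out) {
    if (value < 1.0 || value > 9007199254740992.0 /* 2^53 */) return false;
    if (std::floor(value) != value) return false;  // not an integer
    out->significand = static_cast<std::uint64_t>(value);
    out->exponent = 0;
    return true;
}
```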

The other thing is that the "formatting procedure" sometimes needs to insert a decimal point somewhere between the digits. Using a somewhat larger buffer, this functionality can be implemented slightly more efficiently.
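For example, inserting the decimal point with fixed-size copies might look like this (a sketch under the assumption that the caller's buffer always has enough slack after the digits, which the >= 64 byte buffer requirement provides):

```cpp
#include <cstring>

// Rewrites "12345..." as "1.2345..." by shifting the tail one byte to the
// right. Both memcpy calls have a compile-time-constant size, so they
// compile to straight-line loads/stores instead of a variable-length copy;
// this is only safe because the caller guarantees at least 17 writable
// bytes starting at digits + 1.
inline void InsertDecimalPoint(char* digits) {
    char tmp[16];
    std::memcpy(tmp, digits + 1, 16);  // fixed-size read, may cover slack bytes
    std::memcpy(digits + 2, tmp, 16);  // fixed-size write into the slack
    digits[1] = '.';
}
```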

Does this make more sense now?

jk-jeon commented 4 years ago

Yup. Thanks for the explanation. It does not really look like cheating to me, though. If I really had to call something cheating, I would say the digit generation procedure is cheating a bit rather than the formatting procedure :)

abolz commented 4 years ago

Let me ask you to clarify a bit more. So, by constants, do you mean the constants used for the approximations of the logarithms, or something else?

Yes, the constants used for the approximations of the logarithms. But also (and probably more importantly) the cached powers-of-ten. As you know, these are approximated in the form 10^k ~= f * 10^e, where 2^bits <= f < 2^(bits+1) is an integer. The difference is how f is obtained. Write 10^k == x * 10^e, where x is a real number 2^bits <= x < 2^(bits+1). The paper then uses f = floor(x) + 1 (Ryu also uses f = floor(x) if k < 0 - or if k > 0 - I don't remember right now...). My implementation always uses f = ceil(x).
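For concreteness, here is a sketch of how such a ceil-rounded cache entry could be computed (written with a binary exponent, 10^k ~= f * 2^e, and bits = 63; illustrative code only, not the table generator actually used here, and `unsigned __int128` is a GCC/Clang extension):

```cpp
#include <cstdint>

// A cached power of ten: 10^k ~= f * 2^e with 2^63 <= f < 2^64.
struct CachedPower {
    std::uint64_t f;  // ceil-rounded 64-bit significand
    int e;            // binary exponent
};

// Computes f = ceil(10^k / 2^e) for 0 <= k <= 38, so that 10^k fits in
// 128 bits. A real implementation precomputes a table covering negative
// k as well (which requires big-integer arithmetic).
inline CachedPower ComputePowerOfTen(int k) {
    unsigned __int128 p = 1;
    for (int i = 0; i < k; ++i) p *= 10;
    unsigned __int128 q = p;
    int e = 0;
    // Shift down until the value fits in 64 bits (discards low bits)...
    while (q >> 64) { q >>= 1; ++e; }
    // ...or, for small powers, shift up until bit 63 is set (exact).
    while (!(static_cast<std::uint64_t>(q) >> 63)) { q <<= 1; --e; }
    std::uint64_t f = static_cast<std::uint64_t>(q);
    // Ceil rounding: if any discarded low bit of 10^k was set, round up.
    if (e > 0 && (p & ((static_cast<unsigned __int128>(1) << e) - 1)) != 0) ++f;
    return {f, e};
}
```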

This allows merging two different code paths in the Ryu implementation.

For the Schubfach implementation this is used only because I once wanted to reuse the cache from the Ryu implementation. But I should probably use the constants as defined in Schubfach paper, since the implementations currently are completely independent.

The problem is: I do not have a complete proof that it is actually always allowed to use f = ceil(x) ;-)

I just quickly skimmed through your Grisu-Exact paper and, of course, you know the details already. Proving that ceiling always works should be a (minor?) modification to the proof in section 4 of your paper?

abolz commented 4 years ago

That being said, as you might have noticed, the bench directory currently does not contain the actual code I used to produce the benchmarks... (And the Grisu2 implementation I used for the benchmarks is a slightly modified version of the one in the src directory...) I'll try to find some time this weekend to work on this, so that anyone else can verify the results and compare the different implementations, and to get the code up to date.

So maybe it is better for me to wait for you to work this out before making a PR?

I just started to import your Grisu-Exact version. I have removed some C++20 constructs, to get the code running. I have some failing tests, though. This is because Grisu-Exact seems to generate a decimal significand with 18 decimal digits in some cases. (I need to double-check that my modifications are correct...)

Here is a number which generates 18 decimal digits (using my modified implementation): 1.4411518807585586e+17.

Can you verify this? Or am I doing something wrong?

EDIT:

@jk-jeon Nevermind... Grisu-Exact is working fine... was my mistake.

jk-jeon commented 4 years ago

I just started to import your Grisu-Exact version. I have removed some C++20 constructs, to get the code running.

Oh god, I just realized that std::remove_cvref_t<T> is a C++20 construct. I'll replace it with std::remove_cv_t<std::remove_reference_t<T>>. Is there anything else?

As you know, these are approximated in the form 10^k ~= f * 10^e, where 2^bits <= f < 2^(bits+1) is an integer.

I think you mean 10^k ~= f * 2^e.

The problem is: I do not have a complete proof that it is actually always allowed to use f = ceil(x) ;-)

Could you elaborate on how you can possibly merge two paths by doing that? (Frankly, I did not look at Ryu so carefully except for the min-max Euclid algorithm which is also relevant to mine, so this might be a dumb question; please forgive me if it is😏)

My first impression is that it would not be a surprise if f = ceil(x) failed. Anyway, I can imagine an algorithm to verify the validity by applying the min-max Euclid algorithm.

Since your x is always a non-integer (except for some trivial cases), floor and ceil should differ by 1. Hence, the case e2>=0 is never a problem. (Lemma 3.3 in Ulf's paper; Lemma 4.1 in my paper). So we only need to look at the case e2<0. (Lemma 3.4 in Ulf's paper; Lemma 4.2 in my paper). Now, following the notation of Lemma 3.4 in Ulf's paper, let us write

w * floor(5^(-e2-q)/2^k) = s * 2^(q-k) + t

for some 0 <= t < 2^(q-k). (I believe q=floor(-e2 * log5(2)) in the paper is a typo; I think q=floor(-e2 * log10(5)) is the correct one.)

Now, by construction, the maximum possible value of w is (maybe) 2^55-2. Hence, we need to check that the remainder t never exceeds 2^(q-k) - 2^55 + 1 as w varies from 0 to 2^55-2. The min-max Euclid algorithm can be used here. Technically speaking, the algorithm itself only works when floor(5^(-e2-q)/2^k) is an odd number, but that is not a big problem because we can factor out all powers of 2 from floor(5^(-e2-q)/2^k) before running the algorithm and adjust the bound on t accordingly. A proper modification of the min-max Euclid algorithm should also give us a way to compute a counterexample, if any. Then we can directly test whether the whole algorithm still works for those counterexamples. (But I think it will be more difficult to come up with a way to compute all counterexamples, so even if the test passes, that would not prove the whole algorithm correct if there is at least one counterexample.)
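At toy scale, this kind of check — that the remainder stays below a given bound — can be run by brute force (illustrative only; the real parameters, with w up to 2^55-2 and 128-bit multipliers, are far too large for this, which is exactly where the min-max Euclid algorithm comes in):

```cpp
#include <cstdint>

// Brute-force maximum of t = (w * M) mod 2^b over w = 0..w_max, for b < 64.
// Comparing this maximum against the required bound verifies the condition
// for a given (M, b) pair at toy scale.
inline std::uint64_t MaxRemainderBruteForce(std::uint64_t M, int b,
                                            std::uint64_t w_max) {
    const std::uint64_t mask = (std::uint64_t{1} << b) - 1;
    std::uint64_t max_t = 0;
    for (std::uint64_t w = 0; w <= w_max; ++w) {
        const std::uint64_t t = (w * M) & mask;
        if (t > max_t) max_t = t;
    }
    return max_t;
}
```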

@jk-jeon Nevermind... Grisu-Exact is working fine... was my mistake.

Phew! It was a close call!

abolz commented 4 years ago

First of all, sorry for the delay, but I have been quite busy the last weeks...

Could you elaborate on how you can possibly merge two paths by doing that? (Frankly, I did not look at Ryu so carefully except for the min-max Euclid algorithm which is also relevant to mine, so this might be a dumb question; please forgive me if it is😏)

That's not a dumb question at all. I had in mind that adjusting the constants was necessary to merge the two code paths (there are two paths in the upstream implementation where similar products of the form x * 5^m / 2^n are computed, one for e2 >= 0 and one for e2 < 0). But I was wrong. The constants now match the ones defined in the Ryu paper, and I have a much better feeling now ;-) So thanks for asking. (Maybe it didn't matter because I was using 128 bits for the constants instead of the minimum of 125 bits... But I don't know.)

I have found a bug in the Schubfach implementation (in previously unused code, though) and have committed a few other minor improvements.

I will add a Grisu-Exact benchmark as soon as possible. Here are some preliminary results. The benchmark only measures the binary to decimal conversion (no to-string conversion, no formatting) for numbers of the form N * 10^E, where N is an integer with len(N) decimal digits.

Grisu-Exact: (benchmark image: grisu_exact_dec_64)

Ryu: (benchmark image: ryu_dec_64)

Grisu-Exact and Ryu look quite similar in this particular benchmark. Grisu-Exact looks slightly better, though, because it has a smaller worst-case runtime than Ryu and seems to be slightly faster for small integers. (The benchmark has been run without the "small integer optimization" we talked about earlier.)

Thanks again for letting me know! Great work!

jk-jeon commented 4 years ago

Grisu-Exact and Ryu look quite similar in this particular benchmark. Grisu-Exact looks slightly better, though, because it has a smaller worst-case runtime than Ryu and seems to be slightly faster for small integers. (The benchmark has been run without the "small integer optimization" we talked about earlier.)

It seems Ryu performs better for the long-digit cases, though? Given your picture, it seems Ryu might have better performance on uniformly random data.

I made some improvements over the last several weeks, so please update to the most recent version.

I've also run a new benchmark against your recent Schubfach implementation. This time I used your to-string implementation, so I guess it is more fair than the previous benchmark. Here is the result:

Uniformly random data (binary32): (benchmark image: uniform_binary32)

Random data with a given number of digits (binary32): (benchmark image: digit_binary32)

Uniformly random data (binary64): (benchmark image: uniform_binary64)

Random data with a given number of digits (binary64): (benchmark image: digit_binary64)

So, for binary32, your Schubfach clearly wins. For binary64, Grisu-Exact seems to perform better for some small-digit cases, but overall Schubfach wins.

Schubfach is great, and your work is great!

jk-jeon commented 4 years ago

Wow, you implemented Dragonbox! Your optimization is really impressive; I didn't expect it could be made 5-10ns faster! Yeah, I remember you said your digit formatting routine is a bit of a cheat, but it's still great :) Thanks for working on it.

Small question: it seems you didn't implement the optimized divisions by 1000 and 100. Did you run a benchmark and conclude that they aren't worth implementing, or did you skip them for other reasons?

abolz commented 4 years ago

Wow, you implemented Dragonbox!

Well, it is still your implementation. I have just removed a bit of code :-) And possibly introduced some bugs ;-)

it seems you didn't implement optimized divisions by 1000 and 100

Any decent compiler already optimizes divisions by compile-time constants quite well. And I wanted the code to be as clean as possible, because I haven't even read your paper yet ;-) I do not have any actual benchmarks.

Now Dragonbox really is amazing! Formatting double-precision floating-point numbers in JavaScript-style in ~15-25ns is just... wow! Thanks for your great work!!

I ran out of time last weekend, but I hope to find some more time next weekend to open a PR in your repository with the optimized formatting procedure. (I understand that you're focused on the binary->decimal part, but I think it would still be a nice addition to your library.)

jk-jeon commented 4 years ago

Any decent compiler already optimizes divisions by compile-time constants quite well. And I wanted the code to be as clean as possible, because I haven't even read your paper yet ;-) I do not have any actual benchmarks.

In general it is impossible to implement division by 1000 as one multiplication plus one shift, but in our case it is possible because we are not using the full range of 64-bit integers. Compilers these days can't figure out such a thing.
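A 32-bit-scale illustration of this idea (the constants here are for this sketch only, not the ones used in Dragonbox; the same idea applies at 64 bits with a 128-bit product when the value range is known to be limited):

```cpp
#include <cstdint>

// Division by 1000 with a single multiply and shift, valid for all 32-bit
// n: with m = ceil(2^38 / 1000) = 274877907 we have m*1000 - 2^38 = 56,
// and the error term n*56 stays below 2^38 for all n < ~4.9e9, so the
// floor comes out exact. For the full 64-bit range no single
// (multiplier, shift) pair works, which is why a compiler that only knows
// the type -- not the value range -- cannot apply this on its own.
inline std::uint32_t DivBy1000(std::uint32_t n) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(n) * 274877907u) >> 38);
}
```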

For the case of division by 100, what I've done is to compute the quotient and check divisibility with one multiplication (rather than two, which is the usual cost if we compute both the quotient and the remainder). This was also possible because we are only using a very limited range of numbers. I'm pretty sure compilers can never do that. But this also introduces some branches and some logical operations, so it might end up being slower. When I first came up with this optimization I got some nice speed-up, but I don't know if that is still the case for the final code we have now.
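A sketch of this idea in the style of the fast divisibility test of Lemire, Kaser, and Kurz ("Faster remainders when the divisor is a constant") — an assumed stand-in for the actual Dragonbox code, with the constant chosen for this sketch:

```cpp
#include <cstdint>

struct DivResult100 {
    std::uint32_t quotient;
    bool divisible;
};

// One 64x64 -> 128 multiplication with c = ceil(2^64 / 100) yields both
// the quotient (high half of the product) and a divisibility-by-100 test
// (low half < c), instead of the usual multiply-for-quotient plus
// multiply-for-remainder pair. Valid for all 32-bit n since
// 100 * 2^32 < 2^64. `unsigned __int128` is a GCC/Clang extension.
inline DivResult100 DivMod100(std::uint32_t n) {
    constexpr std::uint64_t c = 184467440737095517u;  // ceil(2^64 / 100)
    const unsigned __int128 p = static_cast<unsigned __int128>(c) * n;
    return {static_cast<std::uint32_t>(p >> 64),
            static_cast<std::uint64_t>(p) < c};
}
```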

Note that on modern x86 machines 32-bit multiplication is indeed very fast (though still slower than addition, for example), but that's only the case for what I call half multiplication in my paper, that is, computing only the lower 32 bits and discarding the high 32 bits. But in order to perform the division it is usually necessary to perform a full multiplication, that is, to compute the high 32 bits as well, so it ends up being a 64-bit multiplication, which is horribly slow. So I believe manual optimization of the division will certainly be beneficial here. What I'm not sure about is whether what I've done is optimal. It could be that checking the divisibility by just performing a plain 32-bit multiplication is not a big deal after all.

I ran out of time the last weekend, but I hope to find some more time the next weekend to open a PR in your repository with the optimized formatting-procedure. (I understand that you're focused on the binary->decimal part, but I think it would still be a nice addition to your library.)

Such a PR would be great!!

If you have time, I'd like you to consider contributing to the optimization of fmt's floating-point formatting routine. fmt has adopted Dragonbox recently, and it currently cuts out trailing zeros at the binary-to-decimal conversion stage using my method, which seems to be a lot slower than your method ;) fmt's string generation procedure is intertwined with lots of other higher-level formatting stuff, so it should be harder than just writing your own, but it would be a valuable contribution.

abolz commented 4 years ago

In general it is impossible to implement division by 1000 as one multiplication plus one shift, but in our case it is possible because we are not using the full range of 64-bit integers. Compilers these days can't figure out such a thing.

You're right. The usual technique uses two shifts and one multiplication, so your technique might indeed be a little bit faster.

For the case of division-by-100, ...

I've tried your approach, but it seems to be slightly slower (at least in my slightly modified implementation). Since we're working with 32-bit integers here, the standard method only uses one 64-bit and one 32-bit multiplication. Both operations are fast on 64-bit platforms. I have not tested this, but I suspect your method would be (much) faster on 32-bit platforms.

Such a PR would be great!!

This is still on my to-do list, but it may take a while. If you have the time, you could just copy the FormatDigits implementation from schubfach.cpp into your to_chars implementation. It is probably easier for you to do this than for me, since you certainly know your code better ;-)

jk-jeon commented 4 years ago

I've tried your approach, but it seems to be slightly slower (at least in my slightly modified implementation). Since we're working with 32-bit integers here, the standard method only uses one 64-bit and one 32-bit multiplication. Both operations are fast on 64-bit platforms.

Well, that's interesting, because my impression is that 64-bit multiplication is indeed slow, to the extent that the overall performance is mainly determined by the number of 64-bit multiplications the whole procedure performs. If I recall correctly, it was indeed about 2 times slower than its 32-bit counterpart on modern x86-64 machines. The thing is, we can at least replace the 64-bit multiplication with a 32-bit one because the range of numbers is very limited (which is something the compiler has no idea about), and I believe that will certainly boost the performance. What I was not sure about is whether it is worth introducing a branch and a bunch of other operations in order to eliminate yet another 32-bit multiplication for the remainder calculation.

Anyway, I was just curious whether you had measured the time difference. There is nothing wrong with your code, I believe. Thanks for answering!