stevengj opened 9 years ago
One thing I don't like about decNumber is that it seems to use pass-by-reference for everything, which seems silly for passing and returning 32- and 64-bit values. It may not make much performance difference (since the operations themselves are so expensive), but will make the Julia glue code significantly uglier.
Well, you've got the source; does the ICU license allow you to modify that? I don't think there is much danger of getting out of sync with the decNumber code... there don't seem to have been any changes in years (which, in this case, is a good thing: it is very stable and well tested). I think the issue may be that this code has been around long enough that, 20 years ago, they (like me) had to deal with pure 32-bit platforms and compilers (ILP32), with no 64-bit long long support... so for 64-bit and 128-bit values they had to use pass-by-reference. Otherwise, you might want a C wrapper around the library to deal with those, or at least to combine the things into a single structure passed in from Julia.
@stevengj I'm very willing to help out in getting decNumber to work in Julia, I was going to do it myself, but I don't yet have your mad skills in writing Julia code... I was very impressed by the way you generated all of the functions for the binding... cool!
The ICU license certainly allows the code to be modified; I wouldn't consider using it otherwise. But we are talking about a lot of functions, so I'd be hesitant to go through and change it to pass-by-value unless the process could be automated.
It's not too terrible to use pass-by-reference in Julia. For speed (to avoid heap-allocating Ref objects), I'd want to preallocate global length-1 arrays to use for the return values. (This is how e.g. the base Julia library gets its return values from the Fortran Bessel-function routines, which are inevitably pass-by-reference.)
(Yes, the code-generation features of Julia make it a lot easier to wrap these kinds of libraries, because there are zillions of functions with very similar call signatures.)
I think we'll be needing this soon, so I'll definitely be testing this package ASAP... (it seems that love of Julia is spreading quickly at the startup I'm consulting for :grinning: I think shortly some other people from the company will be contributing, some core work, improving some packages such as ODBC, and possibly adding some new packages)
"back 20 years ago they (like me) had to deal with pure 32-bit platforms"
I'm confused... the decimal types didn't exist then, did they? Are we talking about similar libraries? Yes, decimal arithmetic is not new, but the packed format is, I thought. And yes, the non-packed version is used, but I thought that came after the packed one.
I've never quite understood how both the packed and non-packed formats are supposed to be supported. Is it implementation-defined? Do you just stick with one (non-packed)? Dealing with the outside world (files or APIs using these bits types) must factor that in...
For now we need not worry about actual hardware (which uses the packed format) on PowerPC (right?), while Julia support for it is being worked on, or on SPARC.
@PallHaraldsson Back then, with the 32-bit machines, the memory model was ILP32, i.e. int, long, and pointer types were all 32-bit, and often the C compilers did not yet support structure passing... so if you wanted to return 64-bit or 128-bit types, which this package does, you had to do so by passing a reference. That is what makes the Julia wrapper more difficult, as @stevengj commented on. Both formats are part of the standard, AFAIK. The packed format is better for when you have hardware support, the "non packed" (or binary integer significand) format is better for a software library... which is the case we are dealing with... Both are useful to have available... for example if you are talking to a platform that is trying to send you data in the packed form, you might want to read it in, and convert it to the other form for calculations.
Yes, my feeling is that for the packed format we should only implement conversions, not arithmetic, so from Julia's perspective it is a storage-only format.
@ScottPJones, I programmed a lot of 32-bit machines, but I'm having trouble recalling a compiler that did not support passing or returning structs by value by the 1990s. Am I suppressing dark memories? However, it could be that they used an inefficient calling convention in that case.
My suspicion is that decNumber probably used pass-by-reference in order to be callable from older Fortran code (prior to compiler support for pass-by-value in the Fortran 2003 standard). (The Intel library has an option for this too.)
Anyway, whatever the reason, we need to deal with it to use decNumber. If one of you could just try calling it from C in a quick benchmark, e.g. adding 10000 numbers, in comparison to the Intel library, that would be great. It would be good to know what we're dealing with.
@stevengj I started programming in the early 70's... (I was in 6th grade... taking courses at the state university), so I have a LOT of dark memories I'd like to suppress! [Think stacks of punched cards, written in FORTRAN 66, and later in PL/I.] About C: it wasn't until sometime in the mid '90s that we were finally able to move to ANSI C completely, after waiting 5 years from when the standard came out for the last platform that we actively developed for (as opposed to just maintenance / bug fixing) to catch up... I used a lot of macro tricks to take advantage of ANSI C wherever I could, but that wasn't always easy. The decNumber package has very old roots... that's why I think it had to do with pre-ANSI C.
About the packed format, yes, that's exactly what I'd thought, when I was starting to look at writing a binding for it for Julia... I do also want the arbitrary precision support, not really for doing calculations, but just for parsing and storing numbers coming from JSON and later outputting them...
@stevengj I'll try to take time off from annoying Stefan about numeric literal inconsistencies and string performance issues ;-) and download the Intel library, figure out how to use it, and add up 10000 numbers!
Right now I do not have time (or 0.4..), but I was going to test this with numbers of similar scale. I saw some very high worst-case latency in a C implementation. I assume that if the scale is the same there must be some fast path (similar to DEC64; I was thinking of doing similar fast-path stuff if not). If all you are dealing with is, say, currency, that would be all that matters?
@stevengj Here are some initial results... this is with the arbitrary precision decNumber type, not even using the shorter types...
julia> @time run(dec)
(1.25 + 2.55) => 25500000001.25, 255
elapsed time: 255.405512641 seconds (9 MB allocated)
julia> @time run(dbl)
(1.250000 + 2.550000) => 25499995395.430843, 24
elapsed time: 23.543102345 seconds (1 kB allocated)
So, it is only about 10x slower than C double... (and gets correct results!) That's doing a = 1.25, b = 2.55, and then for (i=0; i<10000000000; i++) a += b;
Can you run that comparison for your Intel library? (I couldn't even measure the time for just 10000 additions!)
#include <stdio.h>
#include <time.h>

int main(int argc, char *argv[]) {
    double numa, numb;
    numa = 1.25;
    numb = 2.55;
    long ts = time(0);
    for (long i = 0; i < 10000000000L; i++) {
        numa += numb;
    }
    long te = time(0);
    printf("(%f + %f) => %f, %ld\n", 1.25, 2.55, numa, te - ts);
    return 0;
} // main
#define DECNUMDIGITS 34                          // decQuad-sized precision
#include <stdio.h>
#include <time.h>
#include "decContext.h"
#include "decNumber.h"
#include "decQuad.h"                             // for DECQUAD_String

int main(int argc, char *argv[]) {
    decNumber numa, numb;                        // working decNumbers
    decContext set;                              // working context
    char string[DECQUAD_String];                 // number->string buffer
    if (argc < 3) {                              // not enough arguments
        printf("Please supply two numbers (for a + b).\n");
        return 1;
    }
    decContextDefault(&set, DEC_INIT_DECQUAD);   // initialize
    decNumberFromString(&numa, argv[1], &set);   // get a
    decNumberFromString(&numb, argv[2], &set);   // get b
    long ts = time(0);
    for (long i = 0; i < 10000000000L; i++) {
        decNumberAdd(&numa, &numa, &numb, &set); // numa = numa + numb
    }
    long te = time(0);
    decNumberToString(&numa, string);
    printf("(%s + %s) => %s, %ld\n", argv[1], argv[2], string, te - ts);
    return 0;
} // main
Very good. But before you conclude this library is better (10x vs. 100x), the latter should also be compared with the same scales. I think that would also get closer to 10x, or maybe better (vice versa could also be tried, with more random numbers), and would probably lose the 9 MB allocation. There must be some fast path...
I actually thought about doing some fast path, by wrapping a slow decimal floating-point type. Then I got close to 1x compared to Int.
@PallHaraldsson Did I say that I thought that? I've already said that I think the Intel library might well be faster on Intel hardware (just an assumption that Intel would optimize it heavily), and so we might want to use both libraries on Intel hardware... also, decNumber supports most platforms, while the Intel library supports only Intel chips (x86-32, x86-64, ia64).

I don't know why @time is showing a difference in memory allocation; I don't see that just running these from the command line... I think it might be some quirk of using run() with @time.

I do know about doing a fast path: I implemented a fast decimal arithmetic package some 29 years ago, for 16-bit and 32-bit platforms, in assembly and later in pure C, and also ported it to 64-bit architectures. That uses 72 bits: a 64-bit signed integer with an 8-bit signed base-10 exponent. It doesn't support the IEEE decimal floating-point standard (having predated it by 22 years), and doesn't support things like +/-Inf and NaN, but it is fast.

I didn't have time for more than a quick test, and I only did so with the decNumber library, because I was already familiar with it... so my first test was the type of operation that is the most frequent in the applications that use my decimal library (summing up numbers that have the same scale). Later, time permitting, it would be very nice to have a benchmark covering different types of numbers, all the common operators, and string->decimal and decimal->string conversions, comparing both libraries along with Float32, Float64, and BigFloat... Maybe you could write that? ;-) (I'm still learning how best to benchmark stuff in Julia)
"Did I say that I thought that?" No, you didn't say the library was better, just implied it with the 10x. Since the Intel library was 100x slower, I just wanted to make sure it wasn't discounted by not looking too carefully.
https://software.intel.com/en-us/articles/optimization-notice#opt-en "[..] Optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Notice revision #20110804"
which implies to me that there is inline assembly in the library, but...
http://www.netlib.org/misc/intel/README.txt
"note that tests involving 80-bit binary floating-point values (these are only conversions to and from decimal floating-point formats) are skipped if the 80-bit floating-point data type is not supported [..] The library was tested on several different platforms (IA-32 Architecture, Intel(R) 64, and IA-64 Architecture [..] In limited situations, it is possible for incorrect compiler behavior to lead to incorrect results in the Intel(r) Decimal Floating-Point Math Library. For example, results of round-to-integer functions are incorrect when the library is built using gcc 4.2/4.3. Also, some gcc versions in an IA-32 Linux environment cause slightly incorrect results in a few corner cases for the 64-bit decimal square root. (This is not an exhaustive list.)"
Since IA-64 doesn't have SSE etc. (right?), the library might be more portable than you think. If they're not just emulating x86 there, then either they also have IA-64 assembly or it's just portable C code. I wonder if the notice is generic... Also, I thought all x86 had 80-bit floats (including IA-64, or kind of 82-bit there..)?
[Intel has used dirty tricks against AMD (for benchmarking) in the past, but it's easy enough to test whether the library works (as well) there.]
ARM (and PowerPC) support is coming to Julia, so those could also be tested; at least ARM even right now, I think (or just try to cross-compile the library).
[I'm not too worried about PowerPC; in that case, when it comes, you might even want to use the built-in instructions and not have to worry about the same library... except then you would be using the packed format.]
https://software.intel.com/en-us/articles/intel-decimal-floating-point-math-library "There are 3 CPU designs [I think (Fujitsu) SPARC64 is missing from this comment] that presently implement IEEE 754-2008. The first was the IBM System z9, in microcode. [..]
IBM's POWER6 (2007) and System z10 (2008) processors both implement IEEE 754-2008 fully in hardware and in every core. By the time they shipped there was much less uncertainty about the final standard.
If you're a "CPU geek," there's a technical paper describing the z10 hardware implementation here:
http://www.research.ibm.com/journal/abstracts/rd/531/schwarz.html
The z10 DFP implementation is very similar to POWER6. There are 54 DFP instructions implemented in hardware, and they are common to the POWER6 and z10 CPUs (and z9, for that matter). The z10 decimal floating point unit adds support for 13 decimal fixed-point instructions, but these are simply preexisting instructions traditionally important to the z CPU family that were relocated and reimplemented for z10, to improve their performance."
I can't locate my Julia code for the bit-stealing fast path... It was only meant to be a proof-of-concept and was not finished, as I would have had to have a real library bits type for the fallback. The decimal type would also not be fully conformant (it would only have half the numbers, or need some hairy code to fix that..).
I was just thinking about this as a hobby Julia-and-decimal learning experiment. I do not actually have a use for this code, and can't really see that decimal floating point (in software or hardware) is needed for currency. I was just thinking of doing decimal fixed point, scaled by 100x.
Anyway, if I'm mistaken and the range, more than two decimal places, and/or NaN and all that are needed, then I thought the fallback could take care of it.
Wouldn't the 9 MB allocation be because of pass-by-reference? I do not know about the other library, but I saw the following in the Intel one (and since it's possible there, maybe it could be added to decNumber if missing?):
http://www.netlib.org/misc/intel/README.txt "Three build options are provided, that can be set by editing LIBRARY/bid_conf.h, or (more conveniently) can be set on the compile command line.
(a) Function arguments and return values can be passed by reference if DECIMAL_CALL_BY_REFERENCE is set to 1, or by value otherwise. However, the floating-point status flags argument is passed by reference even when DECIMAL_CALL_BY_REFERENCE is 0, unless it is stored in a global variable (see (c) below)."
@PallHaraldsson I'm sorry you thought I was implying that... I hate over-generalizations from benchmark results... I showed exactly which numbers I used, how many iterations, and even included the code... I tend to be very OCD about trying to test all the possibilities... You should see my Gist for benchmarking string conversions... I made sure to test sizes from 4 characters to 4 million characters... with 6 different types of strings (all ASCII; some ANSI Latin-1; some Unicode < 0x800, i.e. 2-byte UTF-8; some Unicode < 0x10000, i.e. 3-byte UTF-8 or 2-byte UTF-16; some with Unicode > 0xffff, i.e. 4-byte UTF-8 or a UTF-16 surrogate pair; and even Unicode > 0xffff encoded incorrectly as either two 3-byte UTF-8 sequences or two 4-byte UTF-32 characters, which happens frequently with encoders that don't handle Unicode surrogate pairs), and compared the whole matrix of possible conversions... all that with 3 different versions of the code (Julia+C, pure Julia, and Julia with some tricks somebody showed me... which ended up being faster than the C code).
I'd like to do the same with decimal floating point... the things I most want to test are: string->decimal, decimal->string, decimal->integer, integer->decimal, decimal->float, float->decimal... packed decimal->decimal, decimal->packed decimal, add, subtract, multiply, divide, remainder. Also specific cases that might be optimized, such as add/subtract where the scale is the same, multiply/divide where it is by powers of 10... (I used to play a trick with that, just add or subtract from the scale...) I'm not that interested in benchmarking all of the transcendental or trig functions... I think those sorts of things are better handled with Float64 or BigFloat... (although we should also look carefully into the relative performance difference in Julia between BigFloat and arbitrary precision decimal FP)
Yes, I am a CPU geek also... I ported to the RS/6000 at IBM's request, before it was released publicly... and have been a number of times to their Austin facility to get early info under non-disclosure on their upcoming processors, when they didn't come up to Cambridge to brief the core kernel development team (as did Intel). Cool stuff!
I'm testing a new decNumber wrapper package - I think I've resolved that problem reasonably cleanly for now. To deal with the context stuff, and the pass-by-reference, I simply set up tuples of Ref{typ}, which I fill out and then use to pass the immutable values to the C code. I've also been thinking about moving the code function by function into pure Julia, or at least modifying the C code itself (which is under the ICU license) to use pass-by-value for the 32/64/128-bit types, and to pass the structure by value for the arbitrary-size type. I still need to learn how to deal with the binary dependency of building the decNumber library itself.
@stevengj I'm close to having something showable based on the decNumber library (I need to figure out how to add building the library from the C code in a package, and add good tests). I wondered if you'd mind my adding the bid <-> dpd conversion functions to DecFP.jl. The decNumber code only supports the 3 dpd formats plus an arbitrary-precision format (also using declets, and configurable to use different declet sizes; the IEEE standard formats all use 10-bit declets, i.e. 0-999), but not the bid formats, so it would be nice to have the conversion functions available. I don't know how things could be arranged so that one or the other, or both, packages could be used; for now I made the macros D32, D64, D128, for example, so as not to conflict.
Great, thanks for persisting on this!
@stevengj I've just put up a very cringe-worthy attempt at wrapping the decNumber C library (the very first package I've created, actually!), still very much a work-in-progress, but if you have some time over IAP, I'd love to hear what you think about it. My goal has shifted from just wrapping it to putting as much of it in Julia as possible, improving the code (it was really written for 32-bit processors, with C compilers that couldn't pass structures by value), and also getting the arbitrary-precision routines to work as well, but still using immutable types for better Julia performance. It needs fleshing out with promotion rules, conversion to/from the DecFP BID format from the DPD format, etc.
FWIW, I've started a pure julia implementation here: https://github.com/quinnj/DecimalNumbers.jl
I'm keeping it simple/basic for now, mainly to get stable functionality for database interop. Feel free to try it out and file any issues you see. I'll be switching ODBC over to this new package for 0.6.
Your implementation is not IEEE, but I guess for interop it is not critical to obtain optimal packing or binary compatibility?
I think having a pure Julia decimal floating point package would be great, but I do think it would be nicer to have it compatible with the IEEE BID format (using half the space for the exponent seems wasteful; IEEE BID gives you 16 digits and a wide exponent range in just 64 bits, as well as support for NaNs and +/-Inf).
Yes, my two main use-cases are interfacing with databases, as well as a 3rd-party API that provides typed data. Beyond that, only simple calculations are required, so DecimalNumbers.jl should satisfy.
Will you still have the option for using DecFP for ODBC? That's rather important for us.
Given that DecFP is not currently usable on 0.6, I'm not planning on supporting it for a 0.6 release. The lack of Windows support has also been an ongoing issue that DecimalNumbers will resolve.
The Windows support doesn't affect us (we only deploy on Linux, and do a bit of development on MacOS). I've been so busy getting our product moved to v0.5.1 and finally off of v0.4.7, that I hadn't seen yet that it had problems with v0.6 at the moment. Would it be an option at least to add support to DecimalNumbers to convert back and forth to the IEEE formats? (i.e. would you be interested in a PR implementing that at some point)? Are you planning on handling +/-Inf, and NaN in DecimalNumbers?
(DecFP works in 0.6 now.)
DecFP is also pretty fast (compare it to BigFloat, for example, it is significantly faster)
(Windows works now for DecFP. Finally.)
Has anyone made/seen any recent performance benchmarks (even if not rigorous) between the Intel decimal floating point library and mpdecimal (which is used e.g. by Python)?
mpdecimal is an arbitrary-precision decimal floating-point library. It will undoubtedly be orders of magnitude slower than a library like the Intel one that uses predetermined precisions that fit into CPU registers, unless they've special-cased 64-bit precision.
I'm asking because mpdecimal has some old (I'd guess maybe 10 years old) benchmarks which indicate that Intel BID64 is only about 2x the speed and BID128 is surprisingly about 0.8x the speed. So apparently mpdecimal uses some very cool magic. So it's nothing like orders of magnitude slower, but actually just a bit slower, and sometimes even faster.
But I'd like to see some recent comparison.
@ScottPJones points out that the IBM decNumber package (under the ICU license) may be more portable than the Intel package, and supports more formats. We might consider switching to it, instead of the Intel library.
(A more complex option would be to link to both libraries: using Intel where the performance is better, and decNumber where it supports more functionality.)
Would be interesting to compare the performance of the two libraries.