Title and post content edited. Printing is debatable (though it has nothing inherited from C). Representation (storage) is not.
The "idiosyncratic representation" of unsigned integers mirrors their idiosyncratic behaviors: whereas signed integers overflow at numbers far away from zero, unsigned integers wrap around right where things get interesting — and where overflow bugs are much more likely to occur. All it takes is writing a - b
in the wrong order. As such, I find general arithmetic on unsigned integers or the "enforcement" of a domain constraint with them to be fraught with trouble. The exception, of course, is bitwise arithmetic where you're explicitly treating the integer as a "bag of bits" and thus don't need to worry about overflow (or explicitly rely upon it). So we emphasize that use-case in its printing.
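A quick REPL illustration of that asymmetry (plain Base Julia; UInt8 chosen just to keep the numbers small):

julia> 0x02 - 0x03    # unsigned: wraps around right next to zero
0xff

julia> 2 - 3          # signed: the suspicious value is plainly negative
-1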
I very much appreciate the current printing and I think many contributors feel similarly. In fact, I would go so far as to say that it's working as designed if it pushed you to use signed integers. :)
In other languages: https://stackoverflow.com/questions/30395205/why-are-unsigned-integers-error-prone https://www.youtube.com/watch?v=wvtFGa6XJDU
Personally, I am annoyed by the printing as well and don't see a good reason for printing unsigned integers in hex. If one wants to read numbers in hex, we have functionality for that (string(x; base=16)). The argument that they are somehow printed badly on purpose, to make them annoying enough not to be used, feels a bit patronizing.
I don't think the point is to make them annoying, it's just to indicate type information, which show and display generally do. print already prints unsigned integers in decimal.
If it is about showing type information, then a suffix like u (or something similar) seems like a simpler thing to do than changing the base.
Let's revisit this in a couple of years when we're considering 2.0. Personally, I'm quite pleased with the way this works, but of course, I designed it in the first place. Reasoning given in quite possibly my least popular StackOverflow answer—https://stackoverflow.com/a/27351455/659248:
This is a subjective call, but I think it's worked out pretty well. In my experience when you use hex or binary, you're interested in a specific pattern of bits – and you generally want it to be unsigned. When you're just interested in a numeric value you use decimal because that's what we're most familiar with. In addition, when you're using hex or binary, the number of digits you use for input is typically significant, whereas in decimal, it isn't. So that's how literals work in Julia: decimal gives you a signed integer of a type that the value fits in, while hex and binary give you an unsigned value whose storage size is determined by the number of digits.
Or to put it another way: if you're annoyed by the printing of an unsigned integer, you should probably be using a signed integer type instead.
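For concreteness, here is roughly how those literal rules play out in the REPL (typeof results shown for a 64-bit machine):

julia> typeof(255)      # decimal literal: signed Int
Int64

julia> typeof(0xff)     # two hex digits: UInt8
UInt8

julia> typeof(0x00ff)   # four hex digits: UInt16
UInt16

julia> typeof(0b1010)   # binary literal: also unsigned
UInt8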
I need to store in a very large array a set of numbers that can only take 4 integer values (say 0, 1, 2, 3). I want to minimize the memory this array takes. So I essentially need an array with UInt2 elements, but this is not available in Julia. Instead I use UInt8, in which I can store four numbers per element by encoding them (multiplying them by 1, 4, 16 and 64, respectively). I couldn't use Int8 with such a simple encoding (I would need to add -128 to these). The fact that I was forced to use hex was quite annoying and error-prone at the beginning, though I have now got used to it.
If using hex is the price for achieving better performance, I don't mind paying it. But it doesn't look like anyone here is arguing for it on performance or memory grounds.
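For what it's worth, a minimal sketch of the packing scheme described above (the pack/unpack names are made up for illustration, and each value is assumed to already be in 0:3):

julia> pack(v1, v2, v3, v4) = UInt8(v1) + UInt8(v2) * 0x04 + UInt8(v3) * 0x10 + UInt8(v4) * 0x40;

julia> unpack(b::UInt8) = ((b >> 0) & 0x03, (b >> 2) & 0x03, (b >> 4) & 0x03, (b >> 6) & 0x03);

julia> pack(1, 2, 3, 3)
0xf9

julia> unpack(0xf9)
(0x01, 0x02, 0x03, 0x03)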
If using hex is the price for achieving better performance
This is about printing only. There's nothing performance- or memory-related about this. The available types should not impose any limitation on your storage format. Depending on your needs, there are many alternative ways to be more memory efficient, and if you need help, please ask on Discourse instead.
I find the hexadecimal representation of unsigned integers pedagogically very useful, and very much hope that future versions of Julia keep it that way. It is a common beginner's mistake to use unsigned integers merely because the programmer initially thought that the variable will never hold negative values. Only later do they discover that these variables will then be used in differences, or compared with other expressions (such as differences) that may be negative, and therefore require signed comparison, and then a whole world of pain emerges from unsigned types because understanding the exact overflow behaviour of mixed-signedness integer arithmetic becomes very complex, unintuitive and error prone.
It is much better to always stay with signed integers, unless you absolutely can't avoid unsigned-integer operations. That's why some programming languages newer than C/C++ (most notably Java) even avoided support for unsigned integer types.
There are only very few good reasons for ever using an unsigned integer type:
In particular for the last case, hexadecimal display of such values is usually more convenient anyway.
Please keep the display of unsigned integers hexadecimal in all future generations of Julia, to deter your children and your children's children from using them unnecessarily. It's not a bug, it's a wonderful feature to let programmers know that they are probably doing something questionable.
Yes, very well put. I think we can safely close this. The answer to this "problem" is that if you want something that prints as decimal then you should be using a signed integer type. The usual reason people use unsigned integer types is because they believe it's a better choice when a value is not supposed to be negative. However, that isn't true—it's almost always still better to use a signed integer type. As you say, it often happens that you were shortsighted about not needing negative values, and even if you never need negative values for correct code, using a signed integer type allows correctly diagnosing when a negative value has mistakenly been produced (e.g. someone wrote a - b instead of b - a), whereas if the type is unsigned, the only hint of what the bug might be is that you got a suspiciously large value.
I really don't buy the argument that one should make something intentionally annoying to use because the thing is not needed most of the time. The reason it exists at all is that it does have uses, and when that use case comes, it will be annoying.
Teaching how to use things properly should be left to documentation. The implementation should assume that it is being used correctly and provide the best usability possible.
It’s not intentionally annoying. Hex is much better for most appropriate uses of unsigned integers.
I am responding to the comment that you said was "very well put" that made this argument.
Please keep the display of unsigned integers hexadecimal in all future generations of Julia, to deter your children and your children's children from using them unnecessarily. It's not a bug, it's a wonderful feature to let programmers know that they are probably doing something questionable.
It’s only annoying if you’re using unsigned values wrong, e.g. using them to count things. If you’re using unsigned values right then it’s beneficial, because what you want is a bit pattern.
I understand you - Julia developers - need to make choices, but I disagree that these should be made based on beliefs like "almost always" or "most appropriate uses" without providing any examples, so that people could actually appreciate the importance of your choice. Still, there will be many situations in which people use unsigned integers appropriately but would prefer decimal printing (which, as someone already mentioned, can be done using print).
In particular, I use unsigned integers for encoding qualitative and ordinal variables. I would never think about doing arithmetic on these variables, but I would loop through them forwards or backwards (see e.g. #29801), or I would do arithmetic for encoding reasons (for instance, I need to encode a qualitative variable that takes 3 values in an efficient way, so I use 0x01, 0x02, and 0x03, and then take powers of 4 to encode four of these values into a single UInt8, via assignments like (variable_temp * 0x04) + 0x01).
Would you say that these uses are not appropriate?
Here are some quick examples of why a hexadecimal representation of an unsigned integer value can be vastly more user-friendly than a decimal one:

18446744073709551612? Displayed as 0xfffffffffffffffc, you see immediately that this is 2^64-1-3, or a word with all but the two least significant bits set to 1.

16843009? Displayed as 0x01010101, you see immediately why this is the number you need to multiply an 8-bit fixed-point value in the (unsigned) range [0,1] with to convert it to the corresponding 32-bit fixed-point value in the range [0,1], as in 0xf9 * 0x01010101 == 0xf9f9f9f9. Do you really prefer to see this unsigned integer expression as 249 * 16843009 == 4193909241?

1099512414213? Displayed as 0x00000100000c0005, you can spot instantly that this is a 64-bit word with only 5 bits set to 1, and you can instantly figure out in your head that these bits are set at locations 0, 2, 18, 19 and 40: 1<<0 + 1<<2 + 1<<18 + 1<<19 + 1<<40.

67305985? Displayed as 0x04030201, you instantly recognize ENDIAN_BOM. And 16909060 then?

4278190335? Is 0xff0000ff not a more obvious opaque red?

[194, 169, 50, 48, 49, 57]? Displayed as [0xc2, 0xa9, 0x32, 0x30, 0x31, 0x39] or 0x39313032a9c2 (on a little-endian machine), you instantly recognize that this might be a UTF-8 string ending in the four decimal digits 2019, i.e. Vector{UInt8}("©2019"). (In ASCII bytes, the leading hexadecimal digit indicates the code table column, and non-ASCII characters in UTF-8 all consist of a byte in the range 0xc0–0xf7, followed by one or more bytes in the range 0x80–0xbf, which are all neat and easy-to-remember hexadecimal boundaries.)

Or the packing of four 2-bit values discussed earlier: 0x1 * 0x40 | 0x2 * 0x10 | 0x3 * 0x04 | 0x3 * 0x01, i.e. [1,2,3,3].

Most high-performance programmers soon end up memorizing the 4-bit patterns behind the 16 hexadecimal digits. Imagine you get any of the above questions posed in decimal rather than hexadecimal during a job interview, because the company has a policy against using hexadecimal numbers. Would you really want to take that job?
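All of the above can be checked directly in the REPL with nothing beyond Base; for example:

julia> UInt64(18446744073709551612)
0xfffffffffffffffc

julia> 0xf9 * 0x01010101
0xf9f9f9f9

julia> count_ones(0x00000100000c0005)
5

julia> String([0xc2, 0xa9, 0x32, 0x30, 0x31, 0x39])
"©2019"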
@mgkuhn thank you for an illustrative range of examples. I agree: if the reason for this is performance (e.g. fewer translations from decimal to assembly and fewer conversions between types), then it can be a bit inconvenient (e.g. requiring learning) or annoying (e.g. requiring getting used to a particular syntax) for the programmer. But then performance should be the reason given by the developers in the discussion and in the documentation. It could also be written explicitly in the documentation that unsigned integer types exist to facilitate the encoding of nominal and ordinal variables (such as colors or text), to emphasize their overflow behaviour, and to discourage their (error-prone) use for numerical variables. For the above reasons, I would also appreciate adding UInt1, UInt2, and UInt4, and improving/correcting/rethinking the functions that deal with looping (including backwards and unordered looping) through UInt variables.
This is about printing only. There's nothing performance or memory-related about this.
Heh... may as well bump this, 4 years later.
I am more comfortable than most thinking in binary; e.g. *nix file modes, IP addresses, etc. are much easier for me to make sense of when the digits are encoded in a power-of-2 base (and/or grouped by the typical computer "word" length). But all that is just an artifact of the limitation of most computer architectures (e.g. life can encode in quaternary [A, C, G, T]). The following symbols are an artifact of human evolution: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (decimal; that many digits on our hands). Rather than making up and using a couple of symbols unique to binary encoding, we borrow from the decimal set [0, 1]. When the binary display gets too long to handle/read, we compress multiple bits (binary digits) into octal or hexadecimal digits. We run out of decimal symbols for hexadecimal. Here, again, rather than using symbols specific to a new encoding, we borrow from a set already familiar to humans: [a, b, c, d, e, f]. So, in all cases, we are already representing things in a way biased toward human evolution.
It just seems weird, confusing, and inconsistent to display two integer types in two fundamentally different encodings. And, since a hexadecimal representation is already arbitrarily biased toward what humans are comfortable with (e.g. it reuses as many of the decimal symbols as it can, rather than making up its own, like DNA), why not just go all the way and make it consistent with all the other representations (e.g. in a REPL) which are designed to make the most sense to most humans? If you are doing lower-level things like sysadmin work or computer engineering in this high-level language, why not just convert to hexadecimal when it helps you see the bit patterns most meaningful to low-level encodings and processing (e.g. masking, adding, etc.)? No matter what, in Julia you still have to rely on error handling to deal with overflow. OK, so with a signed Int you might get a signal when something unexpectedly goes negative. But that is also about twice as likely to happen, for a given bit width, vs. UInt, simply because there are fewer values to use before you run out. Either way, you have to manually deal with a given size limitation.
Computers are for modeling the physical world for humans. Physically speaking, negative numbers are a bit unnatural. Signed numbers are more like a shortcut/workaround: we want a symbolic way to represent seeing another perspective without leaving ours, or to represent something unnatural like running a process in reverse (time travel, effectively). So maybe signed numbers should get the special, uncommon hex display, to clearly signal that weirdness? Obviously such "shortcuts" often simplify, or make certain problems tractable. But I think getting rid of signed Ints makes about as much sense as any argument that it makes sense to display UInt in a special way by default.
For a recent use case, a variable total is meant to "model" the total number of records to retrieve from a data API. It makes no sense for that number to be negative (am I going to give data to that "get" REST API?). I can only imagine code referring to total, vs. mutating it (treated like a local const).
I suppose, with signed Int, about half the time you'll end up in a state you can possibly use as a sentinel for an overflow error. But it seems wasteful to use half of the data space in a type for different versions of the same error code.
I frequently parse and process binary data of somewhat "dynamic" nature (mostly Garmin FIT and GoPro GPMF telemetry formats). Many values are stored as unsigned integers (there are responses above as to why even the possibility to store signed types makes no sense for some data, and can even be a help or insurance).
This parser journey started with Python, then Julia, and currently Rust - which I'm now most comfortable with, even though I'm a beginner at all of them. I really want Julia to be my "better" Python, for all the reasons Julia first came to be - the type system being one of the important factors. Julia's REPL is my go-to calculator etc and is open at all times. There's so much to like.
Yet this seemingly "insignificant" decision of printing unsigned ints as hex makes me temporarily abandon Julia every time. When parsing, I want the decimal representation of all numerical values, and it is important that the initial parse preserves the type as it was stored, without casting to another type - that's for later. There is also nothing special about the unsigned int: it's all numerical data, but the choice between storing as signed or unsigned is often intentional. Latitude may be stored in semicircles as Int32, a speed scalar as UInt32, etc.
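As a small illustration of that kind of parsing (the bytes and field layout here are invented, and host byte order - little-endian on most machines - is assumed):

julia> io = IOBuffer(UInt8[0x01, 0x00, 0x00, 0x80, 0xff, 0xff, 0xff, 0x7f]);

julia> speed = read(io, UInt32)   # stored unsigned, shown by the REPL in hex
0x80000001

julia> lat = read(io, Int32)      # stored signed, shown in decimal
2147483647

julia> println(speed)             # print, unlike show, uses decimal
2147483649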
This thread at the forums is frankly odd: no one even suggests how to print unsigned ints as decimal (@printf or an interpolated string, I guess), but people instead seem to go out of their way to give reasons for not using unsigned ints in the first place.
Julia isn't even consistent for println:
julia> println(UInt8(255))
255
julia> println(UInt8[255, 255])
UInt8[0xff, 0xff]
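For completeness, a few ways to get decimal output today with just Base and the Printf stdlib (none of them change the default show):

julia> println(0xff)                   # print/println use decimal for scalars
255

julia> string(0xff)                    # string does too
"255"

julia> println(Int.(UInt8[255, 255]))  # convert the element type just for display
[255, 255]

julia> using Printf

julia> @printf("%d\n", 0xff)
255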
Rust - which can be used for systems programming (think printing memory addresses etc.) - doesn't take the Julia stance. Via evcxr:
>> dbg!(2_u8)
[src/lib.rs:22] 2_u8 = 2
2
>> 255_u8
255
>> 256_u8
[overflowing_literals] Error: literal out of range for `u8`
// And if I want hex, it's simple enough, similar to Julia's printf (upper case `X` for upper case hex):
>> println!("{:X}", 255_u8);
FF
>> println!("{:x?}", vec![255_u8, 255_u8]);
[ff, ff]
I don't expect this to change, so I'll find a way to live with it. I do, however, find the tone and arguments around this issue surprising. I also worry a bit about having to write my own show implementation for all data structures I need.
MATLAB has a quite useful format command to change the default output display style of numbers when values are printed by its REPL. It permits control over a choice of different floating-point number formats (short, long, E, G, Eng), as well as integers (decimal, hex), and the amount of white space (compact, loose) used when values are output.
Julia already has an IOContext type that similarly controls some pretty-printing parameters (e.g., compact, limit, displaysize, typeinfo, color).
My suggestion would be to extend Base.IOContext to also provide control over number formatting, similar to what MATLAB's format offers, including the integer base (decimal or hex) and the floating-point format.
Note that this would not be a change to the Julia language or syntax, or even to the default behaviour of the REPL; it would just be an extension of the capabilities of Base.IOContext and some of the show methods involved.
Users annoyed by hexadecimal UInt8[0xff, 0xff] displays could then simply change their dec/hex preference via an IOContext, including the REPL's default one in the Base.active_repl.options.iocontext dictionary, and instead get UInt8[255, 255] shown.
I suspect that may be a very useful extension that might address large parts of the UInt concerns raised in this issue (and also give users more control over the float array notation).
Some things could be done better than MATLAB, e.g. instead of just short/long float notation, we could allow more fine-grained control over the number of significant digits shown (lossy output).
For older programmers, like myself, who prefer to think in octal (much smaller multiplication table than denary or senidenary!), how about allowing
Base.active_repl.options.iocontext[:uint_base] = 8;
Base.active_repl.options.iocontext[:uint_digits] = 0;
to switch from
julia> show(UInt8[0,8,64,255])
UInt8[0x00, 0x08, 0x40, 0xff]
to
julia> show(UInt8[0,8,64,255])
UInt8[0o0, 0o10, 0o100, 0o377]
i.e. in octal without any leading zeros, as unsigned integers were really meant to be appreciated.
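A rough sketch of how such a preference might be consulted by a show-like method (the :uint_base key and the show_unsigned name are hypothetical, not existing Base API):

# hypothetical helper: query an IOContext property and fall back to the current hex default
function show_unsigned(io::IO, x::Unsigned)
    base = get(io, :uint_base, 16)
    prefix = base == 16 ? "0x" : base == 8 ? "0o" : ""
    pad = base == 16 ? 2 * sizeof(x) : 1     # keep the fixed-width style only for hex
    print(io, prefix, string(x, base = base, pad = pad))
end

# show_unsigned(stdout, 0x2a)                              prints 0x2a (unchanged default)
# show_unsigned(IOContext(stdout, :uint_base => 10), 0x2a) prints 42
# show_unsigned(IOContext(stdout, :uint_base => 8), 0x2a)  prints 0o52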
Personally, I'd be perfectly happy with such a config option, or with syntax for formatted print (for print statements outside of the REPL). But I guess @printf is there for the latter.
I would push in the other direction: unsigned numbers should print in hex in more situations. For example, string(0xFF) should be "0xFF".
After consulting on the Slack channel, I am still not aware of a good reason why Unsigned types are printed in the hexadecimal base (it looks like it was a decision inherited from C). But there are a number of reasons to represent them in the usual decimal base. I associate them with natural numbers, but their idiosyncratic representation makes them unintuitive to interpret, so I end up opting for Integers just for the sake of decimal representation even when only Unsigned would suffice.