Title and post content edited. Printing is debatable (though it has nothing inherited from C). Representation (storage) is not.
The "idiosyncratic representation" of unsigned integers mirrors their idiosyncratic behaviors: whereas signed integers overflow at numbers far away from zero, unsigned integers wrap around right where things get interesting — and where overflow bugs are much more likely to occur. All it takes is writing a - b
in the wrong order. As such, I find general arithmetic on unsigned integers or the "enforcement" of a domain constraint with them to be fraught with trouble. The exception, of course, is bitwise arithmetic where you're explicitly treating the integer as a "bag of bits" and thus don't need to worry about overflow (or explicitly rely upon it). So we emphasize that use-case in its printing.
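A quick REPL illustration of that asymmetry (plain Base Julia; UInt8 chosen just to keep the numbers small):

julia> 0x02 - 0x03    # unsigned: wraps around right next to zero
0xff

julia> 2 - 3          # signed: the suspicious value is plainly negative
-1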
I very much appreciate the current printing and I think many contributors feel similarly. In fact, I would go so far as to say that it's working as designed if it pushed you to use signed integers. :)
In other languages: https://stackoverflow.com/questions/30395205/why-are-unsigned-integers-error-prone https://www.youtube.com/watch?v=wvtFGa6XJDU
Personally, I am annoyed by the printing as well and don't see a good reason for printing unsigned integers in hex. If one wants to read numbers in hex, we have functionality for that (string(x; base=16)). The argument that they are somehow printed badly on purpose, to make them annoying enough not to be used, feels a bit patronizing.
I don't think the point is to make them annoying, it's just to indicate type information, which show and display generally do. print already prints unsigned integers in decimal.
If it is about showing type information, then a suffix like u (or something similar) seems like a simpler thing to do than changing the base.
Let's revisit this in a couple of years when we're considering 2.0. Personally, I'm quite pleased with the way this works, but of course, I designed it in the first place. Reasoning given in quite possibly my least popular StackOverflow answer—https://stackoverflow.com/a/27351455/659248:
This is a subjective call, but I think it's worked out pretty well. In my experience when you use hex or binary, you're interested in a specific pattern of bits – and you generally want it to be unsigned. When you're just interested in a numeric value you use decimal because that's what we're most familiar with. In addition, when you're using hex or binary, the number of digits you use for input is typically significant, whereas in decimal, it isn't. So that's how literals work in Julia: decimal gives you a signed integer of a type that the value fits in, while hex and binary give you an unsigned value whose storage size is determined by the number of digits.
Or to put it another way: if you're annoyed by the printing of an unsigned integer, you should probably be using a signed integer type instead.
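For concreteness, here is roughly how those literal rules play out in the REPL (typeof results shown for a 64-bit machine):

julia> typeof(255)      # decimal literal: signed Int
Int64

julia> typeof(0xff)     # two hex digits: UInt8
UInt8

julia> typeof(0x00ff)   # four hex digits: UInt16
UInt16

julia> typeof(0b1010)   # binary literal: also unsigned
UInt8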
I need to store in a very large array a set of numbers that can only take 4 integer values (say 0, 1, 2, 3). I want to minimize the memory this array takes. So I essentially need an array with UInt2 elements, but this is not available in Julia. Instead I use UInt8, in which I can store four numbers per element by encoding them (multiplying them by 1, 4, 16 and 64, respectively). I couldn't use Int8 with such a simple encoding (I would need to add -128 to these). The fact that I was forced to use hex was quite annoying and error-prone at the beginning, though I have now got used to it.
If using hex is the price for achieving better performance, I don't mind paying it. But it doesn't look like anyone here is arguing for it on performance or memory grounds.
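For what it's worth, a minimal sketch of the packing scheme described above (the pack/unpack names are made up for illustration, and each value is assumed to already be in 0:3):

julia> pack(v1, v2, v3, v4) = UInt8(v1) + UInt8(v2) * 0x04 + UInt8(v3) * 0x10 + UInt8(v4) * 0x40;

julia> unpack(b::UInt8) = ((b >> 0) & 0x03, (b >> 2) & 0x03, (b >> 4) & 0x03, (b >> 6) & 0x03);

julia> pack(1, 2, 3, 3)
0xf9

julia> unpack(0xf9)
(0x01, 0x02, 0x03, 0x03)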
If using hex is the price for achieving better performance
This is about printing only. There's nothing performance- or memory-related about this. The available types should not impose any limitation on your storage format. Depending on your needs, there are many alternative ways to be more memory efficient, and if you need help, please ask on Discourse instead.
I find the hexadecimal representation of unsigned integers pedagogically very useful, and very much hope that future versions of Julia keep it that way. It is a common beginner's mistake to use unsigned integers merely because the programmer initially thought that the variable will never hold negative values. Only later do they discover that these variables will then be used in differences, or compared with other expressions (such as differences) that may be negative, and therefore require signed comparison, and then a whole world of pain emerges from unsigned types because understanding the exact overflow behaviour of mixed-signedness integer arithmetic becomes very complex, unintuitive and error prone.
It is much better to always stay with signed integers, unless you absolutely can't avoid unsigned-integer operations. That's why some programming languages newer than C/C++ (most notably Java) even avoided support for unsigned integer types.
There are only very few good reasons for ever using an unsigned integer type:
In particular for the last case, hexadecimal display of such values is usually more convenient anyway.
Please keep the display of unsigned integers hexadecimal in all future generations of Julia, to deter your children and your children's children from using them unnecessarily. It's not a bug, it's a wonderful feature to let programmers know that they are probably doing something questionable.
Yes, very well put. I think we can safely close this. The answer to this "problem" is that if you want something that prints as decimal then you should be using a signed integer type. The usual reason people use unsigned integer types is because they believe it's a better choice when a value is not supposed to be negative. However, that isn't true—it's almost always still better to use a signed integer type. As you say, it often happens that you were shortsighted about not needing negative values, and even if you never need negative values for correct code, using a signed integer type allows correctly diagnosing when a negative value has mistakenly been produced (e.g. someone wrote a - b instead of b - a), whereas if the type is unsigned, the only hint of what the bug might be is that you got a suspiciously large value.
I really don't buy the argument that one should make something intentionally annoying to use because the thing is not needed most of the time. The reason it exists at all is that it does have uses, and when that use case comes, it will be annoying.
Teaching how to use things properly should be left to documentation. The implementation should assume that it is being used correctly and provide the best usability possible.
It’s not intentionally annoying. Hex is much better for most appropriate uses of unsigned integers.
I am responding to the comment that you said was "very well put" that made this argument.
Please keep the display of unsigned integers hexadecimal in all future generations of Julia, to deter your children and your children's children from using them unnecessarily. It's not a bug, it's a wonderful feature to let programmers know that they are probably doing something questionable.
It’s only annoying if you’re using unsigned values wrong, e.g. using them to count things. If you’re using unsigned values right then it’s beneficial, because what you want is a bit pattern.
I understand you - Julia developers - need to make choices, but I disagree that these should be made based on beliefs like "almost always" or "most appropriate uses" without providing any examples, so that people could actually appreciate the importance of your choice. Still, there will be many situations in which people use unsigned integers appropriately but would prefer decimal printing (which, as someone already mentioned, can be done using print).
In particular, I use unsigned integers for encoding qualitative and ordinal variables. I would never think about doing arithmetic on these variables, but I would loop through them forwards or backwards (see e.g. #29801), or I would do arithmetic for encoding reasons (for instance, I need to encode a qualitative variable that takes 3 values in an efficient way, so I use 0x01, 0x02, and 0x03, and then take powers of 4 to encode four of these values into a single UInt8, via assignments like (variable_temp * 0x04) + 0x01).
Would you say that these uses are not appropriate?
Here are some quick examples of why a hexadecimal representation of an unsigned integer value can be vastly more user-friendly than a decimal one:

18446744073709551612? Displayed as 0xfffffffffffffffc, you see immediately that this is 2^64-1-3, or a word with all but the two least significant bits set to 1.

16843009? Displayed as 0x01010101, you see immediately why this is the number you need to multiply an 8-bit fixed-point value in the (unsigned) range [0,1] with to convert it to the corresponding 32-bit fixed-point value in the range [0,1], as in 0xf9 * 0x01010101 == 0xf9f9f9f9. Do you really prefer to see this unsigned integer expression as 249 * 16843009 == 4193909241?

1099512414213? Displayed as 0x00000100000c0005, you can spot instantly that this is a 64-bit word with only 5 bits set to 1, and you can instantly figure out in your head that these bits are set at locations 0, 2, 18, 19 and 40: 1<<0 + 1<<2 + 1<<18 + 1<<19 + 1<<40.

67305985? Displayed as 0x04030201, you instantly recognize ENDIAN_BOM. And 16909060 then?

4278190335? Is 0xff0000ff not a more obvious opaque red?

[194, 169, 50, 48, 49, 57]? Displayed as [0xc2, 0xa9, 0x32, 0x30, 0x31, 0x39] or 0x39313032a9c2 (on a little-endian machine), you instantly recognize that this might be a UTF-8 string ending in the four decimal digits 2019, i.e. Vector{UInt8}("©2019"). (In ASCII bytes, the leading hexadecimal digit indicates the code table column, and non-ASCII characters in UTF-8 all consist of a byte in the range 0xc0–0xf7, followed by one or more bytes in the range 0x80–0xbf, which are all neat and easy-to-remember hexadecimal boundaries.)

Or the packing of four 2-bit values discussed earlier: 0x1 * 0x40 | 0x2 * 0x10 | 0x3 * 0x04 | 0x3 * 0x01, i.e. [1,2,3,3].

Most high-performance programmers soon end up memorizing the 4-bit patterns behind the 16 hexadecimal digits. Imagine you get any of the above questions posed in decimal rather than hexadecimal during a job interview, because the company has a policy against using hexadecimal numbers. Would you really want to take that job?
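All of the above can be checked directly in the REPL with nothing beyond Base; for example:

julia> UInt64(18446744073709551612)
0xfffffffffffffffc

julia> 0xf9 * 0x01010101
0xf9f9f9f9

julia> count_ones(0x00000100000c0005)
5

julia> String([0xc2, 0xa9, 0x32, 0x30, 0x31, 0x39])
"©2019"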
@mgkuhn thank you for an illustrative range of examples. I agree: if the reason for this is performance (e.g. fewer translations from decimal to assembly and fewer conversions between types), then it can be a bit inconvenient (e.g. requiring learning) or annoying (e.g. requiring getting used to a particular syntax) for the programmer. But then performance should be the reason given by the developers in the discussion and in the documentation. It could also be written explicitly in the documentation that unsigned integer types exist to facilitate the encoding of nominal and ordinal variables (such as colors or text), to emphasize their overflow behaviour, and to discourage their (error-prone) use for numerical variables. For the above reasons, I would also appreciate adding UInt1, UInt2, and UInt4, and improving/correcting/rethinking the functions that deal with looping (including backwards and unordered looping) through UInt variables.
This is about printing only. There's nothing performance or memory-related about this.
Heh... may as well bump this, 4 years later.
I am more comfortable than most thinking in binary; e.g. *nix file modes, IP addresses, etc. are much easier for me to make sense of when the digits are encoded in a power-of-2 base (and/or grouped by the typical computer "word" length). But all that is just an artifact of the limitation of most computer architectures (e.g. life can encode in quaternary [A, C, G, T]). The following symbols are an artifact of human evolution: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (decimal; that many digits on our hands). Rather than making up and using a couple of symbols unique to binary encoding, we borrow from the decimal set [0, 1]. When the binary display gets too long to handle/read, we compress multiple bits (binary digits) into octal or hexadecimal digits. We run out of decimal symbols for hexadecimal. Here, again, rather than using symbols specific to a new encoding, we borrow from a set already familiar to humans: [a, b, c, d, e, f]. So, in all cases, we are already representing things in a way biased toward human evolution.
It just seems weird, confusing, and inconsistent to display two integer types in two fundamentally different encodings. And, since a hexadecimal representation is already arbitrarily biased toward what humans are comfortable with (e.g. it reuses as many of the decimal symbols as it can, rather than making up its own, like DNA), why not just go all the way and make it consistent with all the other representations (e.g. in a REPL) which are designed to make the most sense to most humans? If you are doing lower-level things like sysadmin work or computer engineering in this high-level language, why not just convert to hexadecimal when it helps you see the bit patterns most meaningful to low-level encodings and processing (e.g. masking, adding, etc.)? No matter what, in Julia you still have to rely on error handling to deal with overflow. OK, so with a signed Int you might get a signal when something unexpectedly goes negative. But that is also about twice as likely to happen, for a given bit width, vs. UInt, simply because there are fewer values to use before you run out. Either way, you have to manually deal with a given size limitation.
Computers are for modeling the physical world for humans. Physically speaking, negative numbers are a bit unnatural. Signed numbers are more like a shortcut/workaround: we want a symbolic way to represent seeing another perspective without leaving ours, or to represent something unnatural like running a process in reverse (time travel, effectively). So maybe signed numbers should get the special, uncommon hex display, to clearly signal that weirdness? Obviously such "shortcuts" often simplify, or make certain problems tractable. But I think getting rid of signed Ints makes about as much sense as any argument that it makes sense to display UInt in a special way by default.
For a recent use case, a variable total is meant to "model" the total number of records to retrieve from a data API. It makes no sense for that number to be negative (am I going to give data to that "get" REST API?). I can only imagine code referring to total, vs. mutating it (treated like a local const).
I suppose, with signed Int, about half the time you'll end up in a state you can possibly use as a sentinel for an overflow error. But it seems wasteful to use half of the data space in a type for different versions of the same error code.
I frequently parse and process binary data of somewhat "dynamic" nature (mostly Garmin FIT and GoPro GPMF telemetry formats). Many values are stored as unsigned integers (there are responses above as to why even the possibility to store signed types makes no sense for some data, and can even be a help or insurance).
This parser journey started with Python, then Julia, and currently Rust - which I'm now most comfortable with, even though I'm a beginner at all of them. I really want Julia to be my "better" Python, for all the reasons Julia first came to be - the type system being one of the important factors. Julia's REPL is my go-to calculator etc and is open at all times. There's so much to like.
Yet this seemingly "insignificant" decision of printing unsigned ints as hex makes me temporarily abandon Julia every time. When parsing, I want the decimal representation of all numerical values, and it is important that the initial parse preserves the type as it was stored, without casting to another type - that's for later. There is also nothing special about the unsigned int: it's all numerical data, but the choice between storing as signed or unsigned is often intentional. Latitude may be stored in semicircles as Int32, a speed scalar as UInt32, etc.
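As a small illustration of that kind of parsing (the bytes and field layout here are invented, and host byte order - little-endian on most machines - is assumed):

julia> io = IOBuffer(UInt8[0x01, 0x00, 0x00, 0x80, 0xff, 0xff, 0xff, 0x7f]);

julia> speed = read(io, UInt32)   # stored unsigned, shown by the REPL in hex
0x80000001

julia> lat = read(io, Int32)      # stored signed, shown in decimal
2147483647

julia> println(speed)             # print, unlike show, uses decimal
2147483649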
This thread at the forums is frankly odd: no one even suggests how to print unsigned ints as decimal (@printf or an interpolated string, I guess), but people instead seem to go out of their way to give reasons for not using unsigned ints in the first place.
Julia isn't even consistent for println:
julia> println(UInt8(255))
255
julia> println(UInt8[255, 255])
UInt8[0xff, 0xff]
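For completeness, a few ways to get decimal output today with just Base and the Printf stdlib (none of them change the default show):

julia> println(0xff)                   # print/println use decimal for scalars
255

julia> string(0xff)                    # string does too
"255"

julia> println(Int.(UInt8[255, 255]))  # convert the element type just for display
[255, 255]

julia> using Printf

julia> @printf("%d\n", 0xff)
255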
Rust - which can be used for systems programming (think printing memory addresses etc.) - doesn't take the Julia stance. Via evcxr:
>> dbg!(2_u8)
[src/lib.rs:22] 2_u8 = 2
2
>> 255_u8
255
>> 256_u8
[overflowing_literals] Error: literal out of range for `u8`
// And if I want hex, it's simple enough, similar to Julia's printf (upper case `X` for upper case hex):
>> println!("{:X}", 255_u8);
FF
>> println!("{:x?}", vec![255_u8, 255_u8]);
[ff, ff]
I don't expect this to change, so I'll find a way to live with it. I do, however, find the tone and arguments around this issue surprising. I also worry a bit about having to write my own show implementation for all data structures I need.
MATLAB has a quite useful format command to change the default output display style of numbers when values are printed by its REPL. It permits control over a choice of different floating-point number formats (short, long, E, G, Eng), as well as integers (decimal, hex), and the amount of white space (compact, loose) used when values are output.
Julia already has an IOContext type that similarly controls some pretty-printing parameters (e.g., compact, limit, displaysize, typeinfo, color).
My suggestion would be to extend Base.IOContext to also provide control over number formatting, similar to what MATLAB's format offers, including the integer base (decimal or hex) and the floating-point format.
Note that this would not be a change to the Julia language or syntax, or even to the default behaviour of the REPL; it would just be an extension of the capabilities of Base.IOContext and some of the show methods involved.
Users annoyed by hexadecimal UInt8[0xff, 0xff] displays could then simply change their dec/hex preference via an IOContext, including the REPL's default one in the Base.active_repl.options.iocontext dictionary, and instead get UInt8[255, 255] shown.
I suspect that may be a very useful extension that might address large parts of the UInt concerns raised in this issue (and also give users more control over the float array notation).
Some things could be done better than MATLAB, e.g. instead of just short/long float notation, we could allow more fine-grained control over the number of significant digits shown (lossy output).
For older programmers, like myself, who prefer to think in octal (much smaller multiplication table than denary or senidenary!), how about allowing
Base.active_repl.options.iocontext[:uint_base] = 8;
Base.active_repl.options.iocontext[:uint_digits] = 0;
to switch from
julia> show(UInt8[0,8,64,255])
UInt8[0x00, 0x08, 0x40, 0xff]
to
julia> show(UInt8[0,8,64,255])
UInt8[0o0, 0o10, 0o100, 0o377]
i.e. in octal without any leading zeros, as unsigned integers were really meant to be appreciated.
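A rough sketch of how such a preference might be consulted by a show-like method (the :uint_base key and the show_unsigned name are hypothetical, not existing Base API):

# hypothetical helper: query an IOContext property and fall back to the current hex default
function show_unsigned(io::IO, x::Unsigned)
    base = get(io, :uint_base, 16)
    prefix = base == 16 ? "0x" : base == 8 ? "0o" : ""
    pad = base == 16 ? 2 * sizeof(x) : 1     # keep the fixed-width style only for hex
    print(io, prefix, string(x, base = base, pad = pad))
end

# show_unsigned(stdout, 0x2a)                              prints 0x2a (unchanged default)
# show_unsigned(IOContext(stdout, :uint_base => 10), 0x2a) prints 42
# show_unsigned(IOContext(stdout, :uint_base => 8), 0x2a)  prints 0o52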
Personally, I'd be perfectly happy with such a config option, or with syntax for formatted print (for print statements outside of the REPL). But I guess @printf is there for the latter.
I would push in the other direction: unsigned numbers should print in hex in more situations. For example, string(0xFF) should be "0xFF".
After consulting on the Slack channel, I am still not aware of a good reason why Unsigned types are printed in the hexadecimal base (it looks like it was a decision inherited from C). But there are a number of reasons to represent them in the usual decimal base. I associate them with natural numbers, but their idiosyncratic representation makes them unintuitive to interpret, so I end up opting for Integers just for the sake of decimal representation even when only Unsigned would suffice.