crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.47k stars 1.62k forks source link

Sum of Int32 and Int64 values together should be always promoted to Int64 #567

Closed ssvb closed 8 years ago

ssvb commented 9 years ago

The Crystal code:

v32 = 2147483647.to_i32
v64 = 2147483647.to_i64
one = 1.to_i32
STDOUT << "Int64 + Int32 :" << v64 + one << "\n"
STDOUT << "Int32 + Int64 :" << one + v64 << "\n"
STDOUT << "Int32 + Int32 :" << v32 + one << "\n"
STDOUT << "Int32 + Int32 :" << one + v32 << "\n"

The Crystal result:

Int64 + Int32 :2147483648
Int32 + Int64 :-2147483648 <= This is bad!
Int32 + Int32 :-2147483648
Int32 + Int32 :-2147483648

This behaviour is not very nice, because we are losing the http://en.wikipedia.org/wiki/Commutative_property

The equivalent C++ code:

#include <iostream>
#include <inttypes.h>

int main(void)
{
  int32_t v32 = 2147483647;
  int64_t v64 = 2147483647;
  int32_t one = 1;
  std::cout << "Int64 + Int32 : " << v64 + one << "\n";
  std::cout << "Int32 + Int64 : " << one + v64 << "\n";
  std::cout << "Int32 + Int32 : " << v32 + one << "\n";
  std::cout << "Int32 + Int32 : " << one + v32 << "\n";
  return 0;
}

The C++ result:

Int64 + Int32 : 2147483648
Int32 + Int64 : 2147483648 <= This is good
Int32 + Int32 : -2147483648
Int32 + Int32 : -2147483648

The other arithmetic operations with integer values of different size should also behave in a similar way. So that the result should be promoted to the larger type, regardless of the order of the operands in an expression. Otherwise Int64 variables may be sometimes accidentally demoted to Int32 size in the expressions which are, for example, using integer literals.

asterite commented 9 years ago

Yes, I think when doing A op B the result should be of the largest type, so the current behaviour is buggy.

The only question is what happens when you do A op B and A is signed and B is not. For example 1_i32 + 1_u32, what's the type of that?

ssvb commented 9 years ago

You have a good point about the potential troubles involving a mix of signed/unsigned variables in the same expression. I have filed a related bug #571 about their usage in comparisons.

Putting a Ruby programmer hat on, I would say that we don't even strictly need unsigned integer types at all. For example, the change from Int32 to UInt32 only extends the available range of the representable positive numbers a little bit. Going from Int32 to Int64 also solves this problem, albeit at the expense of increasing the storage space requirements and a bit of performance degradation. And even if Int64 is not enough, we still have BigInt.

Putting a C/C++ programmer hat on, I can see the usefulness of unsigned integer types for certain types of algorithms (encryption, prng, etc.) where they offer significantly better performance. And also bindings for the external libraries may use unsigned integer types for certain data structures and function arguments.

I personally would probably treat a mix of signed and unsigned variables in the same expression as a compile time error. Except for the case when we have something like a mix of UInt32 and Int64 variables, where UInt32 can be safely converted to Int64 without any data loss.

asterite commented 9 years ago

@ssvb Today we brainstormed a bit with @waj about this. We'd like to try this: if you have L op R then R must be in L, and then result will be of the type of L. For example:

1_i32 + 1_i32 # ok, result is Int32
1_i32 + 1_i8 # ok, Int8 fits in Int32, result is Int32
1_i32 + 1_i64 # error, Int64 doesn't fit in Int32 (the error will be the usual "no overload matches")

That will avoid unwanted unions in the case where you do a += value, and also will prevent some possible errors.

The bad thing is that this will give an error:

a = 1_i8
a += 1

because Int32 doesn't fit in Int8. Of course, this is obvious for a number literal without an explicit type annotation (like 1_i32) so we'll maybe add an extra rule for untyped number literals to be able to be autocast to the right type in this case. But this is maybe, it's just something nice to have.

What do you think?

As for operations between signed and unsigned type, they will always give a compile-time error (i.e., those overloads won't be available).

ssvb commented 9 years ago

@asterite Thanks, this looks reasonable to me. I have already found the relevant parts of the compiler code and have been experimenting with this stuff myself. Introducing stricter checks or implicit widening causes a bunch of compilation failures in the standard library, many of these in fact really look suspicious and are worth fixing anyway.

Yes, the implicit autocast for number literals should be quite useful too.

vjdhama commented 8 years ago

@asterite is there any update on this issue?

asterite commented 8 years ago

@vjdhama Not yet, this has less priority then other things.

ssvb commented 8 years ago

It's this time of year again and I would like to use the Crystal compiler at the Google Code Jam 2016 contest for the performance critical code. To move things forward, now I tried to make a patch (pull request #2446) which only takes care of promoting results to a larger data type for only a few relatively non-controversial cases (Int32 to Int64, UInt32 to Int64 and UInt32 to UInt64). Unlike a more generic solution mentioned in my previous comment, this does not cause massive compilation failures in the Crystal standard library and is also able to pass the test suite.

In practice, this fix is important for my old Crystal based solution for the Round 1B 2015, Problem C (Hiking Deer) problem. Currently this code is littered with the casts of numeric literals to Int64, and accidentally forgetting even a single one is a recipe for failure! But using a patched Crystal compiler, I can safely replace all the occurrences of 360.to_i64 in the code with just 360 :-)

asterite commented 8 years ago

@ssvb Great! I used Crystal for adventofcode and it was great, code was small in most cases and with very good performance (in Ruby similar programs usually take a lot more time, mostly due to math operations).

We'll review #2446 with @waj some weeks later, right now it's kind of impossible for us to do. I'm still not sure about that solution. For example this is probably unexpected:

a = 1_i64
b = 1
b += a # same as b = b + a, so now b is suddenly and silently an Int64

The current rule is: the type is the one from the left hand side of the operation. In #2446 you put "(1 + a) * b" as an example, and that's easily fixable if you remember that rule. So you change it to "(a + 1) * b" and it will have the correct type.

I'm not saying this rule is the one that has to stay, this is still an issue for which we'll have to figure the best solution (this should be considered too).

asterite commented 8 years ago

I'm positive that we will fix this before 1.0 as this is very important... we at least have #571 fixed now 😄.

ozra commented 8 years ago

@asterite, I think the b += a being promoted should be perfectly OK.

I think this is an important issue to address, and type promotion in a well defined fashion takes the coders mind of unnecessary thinking for stuff that could just work.

Personally I don't like the L - R ordering requirements put on the coder.

Just my 2 mBTC...

asterite commented 8 years ago

@ozra Though it wouldn't be just type promotion, there's no such thing in Crystal: if for example inside a loop, b will become a union of Int32 and Int64, which is probably not what you want (slower).

ozra commented 8 years ago

I meant type promotion in a loose sense. Yes the loop case shows an uglier side of it I didn't take into account, damn :-/

asterite commented 8 years ago

I'm closing this. The most correct behaviour for this language, where you can have @a += b, is to make the expression always have the type of a to avoid creating unions. So in general A op B has a type A. There are a few exceptions, for example when A is an integer and B is a float, but if both types are integers or floats, the type remains that of A.

ssvb commented 8 years ago

Fair enough. Still with such a behaviour, it is very difficult to write correct code when doing any kind of numeric calculations. I do think that the loss of the traditional commutative property is a bad decision. The unions of basic integers with different size is also IMHO a nonsense thing in general. Simply promoting them to the largest type would even use storage space more efficiently in containers, such as arrays.

ozra commented 8 years ago

Agree that "machine-near-types" (low-level) should be given extra attention by the compiler. While handling high level types uniformly is good, clean and easier to maintain, the horizon meeting the machine should be respected. For instance, the loop case: a += b. The better thing to do is probably to promote Tx to Ty (where Ty is wider than Tx), and then:

So, this will be special case and will handle only machine-near types.

It could even be done in a separate pass after the common inference, since it's specific goal is "respect for the machine", not the abstract type system.

Regarding unsigned, I have thought about it for a long while but not issued: there should be a Natural, Nat32, Nat64, etc. type that is 0..BitWidthSignedMax (not using the complement) to be favoured everywhere one needs natural number, and spare unsigned for use only in interfacing with C where required, and very special uses where the extra bit in a 64 bit Unsigned really is needed.

asterite commented 8 years ago

The main issue with automatic promotion is that later you'll want Int32 to be promoted to BigInt, and so on. Should BigInt be known by the compiler then? What if a user introduces user-defined numeric types?

Eventually, we'd like these rules to be customizable, probably in the form of implicit conversions. Then, implicit conversions could be applied to any type, not just numeric types. Then the language becomes a mess because no one understands how a type got to a point in the first place because it was implicitly converted at some previous point (no syntax clue for this).

If math operators don't do promotions and you are forced to use them, this problem goes away (though you have to write a bit more, but just that)

If you know your program will likely reach Int64 or BigInt levels, use those types in the first place. That's my advice. Then using "lower" types will always work: Int64 + Int32 => Int64, BigInt + Int32 => Int32, etc. You just have to keep the bigger type on the left hand side.

ssvb commented 8 years ago

The main issue with automatic promotion is that later you'll want Int32 to be promoted to BigInt, and so on.

No, we will not. I'll try to explain.

One reason being that the implicit promotion to BigInt is already working in a reasonably predictable way, you can easily see this with the following example:

require "big_int"

a = 1.to_big_i
b = 1 + a

pp typeof(a) # => BigInt
pp typeof(b) # => BigInt

c = 1.to_i64
d = 1 + c

pp typeof(c) # => Int64
pp typeof(d) # => Int32

Another reason is the performance. Basic integer types have very similar performance, regardless of their size. Yes, Int64 is somewhat slower than Int32 on a 32-bit machine, but the performance of 64-bit arithmetic operations is still very much reasonable (that's where we need some benchmarks in this discussion, and I can provide them if you care). The real practical difference and the reason to have things like the Int8 type is the storage space saving concern when you have large arrays in memory.

For comparison, BigInt is orders of magnitude slower than native integer types. So while I don't care too much about the size of the basic integer types when they are used for automatic variables on stack, BigInt can be a big performance problem.

Eventually, we'd like these rules to be customizable, probably in the form of implicit conversions. Then, implicit conversions could be applied to any type, not just numeric types. Then the language becomes a mess because no one understands how a type got to a point in the first place because it was implicitly converted at some previous point (no syntax clue for this).

Too late :-) The language is already is a mess when dealing with integer types of different size. Just some people may not be impacted too much because they are primarily relying on Int32 variables in their code.

If math operators don't do promotions and you are forced to use them, this problem goes away (though you have to write a bit more, but just that)

Yeah, thanks :-) So I'm supposed to be typing more when compared to using other programming languages. What was one of the main "selling" points of Crystal again?

If you know your program will likely reach Int64 or BigInt levels, use those types in the first place. That's my advice. Then using "lower" types will always work: Int64 + Int32 => Int64, BigInt + Int32 => Int32, etc. You just have to keep the bigger type on the left hand side.

Right. And we also probably need to use cool pointer arithmetics more :-) Because, you know, using pointers correctly will always work. And they are perfectly safe if used correctly.

asterite commented 8 years ago

Well, many modern languages don't do arithmetic promotion, for example Rust, Go and Swift (the three top languages according to GitHub). I guess they realized that such implicit conversions are nice because you get to type less, but are probably bug prone in many other ways.

Now, we didn't chose not to have this in Crystal. We just kicked the problem for later. But, seeing that these modern languages don't do implicit conversions makes me think that we are also doing it well. We should maybe even want to remove operations between ints and floats (floats on the right-hand side), and require an explicit cast there too.

Can you please provide some code where this is a problem? This is an honest question, I don't usually write math programs and most of the existing programs are also not math oriented, so concrete examples will help to improve this.

ozra commented 8 years ago

[ed: sorry about the long post beating the almost dead horse]

Yes, you're right @asterite , if going that road, it's better to generalize it also. The link you posted to Julia's coercion/conversion and promotion the other day is interesting, I've solved similar issues in a JS-transpiler back in the days. The point of implicitness causing confusion is always valid of course, but I think one should treat this area of language with the same caution as macros. Macros are by far the most "dangerous" part allowing very confusing things to happen, but we don't want that, so we don't ;-) If someone implements weird implicits in a lib, chances are people won't use it until they've leaned it up. Unless it's a very specific SDL case in which case it might be very helpful.

An alternative is to to make conversions and promotions simplified by having conv/promote funcs but their only invoked if a certain prefix is used on the specific values. This makes for lean code (provided the prefix is a one char symbol) utilizing the pre-defined conv/promotion rules, while still being explicit in that something "seemingly magic" is happening. Now I'm of course thinking more about conversion to match a target param of a func or so; for promotion the prefix would have to be applied to an entire call or grouping or so, so a bit trickier to get cleanly looking.

But the only case where I see this really being useful is for literals, because otherwise we'll tend to type the vars correctly (but allowing typedecl for locals would be very helpful for this!). Examples: @a : UInt8 = 47, some_func_taking_f64(47), I know I've mentioned it before, and the BigInt / AnyType argument of course comes up, but I'm talking inbuilt literals which reasonably doesn't really have a specific type (apart from I32 implicitly being imposed when not suffixed). This does require a type prioritization rule of course (if a fn matches Int, what should the literal become? Reasonably Int32 based on current. But if fn(Int16 | Int64)? Then Int64 would be more reasonable. So, of course for literal "reverse-inference" a try scheme like I32 => I64 => Float64 => Float32 => I16 => I8 or something (ignoring UInts in the example), but then again, on a 64-bit arch 64bit first would be more reasonable, but I know the varying bit-width depending on arch is not wanted in Crystal, because of the replies in #2158.

Well, I wanted to air my thoughts in any event.

Regarding the top-langs, I've chosen the crystal path over those ;-) (even though as you know I'm hacking my own syntax on top of it, but I'm talking the essence of Crystal here :-) ) Go doesn't even have generics, so the choices they've made aren't necessarily the way of designing a language in my opinion. Of course, there might also very well be merit to it. Just saying, I don't consider those as authorities. I'd be fine with no implicit convs as long as literals aren't "pre-typed" implicitly to I32. After all, I have been bitten by confusing errors in C++ because of coercion and strange choice of overloads by the compiler, so the problems are definitely real.

ysbaddaden commented 8 years ago

Either case (overflow or promotion) will lead to bugs. I had a bug last week in my HPACK decoder for Crystal, because I didn't promote some UInt8 bytes to Int32. I once had a bug in Ruby because an algorithm required integer overflow...

Both bugs involved dealing with binary protocols, not math, but promoting manually is easier and faster than having to fake an overflow with a modulo...

When doing math (eg. dealing with GIS geometries) I always use Float64, rarely a Float32 if I dont need much precision or much computation that would result in approximations. Most of my use cases don't involve integers, but I guess I'd always use a Int64 unless it's required.

asterite commented 8 years ago

@ozra Yes, for literals maybe something better can be done. For example in Go literals also don't have type, and they can be assigned to another numeric type as long as they fit there. I think this also applies to function calls.

Just to clarify, I'm not against improving this situation. We've improved other numeric stuff, like for example bit shifting is more correct now, and also signed vs. unsigned comparison works well. It's just that we usually fix/improve things when we are convinced that the solution is the best one. This also happened recently regarding parsing unions in JSON, I think we've reached the best solution there. And also removing ifdef in favor of macro {% if %}.

For numeric promotion I still can't see this "best solution", though I'm sure it will eventually appear.

One thought that has always been in my mind is to make Int64 the default type for numeric constants, because 64 bits machines are now the common thing. That will make overflows less likely to happen.

(I wouldn't make default numeric literals depend on the architecture, as that can lead to many bugs as well)

Another thing is that we'd probably like to handle overflows in a better way. LLVM even has intrinsics for them. One idea is to always perform math with overflow checks, and have regions of code with unchecked math. This is what C# does, though I can't remember now if it's the other way around (unchecked by default, checked if asked).

mverzilli commented 8 years ago

Based on @ysbaddaden 's points and @ssvb observations about commutativity, I would even disallow Int64 + Int32.

When the operands don't match, the compiler will help you by forcing you to think what's the best type choice at the moment of writing vs. silently introducing a potential source of difficult to trace bugs. I'm sure this would lead to easier to understand compiler errors (because it's kind of difficult or clumsy to explain that Int64 + Int32 isn't the same as Int32 + Int64).

ozra commented 8 years ago

I agree completely, it's better to go slow and get it right when the solution reveals itself, than to paint the language in to a corner. I think the Int64 as default might be a very good decision. If one wants a faast number crunching system, one does not run a 32-bit machine. Treating 32-bit arch as second citizens sounds like a good idea, and arguably any possible slow down wouldn't be too noticeable. For the cases where one does a lot of integer divisions one can explicitly use Int32 (since that's the current culprit in X86 CPU's regarding I64 speed)

Overflow checks by default, hmm, not to keen on that - I purposely write code so that headroom is available, and for RAM-reduction cases (in arrays, etc.) I'd check that explicitly. Definitely warrant some thinking, that one too...

@mverzilli - yeah, maybe the complete opposite direction is better, than something just in between (regarding non-literals ;-) ).