dotnet / roslyn

The Roslyn .NET compiler provides C# and Visual Basic languages with rich code analysis APIs.
https://docs.microsoft.com/dotnet/csharp/roslyn-sdk/
MIT License
19.02k stars 4.03k forks source link

C# Compiler is not consistent with the CLR for the underlying representation of bool #24652

Closed tannergooding closed 6 years ago

tannergooding commented 6 years ago

Issue

Today the C# compiler does not emit code handling for bool that is consistent with the underlying representation of bool in the CLR.

The C# specification states:

The bool type represents boolean logical quantities. The possible values of type bool are true and false.

The spec also indicates that it is 1-byte. Specifically that the result of the sizeof(bool) expression is 1.

While the ECMA 335 specification states:

A CLI Boolean type occupies 1 byte in memory. A bit pattern of all zeroes denotes a value of false. A bit pattern with any one or more bits set (analogous to a non-zero integer) denotes a value of true. For the purpose of stack operations boolean values are treated as unsigned 1-byte integers (§III.1.1.1).

This can lead to confusion about the handling and cause various inconsistencies in various edge cases (see https://github.com/dotnet/coreclr/pull/16138#discussion_r165256495, for such a thread).

Proposal

It would be good to determine:

Current Behavior

The majority of the behaviors below actually match the CLR expectation that a bool can be more than just 0 or 1. However some of the behaviors (such as &&) does not match this expectation and can cause issues when interoping with any code that would assume otherwise (generally this is some kind of interop or unsafe code).

tannergooding commented 6 years ago

Talked with @MadsTorgersen briefly and it was determined logging an issue for this would be good.

There is the normal concern of this being a breaking change, since any change to the handling of the bools will cause code reliant on that behavior to go down a different code path (this, at the very least, impacts the && code pattern, which treats any number with the first bit set as true, and any other value as false).

Also FYI. @jaredpar, @mikedn, @jkotas

gafter commented 6 years ago

The C# language specification is not specific to any particular runtime platform. The only observable consequence of placing something like that in the specification would be the behavior you get when you overlay a bool and a byte in unsafe code. Specifying that would be a peculiar departure from the normal situation in which the specification avoids telling you in detail the way unsafe code works.

tannergooding commented 6 years ago

The only observable consequence of placing something like that in the specification would be the behavior you get when you overlay a bool and a byte in unsafe code.

Only if you limited such code to C#.

It is entirely possible for C# to be consuming code produced by another language where the bool value conforms to the CLR definition but doesn't line up with the C# implementation. In which case updating the spec would cause the behavior in that scenario to be well-defined instead of undefined or implementation-specific.

gafter commented 6 years ago

See also #12211 #14262 and https://web.archive.org/web/20160318193105/https://roslyn.codeplex.com/workitem/224

@tannergooding What languages do you have in mind that use a different representation for bool?

tannergooding commented 6 years ago

ILASM, in general. C++/CLI for desktop, although it could realistically be any language compiled to IL (such as C or C++). There are several programs for converting between LLVM bytecode and IL bytecode, for example. extern methods are not unsafe and can also return a bool value that is not 0 or 1.

As a hypothetical example:

.method public hidebysig static 
        bool MyTrue () cil managed 
{
    .maxstack 8

    IL_0000: ldc.i4.2
    IL_0001: ret
}

The C# language would see this as static bool MyTrue() and the value returned would be true, according to the ECMA-335 spec (Common Language Infrastructure specification). However, C# would handle it incorrectly for the case of &&.

gafter commented 6 years ago

See the last entry in https://web.archive.org/web/20160318193105/https://roslyn.codeplex.com/workitem/224 for how we resolved this in 2014. The resolution was that we would document it (presumably, in https://github.com/dotnet/roslyn/tree/master/docs/compilers) but we do not appear to have done so.

tannergooding commented 6 years ago

I don't believe the linked resolution is exactly correct.

The last comment is making a statement that the runtime bools have more than two states. However, it could easily be argued that they do in fact only have two states (true, which is a bit pattern with any one or more bits set; or false, which is a bit pattern of all zeros).

For the runtime, a boolean with a bit pattern of 10 is equivalent to a bit pattern of 01, they represent the same state.

tannergooding commented 6 years ago

I'm also not sure it will be clear what a "non-standard" bool is without clearly defining the underlying bit representation of true and false. Otherwise, the exact definition of a "standard" bool is still dependent on the underlying runtime and the implementation of a given C# compiler.

For example, the current Roslyn implementation is generally inline with the runtime definition, except for a couple of weird scenarios with the short-circuit comparisons where it treats it as true has a bit representation with the least significant big set and false has a representation with the least significant bit cleared.

HaloFour commented 6 years ago

So wouldn't this also mean that any C# assembly exposing a bool wouldn't be CLS compliant because of violations of the ECMA spec?

The argument that C# is independent of the CLR is a copout. C# exists in an ecosystem with other languages bound by the common contract that is the CLS/CLR. More than once has the argument been made against various C# proposals given that they would not play well across that ecosystem. It's ludicrous to expect that every other compiler that might want to target the CLR will have to know about this backwards behavior in order for it to produce assemblies that will work as expected.

jaredpar commented 6 years ago

Can the compiler be updated? Given that we are emitting for the CLR, we would make our handling match the expected representations for the two boolean values (0/not 0) of the underlying platform

The compiler is operating as designed here and has been for the entirety of its existence. This issue has been brought up several times throughout the history of the compiler. We considered the arguments but ultimately have decided to continue on with this approach.

This issues should be moved to the C# language repository. The compiler is operating as designed here. If there is a call for further clarity in the spec then csharplang is the right place to drive that.

tannergooding commented 6 years ago

@jaredpar, I had opened it here, rather than csharplang, at the request of @MadsTorgersen.

This issue has been brought up several times throughout the history of the compiler. We considered the arguments but ultimately have decided to continue on with this approach.

It also continues being brought up as a pain point every few months. I believe it would help if the compiler was at least consistent, but I'm not sure you could call it that.

The current behavior can currently be described as:

tannergooding commented 6 years ago

Basically, what I'm seeing is that the C# spec states there are two values, but the implementation is handling it as if there are more.

That is

Instead, it is handling it as 0 and depends on the underlying bit pattern, which I would view as inconsistent with both the C# and the runtime specifications.

gafter commented 6 years ago

@jaredpar I'm reopening this and assigning it to myself to document the compiler behavior.

jaredpar commented 6 years ago

It also continues being brought up as a pain point every few months. I believe it would help if the compiler was at least consistent, but I'm not sure you could call it that.

The job of the compiler is to emit code that will behave correctly in the face of boolean values that are 1 or 0. The behavior of other values is inconsequential.

Instead, it is handling it as 0 and depends on the underlying bit pattern, which I would view as inconsistent with both the C# and the runtime specifications.

Disagree. The compiler correctly handles the cases that it supports: 0 and 1.

tannergooding commented 6 years ago

The compiler correctly handles the cases that it supports: 0 and 1.

This needs to be documented, if not in the language spec, then in a compiler implementation details spec.

In my opinion, The language spec should explicitly state that handling of values with a bit representation different than true or false is implementation defined. It should also either explicitly list the bit representation of true and false or also call out that they are implementation defined, but non-identical, values.

Additionally, it would be a nice to have to have something (probably in a compiler implementation details specification) explicitly stating that:

jaredpar commented 6 years ago

This needs to be documented, if not in the language spec, then in a compiler implementation details spec.

This has been documented several times, some in this various repo. Unfortunately searching for terms like "bool" and "true" is meaningless given our issue size. I'm fine with this being documented as behavior of the compiler. Saying our code gen is inconsistent though is incorrect.

tannergooding commented 6 years ago

Unfortunately searching for terms like "bool" and "true" is meaningless given our issue size.

Would it be reasonable to do the following in the language spec?

Saying our code gen is inconsistent though is incorrect.

I was saying it is inconsistent with the spec, which I still think it is, at least until the spec is updated. The compiler is implementing true (1), false (0), and implementation defined handling (2-255). I would view this as explicitly handling three states rather than 2 (which would have to be implemented as true is x, false is not x, as anything else results in a third state, even if that state is undefined or implementation defined)

gafter commented 6 years ago

Roslyn documentation belongs in the Roslyn repo. I will add some appropriate documentation for this issue in docs/compilers.

Joe4evr commented 6 years ago

The only observable consequence of placing something like that in the specification would be the behavior you get when you overlay a bool and a byte in unsafe code.

You wouldn't even need any unsafe, explicit layout can do that trick just fine already.

gafter commented 6 years ago

@Joe4evr Although LayoutKind.Explicit does not require unsafe, it is unsafe. There is no other way to explain the behavior of explicit layout by reference to the C# language specification. It is too late for the C# compiler to require it only be used in an unsafe context.

GrabYourPitchforks commented 6 years ago

@Joe4evr @gafter I don't think it would make sense to require the unsafe keyword anyway. Explicit layouts are completely verifiable IL as long as you're not overlaying references (type safety violation) or private fields of structs (visibility violation). Blitting simple primitives on top of each other doesn't violate either of these conditions.

gafter commented 6 years ago

@GrabYourPitchforks Verifiable perhaps, but such code can undermine the compiler's implementation invariants.

benaadams commented 6 years ago

ceq says 0 and 1

Compares two values. If they are equal, the integer value 1 (int32) is pushed onto the evaluation stack; otherwise 0 (int32) is pushed onto the evaluation stack.

Though ~0 (i.e. 0xFF) is more useful for bit manipulation; 2-254 are dubious values