Bit vector type - Githubissues

AtomCrafty commented 2 years ago

A quick prelude to explain the motivation behind this: I was just going through some of the open issues again and #52 in combination with this comment made me think about the actual semantics of a ranged index operation, and by extension the concatenation operator and the way we represent untyped data.

The Issues

In various scenarios we have to deal with information where we know the size, but not the intended interpretation. For example general purpose registers can hold both integer and floating point values. As it stands we use the type unsigned<?> in these cases (see the RISCV reference implementation), and going by the wiki, concatenation (and hence the ranged array access) is even explicitly specified to always result in an unsigned integer. I see some issues with this representation and would like to propose an alternative.

Let's start with the obvious: the data is not necessarily an integer, so it makes no sense to treat it as such. Doing so means we can perform e.g. addition or multiplication on values where these operations have no meaningful interpretation (like a flags register), or - even worse - they do have a meaningful interpretation, but are applied with incorrect semantics. Imagine adding two general purpose registers that are meant to contain floats, but the addition uses unsigned integer semantics instead. Of course this would be an error by the implementor, but one that has been facilitated by unfortunate decisions by the people declaring that register as an unsigned integer in the first place (and therefore by the language design that didn't provide a suitable alternative).

As a matter of fact, CoreDSL currently doesn't even provide an equivalent to C++'s reinterpret_cast, so once a value is labeled as one type, it is nontrivial to reinterpret its bit pattern in a different way (it's still possible using unions, but in a very roundabout way).

Of course we currently don't support floats, so the above examples might not seem that relevant at present time, but this is a fundamental design choice that will be very difficult to course-correct if we don't tackle it now, as implementors will start using unsigned<XLEN>.

The Solution

I have considered multiple approaches to solve this issue. The first one was to use unions to allow a register to hold values of different types. This however requires all possible stored types to be listed explicitly in the base ISA, which generally wouldn't know about all of them. Imagine the base ISA declared union RegisterValue { unsigned<XLEN> intVal; float floatVal; }, but a derived ISA wants to store doubles as well. Also unions can't be used to solve the concatenation conundrum.

So after some deliberation, I have arrived at the conclusion that the best way forward is to introduce an explicit "untyped" type that can be explicitly converted to all other types of the same bit width. A possible name could be bitvector<?>, or maybe bits<?> to save some characters.

In a lot of past discussions, the notion of "reduce implicit assumptions" and "let the implementor make it explicit" have been prevalent, and these apply here as well, since the implementor would have to explicitly cast the register value to the expected type on each read. Arithmetic operations like addition would not be defined on bit vectors, so implementors would need to explicitly state whether they require integer or float semantics. We can however still allow bitwise operations on them. And as an added bonus, reinterpret casts would just require a double cast of the pattern (target_type)(bits)source_expr.

Specification

So what would need to change in the specification?

Introduce a new bits<?> type and add the corresponding type specifier to the grammar.
Allow explicit casts from bits<?> to all types of exactly the same width. No sign or zero extensions.
Allow implicit and explicit casts from every (non-void) type to a bit vector of the same length.
- Also allow a special (bits) cast that works like (signed) and (unsigned), in that it preserves the width without the need to explicitly state it.
Only bitwise operations are allowed on bit vectors (not, and, or, xor, shift, (in-)equality, bit access, bit range, concatenation)
Concatenation expressions are now defined as bits<m> × bits<n> → bits<m+n>.
- Operands are implicitly cast, result is always untyped.
Range accesses therefore also result in an untyped value.

jopperm commented 2 years ago

Intriguing proposal, thanks for writing this up! I'm afraid I won't have time to fully dive into this until next week, so just two quick initial thoughts:

could you provide some code samples based on the assumption that the canonical RISC-V architectural state elements (and instruction fields) are declared as bits<n>?
would you allow bit-operations only on the signless types?

AtomCrafty commented 2 years ago

I don't understand the second question. Bit vectors wouldn't have a notion of signedness. Are you talking about disallowing bitwise operations on the existing signed<?> type? That seems out of scope for this proposal.

jopperm commented 2 years ago

By signless, I mean the proposed bits<n> types, which are neither signed nor unsigned. My question was whether you would disallow bitwise ops on signed and unsigned values, because I feel that the result of unsigned<4> & signed<17> is also not really obvious, but yeah, seems like a separate issue.

AtomCrafty commented 2 years ago

Yeah I'll need to think about that before I can answer the question.

AtomCrafty commented 2 years ago

Alright, I went through the RV32I description and migrated it to bit vectors. You can view the diff here. As you can see, the only changes are a bazillion (unsigned) casts whenever instruction operands are accessed.

Now, I agree this looks horrible, but I have a solution for that as well. I will however open a separate issue for that, since it is a fairly significant change.

Other than that, I'm quite happy because this exercise actually proved my point. There are a couple of instructions that access the operand imm both via a sign cast and without one (for example here). It is exactly this type of seemingly innocent mistake that would be avoided by making the types explicit. Now granted, in these instances it doesn't really make a difference because we're only checking whether the result of the modulo is zero, but the actual intermediate result depends on whether imm is signed or unsigned.

eyck commented 2 years ago

I'm not in favor of this data type as well as the operands (#69) extension. First it bloats the language. Secondly: one of the early design decisions was: everything is an integer. Even HDLs define arithemtic operations on bitvectors interpreted as integers. The float argument does not apply: there is no float within a processor. You always deal with some representation lieke IEEE753, Unum, bfloat, and many more. Those are always a set of integers describing the various parts of a real number. So basically this ends at a struct.

jopperm commented 2 years ago

On second thought, I also have doubts that the extra complexity of distinguishing bits and unsigned really pays off here. I agree that using unsigned as the default type even for otherwise untyped data is a bit ugly, but on the other hand the operations prohibited on bits all work exactly the same if you apply them to unsigned-casted bits-values.

So the question is, does your discomfort with the current solution boil down to a slight misnomer/reuse of the existing keyword, or do you think we need three different base types in CoreDSL?

AtomCrafty commented 2 years ago

I don't see it as three base types. There is only one single type from a hardware perspective, that being bits. unsigned, signed and well as all struct types and potential future float representations - should we ever implement native support for those - are just various interpretations of bits that determine which operations are valid on those bits. My issue is that we treat an arbitrary one of these - namely unsigned - as somehow being the "true" interpretation.

All of this is purely a frontend question and the backend shouldn't even need to know what the concept of signedness means. For the backend, everything is a bit vector and it is the operation itself that determines the semantics. So rather than performing a "less than" operation on two signed values, the backend should be performing a "signed less than" operation on two bit vectors.

So in essence, I believe that - in the context of CoreDSL - types are purely a language-level concept and hence are all reduced to bits by the frontend. A lot of distinct language-level operations can be reduced to simple bit swizzling. For example sign extension is just "take the first bit n times, then the rest of the bits" or a struct field access is just "take this bit range and discard the rest".

Even HDLs define arithemtic operations on bitvectors interpreted as integers.

This is exactly what I am talking about. The backend operates at the same abstraction level as HDLs, so it performs operations on bit vectors. Whether a less-than is signed or unsigned does not depend on the types, but on the operation itself. Meanwhile CoreDSL is a much more abstract language, in which the notion of "adding two untyped bit vectors" makes no sense and should therefore not be allowed.

AtomCrafty commented 2 years ago

This should illustrate what I mean:

CoreDSL Lowering

AtomCrafty commented 2 years ago

In today's meeting, @jopperm convinced me that this probably doesn't add enough value to justify the effort. I am still in favor of the explicit operand declarations though.

Minres / CoreDSL

Bit vector type #62

The Issues

The Solution

Specification