RFC: Numeric Literals

Qix- commented 8 years ago

Numeric literals in Arua take a unique approach in terms of showing intent.

"Intent"

Arua aims to show intent. "Intent" can mean a few things, but with numerics (as well as all primitive types) we aim to convey how the number is to be used.

Some examples of the intent of numbers:

A boolean is either true or false. It fits into 1 bit.
A TCP or UDP port is anywhere from 0 to 65535. It fits into 16 bits.
A basic terminal color is one of 8 values. It fits into 3 bits.
A git commit lookup starts with the first byte in the SHA digest. It fits into 8 bits.

The point of intent is to represent these values as close to their intended size as possible.
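
As a rough illustration of the sizes listed above, the following Python sketch computes the narrowest unsigned width for a given number of distinct values. The helper names `bits_for_count` and `arua_type` are made up for this example and are not part of Arua.

```python
def bits_for_count(n_values: int) -> int:
    """Smallest bit width that can distinguish n_values distinct values."""
    if n_values < 1:
        raise ValueError("need at least one value")
    # Values are numbered 0 .. n_values - 1, so the width is the bit length
    # of the largest one. Widths must be greater than 0, hence the max().
    return max(1, (n_values - 1).bit_length())


def arua_type(n_values: int) -> str:
    """Name of the unsigned type of that width, e.g. 'u16'."""
    return f"u{bits_for_count(n_values)}"


print(arua_type(2))      # boolean            -> u1
print(arua_type(65536))  # port number        -> u16
print(arua_type(8))      # basic term color   -> u3
print(arua_type(256))    # first digest byte  -> u8
```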

Primitives Refresher

All types in Arua can be boiled down to a single numeric type, or a collection of numeric types. There are three primitive types:

unsigned integer
signed integer
floating point number

Each of the primitive types also comes in a few collective or descriptive states:

array ([T]) <#10>
mutable (!T) <#7>
optional (T?) <#7>
tuple ((T,)) <#9>

These types can also be typedef'd (#3) to create new types.

Decay of Common Types

In common languages, types such as boolean exist to express a single binary value. In Arua, it's simply u1. The type u1 shows immediate intent, and has the added bonus of being easily packed and optimized if used within structures.

Another common type is string. Arua has native unicode support (#11) and exposes such functionality through typedefs of [u8], [u16], [u32], and [u64] as str8 (aliased to str), str16, str32 and str64 respectively. This has the added bonus of allowing functions that take arrays of these types (or of any type) to also take strings, and allows the ability to index them using the subscript operator.
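
To make the "strings are just arrays" idea concrete, here is a small Python sketch that turns text into a list of fixed-width code units and indexes it like any other array. It assumes str8, str16 and str32 hold UTF-8, UTF-16 and UTF-32 code units respectively; the RFC does not state the encodings, and `code_units` is a hypothetical helper.

```python
def code_units(text: str, width: int) -> list:
    """Encode text as a list of unsigned code units of the given bit width."""
    # Assumed mapping from unit width to encoding (not specified by the RFC).
    encodings = {8: "utf-8", 16: "utf-16-le", 32: "utf-32-le"}
    if width not in encodings:
        raise ValueError(f"unsupported code unit width: {width}")
    raw = text.encode(encodings[width])
    step = width // 8
    return [
        int.from_bytes(raw[i:i + step], "little")
        for i in range(0, len(raw), step)
    ]


units8 = code_units("hé", 8)    # behaves like a [u8]
units16 = code_units("hé", 16)  # behaves like a [u16]
print(units8[0], len(units8), len(units16))
```

Any function written against a list of integers works on these values unchanged, which is the property the paragraph above describes.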

See #3 for a better explanation of typedef and alias.

Literal Notation

There are three representations of a numeric literal:

Basic (12345.34)
Scientific (144.3e76)
Radix (0xDEADBEEF)

Basic Notation

Basic notation is your simple notation. It supports both integers and floats in the following formats:

1234
1234.567
.1234

Negative values are prefixed with a -:

-1234
-1234.567
-.1234
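
A minimal recognizer for the forms listed above, written in Python as a sketch. The name `parse_basic` is hypothetical, and rejecting a trailing dot ("1234.") is an assumption based on that form not appearing in the lists.

```python
import re
from fractions import Fraction

# 1234, 1234.567 and .1234, each with an optional leading minus sign.
_BASIC = re.compile(r"-?(?:\d+(?:\.\d+)?|\.\d+)")


def parse_basic(literal: str) -> Fraction:
    """Return the exact value of a basic-notation literal."""
    if _BASIC.fullmatch(literal) is None:
        raise ValueError(f"not a basic literal: {literal!r}")
    # Fraction keeps the decimal value exact, avoiding float rounding.
    return Fraction(literal)


for text in ["1234", "1234.567", ".1234", "-1234", "-1234.567", "-.1234"]:
    print(text, "=", parse_basic(text))
```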

Scientific Notation

Scientific notation is similar to basic notation, but allows for either base-2 or base-10 exponents to be specified:

1234e15 - 1234 * 10 ^ 15
1234b24 - 1234 * 2 ^ 24
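
The two exponent markers can be sketched in Python as follows, reading e as a power of ten and b as a power of two, in line with the base-10 and base-2 exponents described above. `parse_scientific` is a hypothetical name, and allowing only non-negative exponents is an assumption.

```python
import re
from fractions import Fraction

# mantissa in basic notation, then 'e' (base 10) or 'b' (base 2), then exponent
_SCI = re.compile(r"(-?(?:\d+(?:\.\d+)?|\.\d+))([eb])(\d+)")


def parse_scientific(literal: str) -> Fraction:
    """Return mantissa * base ** exponent as an exact value."""
    match = _SCI.fullmatch(literal)
    if match is None:
        raise ValueError(f"not a scientific literal: {literal!r}")
    mantissa, marker, exponent = match.groups()
    base = 10 if marker == "e" else 2
    return Fraction(mantissa) * base ** int(exponent)


print(parse_scientific("1234e15") == 1234 * 10 ** 15)
print(parse_scientific("1234b24") == 1234 * 2 ** 24)
```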

Radix Notation

Radix notation expands upon the classic hexadecimal notation by allowing any base up to 36 (digits [0-9][A-Z]) to be written in place of the 0 in 0x. 0x is still treated as 16x.

0xAA / 16xAA = 170
1x0000 = 4 (unary/tally system)
2x0110 = 6 (binary)
5x4311 = 581
8x666 = 438
10x123 = 123
20xAG33FB0 = 691710220
36xYZX1 = 1632853
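
The examples above can be checked with a short Python sketch. `parse_radix` is a hypothetical helper; treating base 1 as a tally of 0 characters follows the 1x0000 example, and accepting lowercase digits is an assumption.

```python
import string

# Digit characters in value order: 0-9 are 0..9, A-Z are 10..35.
DIGITS = string.digits + string.ascii_uppercase


def parse_radix(literal: str) -> int:
    """Parse '<base>x<digits>' where base is 1..36 and a base of 0 means 16."""
    prefix, sep, digits = literal.partition("x")
    if not sep or not digits:
        raise ValueError(f"not a radix literal: {literal!r}")
    base = 16 if prefix == "0" else int(prefix)
    if not 1 <= base <= 36:
        raise ValueError(f"unsupported base: {base}")
    if base == 1:
        # Unary: the value is simply the number of tally marks.
        if set(digits) != {"0"}:
            raise ValueError("unary literals may only contain 0")
        return len(digits)
    value = 0
    for ch in digits.upper():
        digit = DIGITS.index(ch)  # raises ValueError for unknown characters
        if digit >= base:
            raise ValueError(f"digit {ch!r} is out of range for base {base}")
        value = value * base + digit
    return value


print(parse_radix("0xAA"), parse_radix("5x4311"), parse_radix("36xYZX1"))
```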

~~All radixes spaces with character domains containing letters (hexadecimal, etc.) require that such letters are uppercase. This is to disambiguate literal format specifiers (below).~~

Radix numbers cannot be negative; however, since signed numbers use two's complement, a radix literal can represent a negative value by ensuring the first bit is set to 1 and the type specifier (below) is i.
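
As a sketch of that rule, the Python function below reads an unsigned bit pattern as a two's complement signed value of a given width. `as_signed` is a hypothetical name used only for illustration.

```python
def as_signed(bits: int, width: int) -> int:
    """Interpret an unsigned bit pattern as a two's complement signed integer."""
    if width < 1:
        raise ValueError("width must be greater than 0")
    if not 0 <= bits < (1 << width):
        raise ValueError("bit pattern does not fit in the given width")
    sign_bit = 1 << (width - 1)
    # If the top bit is set, the value wraps around to the negative range.
    return bits - (1 << width) if bits & sign_bit else bits


print(as_signed(0xFFFFFFF1, 32))  # top bit set, reads as a negative i32
print(as_signed(0x7FFFFFFF, 32))  # top bit clear, stays positive
```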

Literal Format

Each numeric literal has an optional format suffix it can supply. In the event a format is not specified, one of two things occurs:

R-values assume the type and width of the L-value
L-values cause an error (rare cases where this actually happens)

Literal formats consist of a type specifier character and a bit width.

The type specifiers are as follows:

i - signed integer
u - unsigned integer
f - floating point number

As of now, preliminary concept implementations of numeric literals cap bit widths at 4096, as anything beyond that is simply absurd for classical computers (as opposed to, say, quantum computers). Bit widths must be greater than 0.

Floating point width specifiers must be one of 16, 32, 64, or 128.

Literal values and their suffixes are separated by a colon (:).

Some example numbers with their format suffixes:

1234:u16
166.9:f64
0xDEADBEEF:u32
36xZY:u64
0.1:f128
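
The suffix rules in this section can be collected into one small validator. This Python sketch is only an illustration: `split_literal` is a hypothetical name, and it checks the suffix without interpreting the value part.

```python
import re

_SUFFIX = re.compile(r"([iuf])(\d+)")
MAX_WIDTH = 4096                  # cap used by the preliminary implementations
FLOAT_WIDTHS = {16, 32, 64, 128}  # the only widths allowed for f


def split_literal(literal: str):
    """Split 'value:suffix' into (value, (kind, width)), or (value, None)."""
    value, sep, suffix = literal.rpartition(":")
    if not sep:
        # No suffix: the type is taken from the L-value instead.
        return literal, None
    match = _SUFFIX.fullmatch(suffix)
    if match is None:
        raise ValueError(f"malformed format suffix: {suffix!r}")
    kind, width = match.group(1), int(match.group(2))
    if not 1 <= width <= MAX_WIDTH:
        raise ValueError(f"bit width out of range: {width}")
    if kind == "f" and width not in FLOAT_WIDTHS:
        raise ValueError(f"unsupported float width: {width}")
    return value, (kind, width)


for text in ["1234:u16", "166.9:f64", "0xDEADBEEF:u32", "36xZY:u64", "0.1:f128"]:
    print(split_literal(text))
```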

Builtins

Currently, there are two builtins: true and false.

const true u1 = 1:u1
const false u1 = 0:u1

Perks in Semantics

At first, the advantages of such extensive notations and width specification may not be clear. However, bitwise operations benefit greatly from such flexibility:

## Allow writes from 
fn allowWrites(mode u16)
    return mode | 8x222:u9
    # -- or ---
    return mode | 2x010010010:u9

^{AruaDoc comment RFC #13}

Unlike C-family languages, no longer do you have to guess or assert how big an integer is. Just use it how you need to and let the compiler optimize for you.
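
For readers who want to check the mask in the example above, this Python sketch confirms that the octal and binary spellings are the same 9-bit value and applies it to a 16-bit mode. `allow_writes` mirrors the Arua function; the final mask to 16 bits stands in for the u16 type and is an assumption about how the result is stored.

```python
WRITE_BITS = 0o222  # same value as 8x222:u9 in the Arua example

# The binary spelling from the example is the same bit pattern.
assert WRITE_BITS == 0b010010010


def allow_writes(mode: int) -> int:
    """Set the write bit in all three permission groups of a 16-bit mode."""
    return (mode | WRITE_BITS) & 0xFFFF


print(oct(allow_writes(0o444)))  # read-only for everyone becomes read-write
```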

Perks in Optimization

Some of these points are better described in the Bit-Field RFC at #6 (https://github.com/arua-lang/proposal/issues/6#issuecomment-222622717)

As well as semantic benefits, when numeric types are clustered together (e.g. in structs), we can do some pretty extensive "tetris"-like packing optimizations for data that won't be persisted. It also gives us flexibility to optimize for size, or for speed, since we can perform some tricky alignment strategies or generate bitwise instructions in order to access those properties.
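
One simple way to picture the "tetris" packing is a first-fit-decreasing pass over the field widths. The Python sketch below is only an illustration of that general heuristic, not the strategy the Arua compiler uses; `pack_fields`, the field names and the 64-bit word size are all assumptions.

```python
def pack_fields(fields: dict, word_bits: int = 64) -> list:
    """Place bit-width fields into fixed-size words, widest first.

    Returns a list of words; each word is a list of (name, offset, width).
    """
    words = []  # placements per word
    used = []   # bits already occupied in each word
    # Widest fields first, so small ones can fill the remaining gaps.
    for name, width in sorted(fields.items(), key=lambda item: -item[1]):
        if not 1 <= width <= word_bits:
            raise ValueError(f"{name}: width {width} does not fit in a word")
        for index, occupied in enumerate(used):
            if occupied + width <= word_bits:
                words[index].append((name, occupied, width))
                used[index] += width
                break
        else:
            # No existing word has room, so open a new one.
            words.append([(name, 0, width)])
            used.append(width)
    return words


layout = pack_fields(
    {"is_open": 1, "port": 16, "color": 3, "sha_prefix": 8, "offset": 40}
)
for number, word in enumerate(layout):
    print(number, word)
```

Here 68 bits of fields end up in two 64-bit words, whereas one naturally aligned C member per field would typically take more space.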

Since we perform these optimizations ourselves, we can then generate C-family struct source code, with bit-fields or other alignment optimizations in place, that produces compatible data structures using the same identifiers given to the properties. These can be compiled into existing code bases, allowing very flexible protocol implementations to be built, for example.
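
A sketch of what such generation could look like, in Python. `c_struct` is a hypothetical helper; it emits plain unsigned int bit-fields, assumes unsigned int is 32 bits on the target, and leaves aside the fact that C makes the exact bit-field layout implementation-defined.

```python
def c_struct(name: str, fields: list) -> str:
    """Emit C source for a struct whose members are unsigned bit-fields.

    fields is a list of (identifier, width_in_bits) pairs, in layout order.
    """
    lines = [f"struct {name} {{"]
    for identifier, width in fields:
        if not 1 <= width <= 32:
            # Assumes a 32-bit unsigned int; wider fields would need splitting.
            raise ValueError(f"{identifier}: unsupported width {width}")
        lines.append(f"    unsigned int {identifier} : {width};")
    lines.append("};")
    return "\n".join(lines)


print(c_struct("packet", [("port", 16), ("sha_prefix", 8), ("color", 3), ("is_open", 1)]))
```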

Optimizations can also occur on systems with uncommon word sizes or systems that might provide better alignment strategies.

Bounds and Defined Behavior

Unlike C, integer overflow and conversion are well defined.

Conversion

The golden rule is to remember that type casting performs logic; assignment does not. Below are some examples and their C equivalents.

Signed to Unsigned (assignment):

foo i32 = -15
bar u32 = foo # 4294967281 - bits are unchanged, only the type differs

int foo = -15;
unsigned int bar = *((unsigned int *)&foo); // 4294967281 - bit pattern is preserved, but is now read as an unsigned integer

Signed to Unsigned (typecast):

foo i32 = -15
bar u32 = foo as u32

int foo = -15;
unsigned int bar = abs(foo);
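
The difference between the two signed-to-unsigned examples above can be modelled in Python. This is a sketch with hypothetical function names: assignment is modelled as a reinterpretation of the same 32 bits, and the as cast as abs(), following the C equivalents given.

```python
import struct


def assign_i32_to_u32(value: int) -> int:
    """Assignment: keep the 32-bit pattern, read it back as unsigned."""
    # struct.pack raises struct.error if value does not fit in an i32.
    return struct.unpack("<I", struct.pack("<i", value))[0]


def cast_i32_to_u32(value: int) -> int:
    """Typecast ('as u32'): performs logic, modelled here as abs()."""
    return abs(value)


print(assign_i32_to_u32(-15))  # same bits, new meaning
print(cast_i32_to_u32(-15))    # value changed by the cast
```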

Signed to Signed narrow (assignment):

foo i32 = -15
bar i16 = foo # error - cannot narrow

Signed to Signed narrowing (typecast):

foo i32 = -70000
bar i16 = foo as i16 # -4464 - sign is preserved, and the magnitude is taken modulo 2 ** (width - 1), i.e. 32768 for i16

int foo = -70000;
short bar = (short) foo;
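
The -4464 result can be checked by hand. The Python sketch below puts two readings side by side: keeping the sign and reducing the magnitude modulo 2 ** (width - 1), and plain two's complement wraparound, which is what a (short) cast does on typical C targets. Both function names are made up for illustration. The two rules agree for -70000 but not in general, and the text above does not settle which one Arua adopts.

```python
def narrow_keep_sign(value: int, width: int) -> int:
    """Keep the sign, reduce the magnitude modulo 2 ** (width - 1)."""
    magnitude = abs(value) % (1 << (width - 1))
    return -magnitude if value < 0 else magnitude


def narrow_wraparound(value: int, width: int) -> int:
    """Two's complement truncation to the low `width` bits."""
    half = 1 << (width - 1)
    return (value + half) % (1 << width) - half


print(narrow_keep_sign(-70000, 16), narrow_wraparound(-70000, 16))  # same result
print(narrow_keep_sign(-40000, 16), narrow_wraparound(-40000, 16))  # results differ
```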

Qix- commented 8 years ago

Might explain type casting here and how it's handled (e.g. -1i32 as u32). Those conversions are to be well-defined.

Qix- commented 8 years ago

// @Polygn a review of this would be super helpful.

corbin-r commented 8 years ago

@Qix- Okay, this is going to be quite long...

Primitives

I think the intent of the syntax for defining signed, unsigned, and float, is a good idea. This is short, concise, and to the point.

Type qualifiers

Everything looks good here. I like the Swift-esque syntax, and as we discussed in #10, the array syntax will definitely help with compiler speeds.

Decay of common types

The fact that a boolean is effectively a typedef'd u1 will certainly (as stated) help with bit packing where bit space is a concern, rather than storing it as a full true/false value. The idea behind the other default types is good too: having full 64-bit string support would help when making large string blocks like a PGP key or some other long string of text.

Literal types

As far as numeric literals go, I think the basic notation is great, except I would add one thing... Where you have an FLI (Floating Point Integer) less than one (e.g. 0.4543), I think it would be a good idea to require the leading 0 before the separator. This can help a programmer quickly identify the floating number, and I can't see any reason why it would impact compiler speeds or anything else. Food for thought.

For negative values, again, prepending the 0 on the left-hand side would help a programmer identify the less-than-0 FLI.

Scientific notation looks good to me, I see no changes that need to be made.

Radix notation

Now this really needs its own section... Honestly, in all my years of programming I have never seen a radix style like the one you've defined here. I can see where having explicit control over how "wide" the value is in terms of actual bit space could be SUPER helpful, especially in cases where you're controlling some low-level I/O or register ops (where bit space is a valuable commodity). Definitely no changes required here.

Literal formats

R/L values: Nothing needs to be changed. Type specifiers: Nothing needs to be changed.

As far as bit widths go? Capping them at 4096 is a good idea haha, considering (as you said) anything higher would be superfluous for modern-day computers. But of course demanding >0 bits.

As far as the actual syntax goes? I'd do something more in the realm of this:

1024:u32
1.2424:f16
0.989:f32
0xDEAD:u16

Doing this would provide more readability rather than jumbling it together, not to mention it will be easier for the compiler to parse and then tokenize.

Built-ins

Good idea on the builtins, as far as true/false goes. It's interesting because I've never seen a language that implements true/false as an L-value. Could prove to be quite interesting!

Perks in semantics

Holy crap yes, no more assertions of actual int widths please! This would help improve workflow so much!

Perks in optimization

I never thought of using a Tetris-like method of packing data in structs, but I like the idea, as this could help find open spaces in memory and fill them quite easily. Allowing programmers to write the optimization themselves will of course need a level of know-how, but this language is more for the "savvy" programmer anyway. With this kind of optimization (as you said), where uneven types pack memory with different bit widths, having full control over the number of bits used (quickly) will be very nice.

Conversion

So, what I see you've done is basically make casting automatic thus telling me you have automatic type deduction. As far as how this property will affect compiler time? I can't say, I would assume it would be quick though.

In your example of:

foo i32 = -15
bar u32 = foo as u32

This syntax could prove to be helpful: defining foo first as an i32, then casting to u32 using as? Yeah, I like the idea of this, especially since you could easily cast on the fly without losing or gaining any bit space (by keeping both sizes at 32 bits).

So in closing? Out of all of this? Maybe two things need rethinking by a little, not much though. I really, really like the radix notation you've thought up, that is something I would really find handy; including the auto casting and explicit type-casting!

Qix- commented 8 years ago

I like the literal format idea. With the suffixes in particular, I was relying a bit on modern editors having syntax highlighting to help with this (ViM can handle Arua numeric literals quite well), but there's absolutely no denying that using a colon is much, much more readable. Plus it'll remove a constraint I've not been ecstatic about: that literals in radixes with alphabetic digits (anything > base 10) must consist only of numbers and upper case letters. Having a non-alphanumeric break in the number will remove that silly constraint and indeed improve readability.

Nice catch, will definitely change.

Just a note for anyone else reading this:

> Capping them at 4096 is a good idea

4096 is a completely arbitrary number. LLVM supports generating machine code that works with really, really big widths without using data structures. However, LLVM itself has capped the width at 8,388,607 bits.

From my own testing, bigger widths are not very efficient (implementations like BigInteger are going to be much more performant), so anything above 4096 is misusing the bit-width feature as it was intended in Arua and is going to cause a lot of overhead in your program. LLVM has no reason to cap it that low because it's not its place to be opinionated.

As far as requiring the leading 0. in a floating point integer, the reason why I was hesitant about enforcing that is for calculations like this:

foo f32 = 4 * .8 / .9 + (.155 * 5.5)

versus the alternative

foo f32 = 4 * 0.8 / 0.9 + (0.155 * 5.5)

They take up more space, but I completely agree that they're more readable. This heavily relies on #17, but I think there's a good case for it.

As far as perks in optimizations go, that'd be up to the compiler mostly, unless users specifically needed a C-style packed struct for compatibility (or to ensure the format of a struct is the same between two compilations). Otherwise, the user doesn't care about the struct's member layout and thus the compiler optimizes accordingly.

A note on conversions: you can certainly do

foo u32 = -5:i32

but the outcome is going to be different than

foo u32 = -5:i32 as u32

The first example simply changes types, not the value itself. The second does the equivalent of foo = abs(-5). The as operator performs logic and potentially changes the value, whereas implicit conversion (the first example) changes just the type.

The first example has more potential for throwing an error since valid conversions depend on widths, whereas using as will compile almost always.

Qix- commented 8 years ago

I've updated the original to reflect the change in suffix (15:u32) syntax.

Qix- / arua-meta

RFC: Numeric Literals #14