Open Qix- opened 8 years ago
Might explain type casting here and how it's handled (e.g. -1i32 as u32). Those conversions are to be well-defined.
// @Polygn a review of this would be super helpful.
@Qix- Okay, this is going to be quite long...
I think the intent of the syntax for defining signed, unsigned, and float is a good idea. This is short, concise, and to the point.
Everything looks good here, I like the Swift-esque syntax here, also as we discussed in #10 the array syntax will definitely help with compiler speeds.
The fact that a boolean is effectively a typedef'd u1 will certainly (as stated) help with bit packing into areas where bit spacing is a concern, rather than packing it as an actual true/false value.
With the other default types, I think the idea is good, having full 64-bit string support would help when making large string blocks like a PGP key or some other long string of text.
As far as numeric literals go, I think the basic notation is great except I would add one thing... Where you have an FLI (Floating Point Integer) less than one (e.g. 0.4543) I think it would be a good idea to include the leading 0 before the separator. This can help a programmer quickly identify the floating number, and I can't see any reason why it would impact compiler speeds or anything else. Food for thought.
Negative values, again, prepending the left hand side 0 would help a programmer identify the less than 0 FLI.
Scientific notation looks good to me, I see no changes that need to be made.
Now this really needs its own section... I honestly, in all my years of programming, have never seen a radix style as you've defined here. I can see where having explicit control over how "wide" the value is in terms of actual bit space could be SUPER helpful, especially in cases where you're controlling some low level I/O or register ops (where bit space is a valuable commodity). Definitely no changes required here.
R/L values: Nothing needs to be changed. Type specifiers: Nothing needs to be changed.
As far as bit widths go? Capping them at 4096 is a good idea haha, considering (as you said) anything higher would be superfluous for modern-day computers. But of course demanding >0 bits.
As far as the actual syntax goes? I'd do something more in the realm of this:
1024:u32
1.2424:f16
0.989:f32
0xDEAD:u16
Doing this would provide more readability rather than jumbling it together, not to mention it will be easier for the compiler to parse and then tokenize.
Good idea on the builtins, as far as true/false goes.. It's interesting because I've never seen a language that implements true/false as an L-value. Could prove to be quite interesting!
Holy crap yes, no more assertions of actual int widths please! This would help improve workflow so much!
I never thought of using a tetris like method of packing data in the structs but, I like the idea as this could help find open spaces in memory and fill them quite easily. Allowing the programmer to write the optimization themselves of course will need a level of know how, but of course this language is more for the "savvy" programmer. With this kind of optimization (as you said) with uneven types packing memory spaces with different bit widths, allowing full control over the amount of bits used (quickly) will be very nice.
So, what I see you've done is basically make casting automatic thus telling me you have automatic type deduction. As far as how this property will affect compiler time? I can't say, I would assume it would be quick though.
In your example of:
foo i32 = -13
bar u32 = foo as u32
This syntax could prove to be helpful, defining foo first as an i32, then casting to u32 using as? Yeah I like the idea of this, especially since you could easily cast on the fly without losing or gaining any bit space (by keeping both sizes 32 bits).
So in closing? Out of all of this? Maybe two things need rethinking by a little, not much though. I really, really like the radix notation you've thought up, that is something I would really find handy; including the auto casting and explicit type-casting!
I like the literal format idea. With the suffixes in particular, I was relying a bit on modern editors having syntax highlighting to help do this (ViM can handle Arua numeric literals quite well) but there's absolutely no denying using a colon is much, much more readable. Plus it'll remove a constraint I've not been ecstatic about that radixes with numeric characters (anything > base 10) must consist only of numbers and upper case letters. Having a non-alphanumeric break in the number will remove that silly constraint and indeed improve readability.
Nice catch, will definitely change.
Just a note for anyone else reading this:
Capping them at 4096 is a good idea
4096 is a completely arbitrary number. LLVM supports generating machine code that works with really, really big widths without using data structures. However, LLVM itself has capped the width at 8,388,607 bits.
Through testing for myself, bigger widths are not incredibly efficient (implementations like BigInteger are going to be much more performant) so anything above 4096 is mis-using the bit widths feature that was intended in Arua (LLVM has no reason to cap it because it's not its place to be opinionated) and thus going to cause a lot of overhead in your program.
As far as requiring the leading 0. in a floating point integer, the reason why I was hesitant about enforcing that is for calculations like this:
foo f32 = 4 * .8 / .9 + (.155 * 5.5)
versus the alternative
foo f32 = 4 * 0.8 / 0.9 + (0.155 * 5.5)
They take up more space, but I completely agree that they're more readable. This heavily relies on #17, but I think there's a good case for it.
As far as perks in optimizations go, that'd be up to the compiler mostly unless users specifically needed a C-style packed struct for compatibility (or to ensure the format of a struct between two compilations are the same). Otherwise, the user doesn't care about the struct's member layout and thus the compiler optimizes accordingly.
A note on conversions: you can certainly do
foo u32 = -5:i32
but the outcome is going to be different than
foo u32 = -5:i32 as u32
The first example simply changes types, not the value itself. The second does the equivalent of foo = abs(-5). The as operator performs logic and potentially changes the value, whereas implicit conversion (the first example) changes just the type.
The first example has more potential for throwing an error since valid conversions depend on widths, whereas using as will compile almost always.
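To make the distinction concrete, here is a rough Python sketch of the two behaviours described above. It is an illustration only, not Arua's implementation. The helper names reinterpret_as_unsigned and cast_as_unsigned are hypothetical. It assumes that "changes just the type" means the two's-complement bit pattern is kept and simply read back as unsigned, and that a signed-to-unsigned as behaves like abs(), as stated above.

```python
def _check_signed_range(value: int, bits: int) -> None:
    """Reject values that do not fit in a signed integer of the given width."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError(f"{value} does not fit in i{bits}")


def reinterpret_as_unsigned(value: int, bits: int) -> int:
    """Assumed model of plain assignment: same bits, read back as unsigned."""
    _check_signed_range(value, bits)
    return value & ((1 << bits) - 1)


def cast_as_unsigned(value: int, bits: int) -> int:
    """Assumed model of `as`: performs logic on the value (abs, per the note above)."""
    _check_signed_range(value, bits)
    return abs(value)


if __name__ == "__main__":
    print(reinterpret_as_unsigned(-5, 32))  # bit pattern of -5 read as u32
    print(cast_as_unsigned(-5, 32))         # value after the abs-style cast
```

Under these assumptions the first call yields 4294967291 (the bit pattern 0xFFFFFFFB) and the second yields 5, which is why the two forms are not interchangeable.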
I've updated the original to reflect the change in suffix (15:u32) syntax.
Numeric literals in Arua take a unique approach in terms of showing intent.
"Intent"
Arua aims to show intent. "Intent" can mean a few things, but with numerics (as well as all primitive types) we aim to convey how the number is to be used.
Some examples of the intent of numbers:
The point of intent is to represent these values as close to their intended size as possible.
Primitives Refresher
All types in Arua can be boiled down to a single numeric type, or a collection of numeric types. There are three primitive types:
u - unsigned integer
i - integer
f - floating point number
As well, each of the primitive types comes in a few collective or descriptive states:
[T] (#10)
!T (#7)
T? (#7)
(T,) (#9)
As well, these types can be typedef'd (#3) to create new types.
Decay of Common Types
In common languages, types such as boolean exist to express a single binary value. In Arua, it's simply u1. The type u1 shows immediate intent, and has the added bonus of being easily packed and optimized if used within structures.
Another common type is string. Arua has native Unicode support (#11) and exposes such functionality through typedefs of [u8], [u16], [u32], and [u64] as str8 (aliased to str), str16, str32 and str64 respectively. This has the added bonus of allowing functions that take arrays of these types (or of any type) to also take strings, and allows the ability to index them using the subscript operator.
There are three representations of a numeric literal:
Basic (12345.34)
Scientific (144.3e76)
Radix (0xDEADBEEF)
Basic Notation
Basic notation is your simple notation. It supports both integers and floats in the following formats:
1234
1234.567
.1234
Negative values are prefixed with a -:
-1234
-1234.567
-.1234
Scientific Notation
Scientific notation is similar to simple notation, but allows for either base-2 or base-10 exponents to be specified:
1234e15 = 1234 * 10 ^ 15
1234b24 = 1234 * 2 ^ 24
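As a rough illustration of the two exponent markers, the following Python sketch evaluates a scientific-notation literal exactly. It is not the Arua parser. The function name parse_scientific is hypothetical, and it assumes that an e exponent means "times 10 to that power", that a b exponent means "times 2 to that power", and that negative exponents are allowed (the text above does not say either way).

```python
import re
from fractions import Fraction

# mantissa (integer or decimal, optional leading '-'), marker, integer exponent
_SCIENTIFIC = re.compile(r"^(-?(?:\d+(?:\.\d+)?|\.\d+))([eb])(-?\d+)$")


def parse_scientific(text: str) -> Fraction:
    """Evaluate e.g. '1234e15' or '1234b24' as an exact rational number."""
    match = _SCIENTIFIC.match(text)
    if match is None:
        raise ValueError(f"not a scientific-notation literal: {text!r}")
    mantissa, marker, exponent = match.groups()
    base = 10 if marker == "e" else 2  # 'e' = base-10 exponent, 'b' = base-2
    return Fraction(mantissa) * Fraction(base) ** int(exponent)


if __name__ == "__main__":
    print(parse_scientific("1234e15"))
    print(parse_scientific("1234b24"))
```

Using Fraction keeps the result exact, so the choice of target type (integer or float, and its width) can be made afterwards.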
Radix Notation
Radix notation expands upon the classic hexadecimal notation to allow for any base up to 36 ([0-9][A-Z]) to be used in place of the 0x. 0x is still treated as 16x.
0xAA / 16xAA = 170
1x0000 = 4 (unary/tally system)
2x0110 = 6 (binary)
5x4311 = 581
8x666 = 438
10x123 = 123
20xAG33FB0 = 691710220
36xYZX1 = 1632853
All radixes with character domains containing letters (hexadecimal, etc.) require that such letters are uppercase. This is to disambiguate literal format specifiers (below).
Radix numbers cannot be negative; however, since signed numbers are two's complement they can be represented as negative by ensuring the first bit is set to 1 and the type specifier (below) is i.
Literal Format
Each numeric literal has an optional format suffix it can supply. In the event a format is not specified, one of two things occurs:
Literal formats consist of a type specifier character and a bit width.
The type specifiers are as follows:
i - signed integer
u - unsigned integer
f - floating point number
As of now, preliminary concept implementations of numeric literals cap bit widths at 4096, as anything beyond that is simply absurd for classical computers (as opposed to, say, quantum computers). Bit widths must be greater than 0.
Floating point width specifiers must be one of 16, 32, 64, or 128.
Literal values and their suffixes are separated by a colon (:).
Some example numbers with their format suffixes:
1234:u16
0xDEADBEEF:u32
36xZY:u64
0.1:f128
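The radix and suffix rules can be sketched together in a few lines of Python. This is a hedged illustration rather than the actual Arua lexer. The name parse_radix_literal is hypothetical, only the i and u specifiers are handled (f suffixes are left out), and it assumes that a literal too wide for its declared type is an error and that a set top bit under an i suffix is read as a two's-complement negative.

```python
import re

_DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
_LITERAL = re.compile(r"^(\d+)x([0-9A-Z]+)(?::([iu])(\d+))?$")
MAX_WIDTH = 4096  # cap stated in the RFC text


def parse_radix_literal(text):
    """Return (value, specifier, width); the last two are None without a suffix."""
    match = _LITERAL.match(text)
    if match is None:
        raise ValueError(f"not a radix literal: {text!r}")
    base_text, digits, kind, width_text = match.groups()
    base = 16 if base_text == "0" else int(base_text)  # 0x is treated as 16x
    if not 1 <= base <= 36:
        raise ValueError(f"unsupported base {base}")
    if base == 1:
        # Unary/tally: only the digit 0 exists, the value is the digit count.
        if set(digits) != {"0"}:
            raise ValueError("unary literals may only contain 0")
        value = len(digits)
    else:
        value = 0
        for ch in digits:
            digit = _DIGITS.index(ch)
            if digit >= base:
                raise ValueError(f"digit {ch!r} is out of range for base {base}")
            value = value * base + digit
    if kind is None:
        return value, None, None
    width = int(width_text)
    if not 0 < width <= MAX_WIDTH:
        raise ValueError(f"bit width {width} is outside 1..{MAX_WIDTH}")
    if value >= 1 << width:
        raise ValueError(f"{text!r} does not fit in {width} bits")
    if kind == "i" and value >= 1 << (width - 1):
        value -= 1 << width  # top bit set: read as two's-complement negative
    return value, kind, width


if __name__ == "__main__":
    for literal in ("0xAA", "36xYZX1", "0xDEADBEEF:u32", "0xFF:i8"):
        print(literal, parse_radix_literal(literal))
```

With these assumptions 0xFF:i8 comes out as -1 while 0xFF:u8 stays 255, which matches the note that radix literals become negative only through the type specifier.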
Builtins
Currently, there are two builtins: true and false.
const true u1 = 1:u1
const false u1 = 0:u1
Perks in Semantics
At first, the advantages of such extensive notations and width specification may not be clear. However, bitwise operations benefit greatly from such flexibility:
AruaDoc comment RFC #13
Unlike C-family languages, no longer do you have to guess or assert how big an integer is. Just use it how you need to and let the compiler optimize for you.
Perks in Optimization
As well as semantic benefits, when numeric types are clustered together (e.g. in structs), we can do some pretty extensive "tetris"-like packing optimizations for data that won't be persisted. It also gives us flexibility to optimize for size, or for speed, since we can perform some tricky alignment strategies or generate bitwise instructions in order to access those properties.
Since we perform these optimizations ourselves, we can then begin to generate C-family struct source code with bit-fields or other alignment optimizations in place to create compatible data structures with the same identifiers given to the properties to be compiled into existing code bases, allowing very flexible protocol implementations to be built for example.
Optimizations can also occur on systems with uncommon word sizes or systems that might provide better alignment strategies.
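One way to picture the "tetris"-like packing is a first-fit-decreasing pass over the field widths. The Python sketch below is an assumption-laden illustration, not the strategy an Arua compiler actually uses. The name pack_fields, the field names, and the 32-bit word size are all hypothetical, and it ignores alignment and access-speed trade-offs entirely.

```python
def pack_fields(fields, word_bits=64):
    """Place bit fields into words, widest first, using first fit.

    fields maps a field name to its bit width.
    Returns a dict mapping each name to (word_index, bit_offset).
    """
    used = []    # bits already occupied in each word
    layout = {}
    # Widest fields first; sorted() is stable, so ties keep declaration order.
    for name, width in sorted(fields.items(), key=lambda item: -item[1]):
        if not 0 < width <= word_bits:
            raise ValueError(f"{name}: width {width} does not fit a {word_bits}-bit word")
        for index, occupied in enumerate(used):
            if occupied + width <= word_bits:
                layout[name] = (index, occupied)
                used[index] += width
                break
        else:
            # No existing word has room: open a new one.
            layout[name] = (len(used), 0)
            used.append(width)
    return layout


if __name__ == "__main__":
    struct_fields = {"flag": 1, "id": 32, "kind": 3, "count": 28}
    print(pack_fields(struct_fields, word_bits=32))
```

For this example the four fields fit in two 32-bit words, whereas laying them out strictly in declaration order would need three, since id cannot share a word with flag.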
Bounds and Defined Behavior
Unlike C, integer overflow and conversion are well defined.
Conversion
The golden rule is to remember that type casting performs logic; assignment does not. Below are some examples and their C equivalents.
Signed to Unsigned (assignment):
Signed to Unsigned (typecast):
Signed to Signed narrowing (assignment):
Signed to Signed narrowing (typecast):