SubX in SubX: Transforming uses of string literals (C)

akkartik commented 5 years ago

Context: https://github.com/akkartik/mu/pull/23 (just the top-level description)

Plan: https://github.com/akkartik/mu/commit/d4a244268841e8e912c98f4587095b701aa5c292#commitcomment-33558279

To see the failing tests:

$ ./subx translate 0*.subx apps/subx-common.subx apps/dquotes.subx -o apps/dquotes  &&  \
      ./subx run apps/dquotes test

In this PR we only care about making test-skip-string-... tests pass.

Feel free to copy code from parse-string in the same file to make the tests pass.

Contact me if you'd like to contribute. Commit access freely given.

charles-l commented 5 years ago

ok -- working on implementing this.

akkartik commented 5 years ago

Wow, that was quick!

akkartik commented 5 years ago

Wonderful. 3 gone, 4 to go.

I cleaned up the segment header syntax last night (#33). I think I'll start on the final frontier next: survey.subx!

charles-l commented 5 years ago

Ok -- finished the skip-string-in-stream function.

I think I'll start on the final frontier next: survey.subx!

Whoo! Excited to finish this off and start building some higher-level abstractions. Working in assembly can be rewarding and frustrating at the same time... but I think I'm ready to have my if statements back soon :P

akkartik commented 5 years ago

Indeed! I'd love to hear your experiences working with and particularly debugging SubX programs. Is there anything in particular that you've been getting sick of?

I've been thinking for a while about how to add higher-level syntax above SubX, and I don't have any strong ideas yet, so we can try out any ideas you have. Having a bootstrapped substrate is a great foundation for ensuring any HLL we come up with stays hackable. (I just want to make sure people using the high-level language are still having to periodically think about low-level details. That feels necessary to ensure we don't again segregate people building a compiler from people using it.)

charles-l commented 5 years ago

Oh, I have a lot of partially worked out ideas. This is a brain dump of what I've been thinking about recently:

I've been very inspired by the paper "Next-Paradigm Programming Languages: What Will They Look Like and What Changes Will They Bring?." There are two ideas I really like from this paper. The first is the idea of monotonic programming: adding code doesn't change any previous meaning (which fits nicely with the concept of layers). I think a formal specification of the layer interfaces might allow this. The second is that next-generation programming languages will be very declarative while still allowing an escape hatch into lower-level imperative code/custom data structure specification when needed. I've been playing around with logic programming languages and wonder how far a language can be pushed to solve "practical" problems (e.g. games, text editors, web servers). SQL is a big inspiration here, because it's declarative, while still being fast. It's syntax is clugy and it doesn't have an escape hatch, but it's practical. I'd love to see a high-level datalog-like logic language that allows you to safely drop into a lower level language (as low as assembly!) and somehow verify that the code written is correct. Not quite sure how to do this, though... It might just have to be similar to what Rust does by making unsafe code explicit, but still a black box.
Type system should be implemented in a macro for maximum hackability.
Define a new ABI standard. The C ABI is garbage for automatic reflection (why do we still have to parse header files to know what methods exist in a .so?!). Maybe the ELF symbol name for a function could encode type information for parameters, or DWARF symbols can be used somehow? Or there could be a format for specifying a type segment?
Algebraic side effects look interesting, and seem to fit nicely with traces. Traces could be effects that are collected at the top of a program and used for testing/logs etc. Haven't played with them enough yet to know the downsides.

So these are my mad-scientist, crazy ideas for what a next-gen programming language could look like. I'd like to hear your thoughts on them :) -- I realize not all of them will necessarily work with Mu.

Near term items that will be useful for a language right above an assembly level:

Automatic register allocation (maybe enforce SSA or A-normal form). Specifying registers by hand is error prone and repetitive. It's easy to paint yourself into a corner (when you need EAX:EDX for division but store some accumulator in them for example). Just having bindings would be really nice for ergonomics.
If statements are hard to recognize structurally, so those would be nice to have soon too. Loops are fine with gotos IMO. We could use infinite loops with continue/break instead...
Tree-based matching and transformation for building compiler passes. Erlang-like bit-syntax might be enough for now. Structs would be nice as well.
In terms of syntax, I like Lisp's homoiconicity ('cuz macros), but I don't have strong opinions about syntax, either. I think the whitespace-based s-expressions of wart's syntax is nice. I like the way order of operations is handled from a hacker perspective, but I'm not sure that would appeal to most folks. Doing something forth-like for stack-based syntax might be easier to implement early on, though...

akkartik commented 5 years ago

Interesting, I'll check out that paper. Having been in academia before, I tend to be skeptical of such grand titles :) But Smaragdakis is a good researcher.

We should probably continue bottom-up and build simpler mechanisms before more complex ones. So these ideas of yours seem worth focusing on for now:

Type system. It's hard to do a type system with just macros because macros are intrinsically local rewrites, whereas type-checking requires a global view of the program. But this is my top priority as well. In particular, my biggest pain point (now that we have string literals) is that we use raw offsets to access fields in structs. I'd like to introduce some syntax for defining struct members and their types so that we can refer to field names and have them automatically translated to the right offset.
Control flow. While I agree we can do fine with gotos, I'm actually quite pleased with my design for structured programming in the previous Mu prototype. Have you looked at it yet? It's in the top-level directory of this repo (search for 'conditional'). Lower priority, but it gives me the warm fuzzies to only use the word goto to signal something out of the ordinary :)
Register allocation. I actually kinda like us doing manual register allocation in a low-level language. Trying to automate it risks ending up with a 'thick' compiler, so I'd like to explore doing without. What I want though is for the compiler to check my allocation, and raise the alarm on bugs like this. That kinda goes with the existing design of SubX, emphasizing error-checking over syntax.

C used to have a register keyword, and it was never really very useful, since C also has automatic register allocation. I'd like instead to try a scheme where the compiler never assigns variables to registers by default, but the programmer can bind a variable to a register when defining it. You can have multiple variables sharing the same register, and you'd be responsible for making sure their lifetimes don't overlap. The compiler would be responsible for ensuring that you never write one variable to a register and then read a different variable from the same register, along all possible control flow paths. Such a register verifier would be 50-60% of the work of a register allocator. But it would keep the programmer in control.

Putting all these ideas together, here's my proposal for the next higher level language above SubX (maybe we can call it Mu again, and retire my old prototype):

Still statement oriented. Most statements will translate to a single SubX instruction. Some may translate to several, but the instructions for statements are never intermingled. There's no attempt at automatic compiler optimization. The translation process is always completely predictable.
Add syntax for defining local and global variables with types. Prerequisite: syntax for defining types. Product types (structs), sum types (type-safe tagged unions), maybe also enums.
Add macros to more tersely specify ModR/M and SIB bytes. Local variables have one standard form, global variables have another.
Create macros for pointer dereferencing, struct access and array indexing, and disallow the lower-level operations they expand to. Now the compiler can make sure that you only ever do a struct access when the register or memory location contains a struct. And so on.
Control flow using {, }, break and loop.
Always check bounds on array indexing.
Use fat pointers for detecting use-after-free errors. This is an idea from this paper. Basically every pointer spans 8 bytes, including a 4-byte allocation id (or allocid). The memory allocator always adds an allocid to the payload it allocates, and also writes the allocid to the pointer it returns to user code. Copying pointers copies allocids around as well. All pointer dereferences first compare pointer allocid with payload allocid and error/abort if there's a mismatch.
Pointers can only point to the start of heap allocations. We may also need a more short-lived address type for other uses, but it would be constrained, couldn't be stored in structs and so on.

All this still ends us up at a fairly low-level language. But it feels like it could be a better C:

Needs only a simple translator.
Not portable, so no undefined behavior. Translation to x86 machine code is fairly transparent. You can always tell what instructions a statement translates to, and therefore what behavior that results in.
Strongly type safe and memory safe. No possibility of reading random numbers as pointers. But still plenty of fine-grained control for most things one may want to do, without introducing too much memory usage and runtime overhead.

That would be a good foundation for the next higher level of language. I'd like to have forks beyond this point, for example one repo with SubX+Mu+μLisp and another with SubX+Mu+μPy. Zero compatibility with existing languages, just providing familiar syntax for people with different tastes.

Hmm, I hadn't really seen all this laid out until now. What do you think?

akkartik commented 5 years ago

Define a new ABI standard. The C ABI is garbage for automatic reflection (why do we still have to parse header files to know what methods exist in a .so?!). Maybe the ELF symbol name for a function could encode type information for parameters, or DWARF symbols can be used somehow? Or there could be a format for specifying a type segment?

This feels more on the OS side of things, and deserves a separate comment. I've been focusing on the language side of Mu lately, but I also periodically try to think about bootstrapping an OS from scratch. Not just out of masochism :) -- the reason goes back to my vision for testability.

Here is a test in my old Mu prototype that allows me to check the state of a blocking function: reading from a keyboard. I want to be able to say things like this. Set up a thread to run with this code and these input/output streams, run it for up to x cycles, inspect its state, assert that it's blocked on this input stream. Add some data to that input stream, run it for some more cycles, check that it made such and such progress, and now it's blocked again. Add some more data to the stream. And so on.

Now that original prototype is just a toy. I used it for teaching programming but it's a simple interpreter and too dog slow for anything serious. But I think about what it would take to build a non-toy with these properties, and how I can gain the expertise to build it. So far SubX uses an off-the-shelf Linux kernel. I've played a bit with compiling the kernel from scratch, and have some ideas of creating a repo of the sources and ripping like 90% of the code out. Then I'd like to improve the interface for spawning processes to something that lets me do the above introspections. Neither fork() nor clone() quite achieves these ends. Not in an efficient manner, at least. I want to be able to distinguish being blocked on a read of the disk that will eventually complete from being blocked on a socket that's not going to get any more data. So I need more than the running/sleeping state of a process. I need to know what it's sleeping on.

Anyway, it's a bit far afield from your idea but there's a common theme of introspection/reflection, and I wanted to say that I don't plan to stick to existing interfaces indefinitely. I'm certainly open to such ideas.

difranco commented 5 years ago

@charles-l

Oh, I have a lot of partially worked out ideas. This is a brain dump of what I've been thinking about recently:

1. I've been very inspired by the paper "[Next-Paradigm Programming Languages: What Will
   They Look Like and What Changes Will They Bring?](https://arxiv.org/pdf/1905.00402.pdf)."
   There are two ideas I really like from this paper. The first is the idea of monotonic programming: adding code doesn't change any previous meaning (which fits nicely with the concept of layers). I think a formal specification of the layer interfaces might allow this.
   The second is that next-generation programming languages will be very declarative while still allowing an escape hatch into lower-level imperative code/custom data structure specification when needed. I've been playing around with logic programming languages and wonder how far a language can be pushed to solve "practical" problems (e.g. games, text editors, web servers). SQL is a big inspiration here, because it's declarative, while still being fast. It's syntax is clugy and it doesn't have an escape hatch, but it's practical.
   I'd love to see a high-level datalog-like logic language that allows you to safely drop into a lower level language (as low as assembly!) and somehow verify that the code written is correct. Not quite sure how to do this, though... It might just have to be similar to what Rust does by making unsafe code explicit, but still a black box.

By the way, I've been working on a roughly probabilistic logic language built around some insights on how to do the evaluation efficiently with adaptive strategies. Should be able to boil purely declarative code down to dataflows that can run quickly on many types of hardware.

https://www.sciencedirect.com/science/article/abs/pii/S0888613X18301610 http://ceur-ws.org/Vol-2219/paper8.pdf https://github.com/difranco/fifth/blob/master/README.md

2. Type system should be implemented in a macro for maximum hackability.

Something I find interesting about logic programming and superset of it that I'm working on is that it subsumes the kinds of computations you'd normally use a type system for into how evaluation in general works because computing with sets and relations is the default.

So these are my mad-scientist, crazy ideas for what a next-gen programming language could look like. I'd like to hear your thoughts on them :) -- I realize not all of them will necessarily work with Mu.

Interesting to see some convergent thinking from a long way off happening here.

charles-l commented 5 years ago

Control flow. While I agree we can do fine with gotos, I'm actually quite pleased with my design for structured programming in the previous Mu prototype. Have you looked at it yet? It's in the top-level directory of this repo (search for 'conditional'). Lower priority, but it gives me the warm fuzzies to only use the word goto to signal something out of the ordinary :)

Ah, I didn't see that -- yeah, I like how this clearly maps 1-to-1 with the underlying assembly instructions.

Register allocation. I actually kinda like us doing manual register allocation in a low-level language. Trying to automate it risks ending up with a 'thick' compiler, so I'd like to explore doing without. What I want though is for the compiler to check my allocation, and raise the alarm on bugs like this.

Yeah, I can see the appeal of keeping it simple. So is the plan to have registers and local variables, and the programmer explicitly manages the registers, and gets lightweight error checking for that? So the onus is on the programmer to know which registers will be used for certain operations (e.g. div)? I guess that does keep the implementation complexity down...

Strongly type safe and memory safe

I really like the way you've laid the plan for memory safety out. If the perf impact is minimal enough (which it looks like it will be), it could be a feature that lets mu eat C's lunch.

charles-l commented 5 years ago

@difranco oh wow -- this is very interesting stuff. I'm going to have to spend some time reading a bit more to comprehend it, but it seems like you've got a way to dynamically optimize the query while it runs?

akkartik commented 5 years ago

So is the plan to have registers and local variables, and the programmer explicitly manages the registers, and gets lightweight error checking for that? So the onus is on the programmer to know which registers will be used for certain operations (e.g. div)?

div, exactly. Special cases in the instruction set aren't hidden, the programmer has to be aware of them.

I'm imagining something like the following syntax:

define a local variable on the stack: var x : int
define a local variable on a register: var x/EAX : int

akkartik commented 5 years ago

Fixed conflicts locally and merged to master.

akkartik commented 5 years ago

I really like the way you've laid the plan for memory safety out. If the perf impact is minimal enough (which it looks like it will be), it could be a feature that lets mu eat C's lunch.

I don't want to understate the performance impact, particularly for protecting against use-after-free. The current prototype for fat pointers (I call them handles there) seems to require 6 instructions (within the 'inline' comments) to dereference a pointer. The 6 instructions include a load, but it will hopefully be rarely taken and so easy for the branch predictor. I'm also assuming that the target of the jump can be in a cold segment of memory that's rarely brought into the caches. There are a few such :abort blocks scattered through the SubX codebase.

Anyway, it may be possible to optimize it down from here by someone with more practice/experience. This is just a baseline.

akkartik commented 5 years ago

After I wrote my previous comment I noticed my prototype had a bug. The baseline is now 9 instructions for a single dereference :/

However, I've been assuming a macro that reads a register and updates the exact same register:

ECX/result <- lookup(ECX)

But I could clobber one additional register, so the macro looks like:

ECX/result, EDX/clobbered <- lookup(ECX)

Then I can do it in 5 instructions.

Here are the two versions, both assuming the input and output are in reg1. Single register:

push *reg1  # handle->alloc_id
reg1 = *(reg1+4)  # handle->address or payload
push reg1  # handle->address or payload
reg1 = *reg1  # payload->alloc_id
compare reg1 and *(ESP+4)
abort if not equal
pop to reg1  # handle->address or payload
add 4 to ESP  # discard handle->alloc_id
add 4 to reg1  # skip payload->alloc_id

With one additional register:

reg2 = *reg1  # reg1->alloc_id
reg1 = *(reg1+4)  # reg1->address or payload
compare reg2 and *reg1  # payload->alloc_id
abort if not equal
add 4 to reg1  # skip alloc_id of payload

akkartik / mu

SubX in SubX: Transforming uses of string literals (C) #32