A hello from a similar project 👋🏻

federicotdn commented 10 months ago

Hey! I found your project some weeks ago and found it super interesting - I am doing something similar, but in Go: https://github.com/federicotdn/pimacs.

Your project is more active than mine, but I thought it would be interesting to share ideas. Go and Rust are very different languages of course but some challenges are more or less the same regardless of the language. Some of the conversations For example I've seen the following topics discussed on this repo, that I've also thought about!

Handling if strings, multibyte vs. unibyte, using plain UTF-8 vs. Emacs' custom UTF-8-like encoding
Importing Elisp files from the Emacs codebase into Rune
Concurrency! (a fun one to tackle in Go, given goroutines + channels)
Implementation of an Elisp reader, bytecode interpreter and/or JIT compiler
Testing in general (using Emacs' ERT?)

I've written about some of my design decisions here: https://github.com/federicotdn/pimacs/blob/main/etc/design.md. As a part of the project, I've written a Python script called extract.py that creates a JSON file with information about all the subroutine decalarations in Emacs, perhaps that could be useful for Rune as well! (https://github.com/federicotdn/pimacs/blob/main/test/data/emacs_subroutines.json)

I'm aware that a GitHub issue is not a great medium for these types of discussions, so please feel free to close it, as it is not really an "issue" with Rune itself. Cheers!

CeleritasCelery commented 10 months ago

Hi @federicotdn! it is so cool to see your project! I think there could be a lot for me to learn from your implementation. I read through your design doc and ran some sample programs. I think Github issues are a fine medium for discussion, so if it is okay with you I have a few questions/observations.

why go?

Why did you choose go for this project? having a builtin in garbage collector seems useful. Also goroutines can make parallelism easier. However I feel like the tricky part of adding goroutines to Emacs is not implementing the interpreter to support it, but defining the semantics in elisp. I have written about that here and am curious what your thoughts are about using threads in Emacs. We currently support multi-threaded elisp in the most basic sense with the go function. But most of the elisp semantics have yet to be nailed down.

elisp string compatibility

It looks like you are trying to be "bug compatible" with Emacs because you say that "The current main design guide for Pimacs is to implement an Elisp interpreter that allows for running any Elisp script without modifications". How do you plan to handle mutable strings, given that go strings are immutable? Also Curious what you plan to do about the encoding of "raw bytes" in multibyte strings?

debugger

The interpreter in Rune was initially structured similar to pimacs, in that we don't use longjump but instead return error types and propagate them. This was how we handled backtraces and displaying errors. But I observed that in GNU Emacs you can enter the debugger when you get an error. You can also resume from this error point. That means you can't unwind the stack on error and need the context (the specpdl stack) without unwinding. You also need to know if there is a condition case to catch the error or not without unwinding first. We are in the process of reworking the runtime to support a debugger and not only unwinding the stack when the debugger is not active. Though I do think your use of defer to automatically clean up the specpdl stack is quite clever and will save your project from a lot of bugs.

tagging

How are your LispObject's defined? I don't understand go well enough to determine that. Are you using some sort of tagging scheme so they will fit in a machine word? How is nil represented?

passing many arguments

When calling a LispFnM, which takes an unbounded number of arguments, where are you storing the arguments for the call? Are you allocating an array each time? I ask because that is something we have been looking recently.

Emacs subroutines.json

I thought this was a very cool idea. I currently don't have any way to track which subroutines we are missing, or if we have the wrong number of arguments etc. Your solution makes it easy to upgrade Emacs version because you can always find new functions or ones that have changed. I eventually want have a testrunner that can run directly against GNU Emacs to compare functionality.

federicotdn commented 10 months ago

(I'm reading a bit the project code and your blog posts, I'll get back to you!)

CeleritasCelery commented 10 months ago

We also have a design doc, but it is not as well organized as yours.

federicotdn commented 10 months ago

Ok getting back to you!

Why Go?

Mainly, I find Go programming fun so I thought it was a cool project to test it out. And I also thought, re-implementing Emacs is such a massive project, that I'm going to need all the help I can get. The GC is great to have OOTB, plus the syntax of Go itself is quite easy, so re-writing all the subroutines would be (in theory) a quicker process than with a more complex language. It also has a good library ecosystem, which helps as well. I was definitely focusing more than anything on developer efficiency. The concurrency utilities are of course very handy as well, but I wasn't thinking of them so much when I started (given that concurrency in Emacs is not thaaat important after all).

I read your blog post about concurrency, and the design doc! It is definitely a hard problem to nail down. Personally I went with having some things shared among goroutines, and others not: shared/global:

obarray for symbols interned during startup (protected by mutex)
global buffers list (protected by mutex)
all symbols
all other Elisp objects

goroutine-local:

the stack(s)
the current buffer
obarray for symbols interned after goroutine starts

So with this approach, each goroutine also has its own obarray. This allows binding variables like lexical-binding per-goroutine. However this may have been better to do via symbol value lookup, though I feel like the logic for it in Emacs is already complex enough. With the current system, a symbol interned in the local obarray shadows the one in the global obarray. Maybe both systems are equivalent...

I haven't really tested this design so I don't know if it would work in practice. Your approach seems much more though-out (and probably tested as well). Have you looked into how Emacs does it? I believe it does have a per-thread struct where it stores some state, but maybe it's not using it for much.

Elisp string compatibility

Yes, they are immutable! One can actually cheat by using the unsafe module (which gives access to the raw data being pointed at), but I decided not to, because Go asserts that all strings need to be immutable.

So instead my lispString is basically:

type lispString struct {
    valMut []byte
    val    string
    size_  int
}

where val is used by default to store the string value. If the user tries to modify any part of the string, then the value is copied over to valMut (as bytes) and val is set to "" (there's no way of setting a string to null in Go). So it's kind of a copy-on-write.

In Go, string is basically an immutable []byte. The reason I did not use []byte and nothing else is that so many library utilities operate on string, that it is actually very convenient to start off with string, as I assume that most strings in Elisp are actually never mutated. Converting []byte to string requires copying bytes, so once the lispString has been mutated, converting it to a string is a bit expensive.

To keep the struct as small as possible, I use size_ to signal different things: if its value is -1 then that means the string is unibyte - i.e. you can know the length by just doing len(val) or len(valMut) (that is, just counting the bytes).

If size is 0, then it means that the number of runes need to be counted (maybe because the string bytes were modified recently). If size is larger than 0 then that is effectively the cached number of runes in the string. In both cases, the string is multibyte. Here I'm always using UTF-8 (the Go stdlib has utilities for it).

This was as small as I could get the strings without potentially breaking things inside the Go runtime itself. I bet you can do something much more compact with Rust. How would you describe the one in Rune at the moment?

Regarding Emacs' custom UTF-8 variant (lets call it EU8) - I thought about it and decided that it wasn't worth the complexity. By using plain UTF-8 I can use Go's existing libraries (plus, the language uses UTF-8 for string literals as well), so everything fits in nicely.

Do you know how common it is to store raw bytes within multibyte strings? I feel like it can't be too normal, right? The Emacs docs also state that:

Emacs extends this range with codepoints in the range #x110000..#x3FFFFF, which it uses for representing characters that are not unified with Unicode and raw 8-bit bytes that cannot be interpreted as characters.

what would be examples of "characters not unified with Unicode"? I wonder if Emacs makes use of this range itself. But in any case, I have a strong feeling that normal UTF-8 should cover the majority of use cases.

But yeah, strings and encoding is a very cool topic to think about.

Debugger

That is an interesting point about the debugger. My Rust knowledge is basic so I couldn't find how you are currently managing the stack aspects in Rune. But if you say it could be a problem, then you are probably right, I didn't get around to implementing a debugger yet.

For catching errors, I do think I managed to get it working though: when calling throw, the function scans the execution stacks and checks if a tag was pushed into it earlier during execution: https://github.com/federicotdn/pimacs/blob/main/core/eval.go#L276

Thank you for the comment about defer. It makes me miss a bit C++'s RAII features.

Tagging

For this it took a veeeery easy solution, but probably very memory wasteful as well. Basically all my Elisp objects are structs that also satisfy an interface called lispObject. This post by Russ Cox explains the internals of interfaces a bit: https://research.swtch.com/interfaces . But essentially interfaces are a two-word pair, one pointer pointing at the underlying data, and another pointing pointing at a structure (itab) containing type information about the underlying data (e.g. lispString).

I pass around lispObject everywhere, which did simplify some things. I believe the Rune equivalent would be Rt<GcObj> right?

I find the Rune approach very cool though, like in:

#[defun(name = "%")]
pub(crate) fn remainder(x: i64, y: i64) -> i64 {
    // TODO: Handle markers
    x % y
}

The fact that you can operate directly with i64 is much more convenient. In Pimacs this would be:

defun (ec *execContext) modulo(x, y lispObject) (lispObject, error) {
    if !integerp(x) {
        return ec.wrongTypeArgument(ec.s.integerp, x)
    } else if !integerp(y) {
        return ec.wrongTypeArgument(ec.s.integerp, y)
    }

    return xInteger(x) % xInteger(y), nil
}

Also using tagged pointers is clearly much more memory efficient!

In my case, nil is just a lispSymbol pointing at itself: https://github.com/federicotdn/pimacs/blob/main/core/symbols.go#L124 I also set it as an execContext-level variable, so that it can be accessed from "anywhere" by just writing ec.nil_.

Passing too many arguments

For this I convert the Elisp list of arguments into a Go slice of lispObject ([]lispObject), here: https://github.com/federicotdn/pimacs/blob/main/core/eval.go#L16 . This function itself is called by eval_sub.

This is a very basic approach though, maybe there is a better way of doing it.

emacs_subroutines.json

Thanks. I also figured that I would "tie" Pimacs to a certain Emacs version - this is configured here: https://github.com/federicotdn/pimacs/blob/main/tools/extract/config.json - this file determines the contents of emacs_subroutines.json, but also the version of the Elisp files that I copy over! So then everythin is consistent. How does Rune decide which version of Emacs to target?

UI/UX

I also share your idea of separating the core part of the editor from the UI, like in the xi editor. However I've looked around extensively and the status of GUI libraries is not great, for Go at least. The problem is that you need an extremely flexible library - after all, you need to implement all Emacs features, which do not really follow any the guidelines of any native GUI toolkit.

So then, you maybe can use something a bit more low-level, like some immediate-mode library, but then you need to implement a lot of things yourself. Text rendering in particular seems like a scary topic. I am curious as if you've had ideas on this yet.

A UI for the terminal seems a bit easier to achieve, and still very useful (thankfully). Maybe that would be a good starting point for both projects.

Gap buffer vs. Rope

I agree that the Gap buffer is fast enough. The fact that they are easy to implement is a huge plus. But additionally, I think every design decision that makes Rune/Pimacs internally more similar to Emacs is also a good thing in the long term; it can potentially make it easier to port over more code from Emacs (even though technically how a buffer is implemented should not really matter, but in practice things tend to be more complicated). The Emacs codebase has a lot of surprises.

Bytecode / source interpeter

My understanding is that currently Rune can read Elisp source code, and it can also compile it to bytecode and interpret it from there. Is that correct? Can Rune read .elc files generated by Emacs? Would that even be desireable?

Project philosophy

I am very suprised at how both of the projects are very similar in terms of design and purpose! It seems like we arrived to very similar ideas independently; using very different tools.

CeleritasCelery commented 10 months ago

I am going to respond in multiple comments as I get time.

threading

You seemed have arrived at a similar conclusion for what should be shared vs thread local. In Rune this is how we split it up.

shared globally

interned obarray (mutex)
buffer list (mutex)
interned symbol functions (atomic)

All interned symbols are shared between all threads. They (currently) have an atomic field where the function pointer is stored, so functions are shared as well. values of symbols are thread local and stored in a hashmap. I think properties will be shared as well.

thread local

all lisp variables
uninterned symbols
symbol values
the stack

We don't share any lisp objects between threads, because doing so makes reasoning about your program and resolving bugs much harder. We also have a local obarray for uninterned symbols. The value of symbol is first looked up in the threads variable cache, and if not found it requests the value from the "main" thread. If it is not found there, it is a void variable.

As I mentioned in my post, buffers are kind of unique in that they are shared but only a single thread can have it open at any time. This is controlled by a mutex. I still have not decided how to handle buffer locals though (does each thread get a copy or does accessing a buffer local share it with that thread).

CeleritasCelery commented 10 months ago

String type

Strings in rune are defined very similarly to pimacs.

https://github.com/CeleritasCelery/rune/blob/2d03e7bd016e28ae41eb96dd2828a0369b661549/src/core/object/string.rs#L11-L24

We have an enum that can either be a UTF-8 string or a raw byte array. I agree with you that I don't think supporting raw bytes is worth the extra complexity.

Do you know how common it is to store raw bytes within multibyte strings?

I don't think it is that common, but Emacs has support for it. If you want to fully compatible with emacs lisp you would need to handle it though. For example:

(string-to-multibyte (unibyte-string 128))

what would be examples of "characters not unified with Unicode"?

These are the raw bytes. For example if you go to a buffer and eval (insert-byte 128 1) it will insert a raw byte. Then call M-x describe-char on that character it will report it is code point #x3fff80 which is outside valid unicode range. It actually uses a special encoding so that raw bytes are two bytes instead of four.

tagging

the #[defun] macro is something we lifted from remacs. It is really nice because it automatically handles type checking and conversion of all the arguments and return value, which let's you write really succinct code like you saw with remainder.

I pass around lispObject everywhere, which did simplify some things. I believe the Rune equivalent would be Rt right?

A LispObject is just a GcObj (which is a type alias for Gc<Object>). Rt represents an object that is reachable from a root (and hence can survive calls to garbage_collect). Since we don't have a garbage collector, we have to manage all that ourselves. Having a garbage collector like Go saves a lot of implementation complexity.

CeleritasCelery commented 10 months ago

UI

I also share your idea of separating the core part of the editor from the UI, like in the xi editor. However I've looked around extensively and the status of GUI libraries is not great, for Go at least. The problem is that you need an extremely flexible library - after all, you need to implement all Emacs features, which do not really follow any the guidelines of any native GUI toolkit.

I will admit that I am pretty ignorant about GUI's. The Rust ecosystem does not have great options either. There are some new projects like xilem that are exploring this space. Though it seems like most of the requirements for an Emacs GUI are pretty simple. We need fast text layout and rendering in multiple panes (what Emacs would call windows). It doesn't have fancy widgets or graphics or animations. I think using something lower level (like GTK) is probably the way to go.

As far as TUI's go, I am trying to avoid going down that path first. A lot of the functionality in Emacs is hamstrung by being a "TUI first" program. For example it treating C-i and tab as the same key, and treating control keybindings as ascii inputs in general. I am not opposed to having a TUI, but I would like to make Emacs "GUI first" and undo some of the legacy around being run in a terminal.

Bytecode / source interpreter

Rune is currently to the point where the interpreter can bootstrap enough elisp to load the byte compiler (written in lisp) and then compile the functions. The bytecode VM is also implemented and is run as well. We don't load bytecode functions (.elc) directly, but it would be trivial to implement that. However given that rust is a memory safe language, I would like to have some way to verify bytecode that we load before we execute it (I have crashed Emacs many times playing around with bytecode).

Project philosophy

I do find it interesting that we arrived at the same approach on many things, especially around concurrency. I think it would be so cool to have the capability in Emacs.

Qkessler commented 10 months ago

This has been cool to read! Thanks both!

federicotdn commented 9 months ago

Amazing, thank you for taking the time to reply with so many details. I'm going to be keeping an eye on this project to see how it progresses. If some day I decide to resume learning Rust, maybe I can try submitting a PR here!

CeleritasCelery commented 9 months ago

@federicotdn I have been thinking there might be a way for the projects to collaborate. I have been wanting to create a tool to test the implementation against GNU Emacs, since that is accurate measure of correctness. I was planning on creating a tool that can extract the arguments for builtin function from Emacs and then generate inputs for them. We would than feed this into both GNU Emacs and Rune and ensure that the inputs match. Essentially property testing against Emacs. That will hopefully help flush out a bunch of issues, and anything that is found can be added to normal unit tests.

This got me thinking that you already have a great start to this with your extract tool in pimacs! It can already parse the source files of GNU Emacs format them nicely in JSON. We could build on this to make a tool that read the JSON and generates the inputs and then runs it against both GNU Emacs and the other implementation. This could be used for both Rune and pimacs. What do you think?

federicotdn commented 9 months ago

Yes! That sounds like a good idea. I'm curious as to how you would generate inputs for each function though, as you would need to know the type of each argument beforehand, and also have a set of expected outputs for those arguments. So basically you would need hand-written test cases for each function. And at that point, automatic extraction of all functions stops being so useful 🤔 . Unless you have something else in mind.

In Pimacs I was using the JSON file to quickly test that any subroutine defined in Pimacs, with a name identical to one in Emacs, had the same set of attributes: arity, lack of return value, etc. (well actually I only got to test the arity but the rest would not be so difficult to add). (https://github.com/federicotdn/pimacs/blob/main/core/exec_context_test.go#L75)

But yeah, in any case of course we can bring the tool into Rune 👍🏻

CeleritasCelery commented 9 months ago

I'm curious as to how you would generate inputs for each function though, as you would need to know the type of each argument beforehand, and also have a set of expected outputs for those arguments.

The idea is that you could generate random input for a function and then feed it into both Emacs and the alternative implementation and make sure they get the same output. So the expected output is whatever Emacs returns (including errors). We already know the arity of functions thanks to your script, and that would be enough for a first pass. But rune also has type annotations on the builtin functions. That could be fed into the JSON and then used to have more specific inputs.

For example if we picked the + function. We could feed it random data and make sure that we return that same thing as Emacs. Most of these would be wrong-type-argument errors, but some would be valid by chance. If we used the type information from the JSON we could also limit the inputs to only numbers, and this would get much more interesting results. In fact we could run it in both modes.

I see this as a low effort way to flush out bugs, because we don't need to define manual tests or types.

federicotdn commented 9 months ago

Aah ok, I see, then more or less the flow would be:

For each subroutine S defined in Rune:
- Check S against emacs_subroutines.json to check arity and potentially other attributes
- Generate random inputs based on the Rust parameter types for S
- Evaluate S(args) and check that it returns the same as in Emacs itself (or errors out in the same way)

To me the fact that you are already using Emacs itself in this process makes the JSON file less valuable, since you can probably evaluate some Elisp code within Emacs that will return you the subroutine metadata anyway.

For example you could do:

For each subroutine S defined in Rune:
- Start up Emacs and look up S. If S is not defined there, skip this iteration
- If S is found, use help-function-arlist (or equivalent) to check that the Rune <-> Emacs signatures are compatible
- Generate random inputs based on the Rust parameter types
- Evaluate S(args) and check that it returns the same as in Emacs itself (or errors out in the same way)

federicotdn commented 9 months ago

Or maybe you could do two separate things: one where you use the JSON file as a very quick arity/subroutines name check, and then a slower one using the second method described above. Because I can imagine that for ~1500 subroutines, doing multiple tests for both Rune and Emacs might get slow!

CeleritasCelery commented 9 months ago

You are right that we could just do this all from Rust. Have it parse the arity and types and then generate the input forms, run the Emacsen, and report the results. But the idea is that if you instead write out a JSON with that information you can feed that into a python program that does the generation and comparison of functions. That way Rune could have a small program to create the JSON and pimacs could have one to create the JSON. This let's both projects share the common python test framework to compare to GNU Emacs.

federicotdn commented 9 months ago

Sounds good. I can start in the following weeks by bringing over the Python script to this repo and outputting the JSON somewhere. In this regard:

Where should I assume the source code for Emacs is?
Should we skip some older Emacs files? For example, I was skipping some MS-DOS .c files because they had some quirks in terms of their DEFUNs, that I didn't want to waste time on. I can check again what this was.
Any preferred output directory that we can .gitignore?

CeleritasCelery commented 9 months ago

I would think that the Emacs source code as well as the actually Emacs executables should be arguments to the script. That way if a project is tracking a different version (we are currently 29.1) there are no builtin assumptions.

I think is safe to ignore MS-DOS or anything else that does not seem very useful. I am hoping that as I dive into this that there will be more useless functions that I can just throw out. The easiest code to maintain is code that does not exist!

As far the output directory goes, Rust usually puts things in target/ but we could create another one called output/ if we wanted.

CeleritasCelery / rune