Rust is not a macro language for assembly

ia0 commented 2 months ago

Thanks for working on educational materials! This looks pretty good, but in my opinion it has one rather problematic issue.

The "What is Ownership?" chapter seems to assume that Rust is a macro language[^1] for assembly (which is a common misconception in systems programming, the origin of most bugs, and probably the downfall of C and C++):

- This chapter does not mention the operational semantics of Rust in the section about undefined behavior. Even worse, it introduces assembly to describe undefined behavior.
- In the section about the stack, there is a note saying that "[the described] memory model does not fully describe how Rust actually works" (which is fine) but justifies it by saying that variables may end up in registers instead of the stack (which is bad, there are no registers in Rust).

This is a problem because it teaches that Rust is defined by its compilation artifacts (i.e. its implementation) rather than by its semantics (i.e. its specification). Over the long term, Rust will be unable to change its implementation without exposing existing silent bugs in user code, because users depend on a particular implementation rather than on the specification. Hyrum's law states that users will eventually depend on a particular implementation anyway. However, I believe that proper tooling (like Miri) that pushes users to code against the Rust specification can avoid or reduce the impact of Hyrum's law, so teaching materials should try to avoid it too.

The suggested fix is rather simple (because the high-level teaching ideas are good):

- Mention the operational semantics of Rust when talking about undefined behavior. It might be fine to say that this semantics is not complete yet but there is a growing consensus, and disagreements are usually on details that don't matter for users learning Rust. It might also be good to mention MiniRust.
- Mention that the notions of stack and heap are components of the operational semantics. It might be fine to say that after compilation, those match the stack and heap in assembly to some extent. Some variables may be in registers instead of the stack. And some allocations may be done on the stack or in registers instead of the heap (e.g. a Box that is created and dropped in the same function).
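
To make the Box example concrete, here is a minimal sketch (the function name `sum_boxed` is hypothetical, not taken from the chapter). In the source-level semantics, `Box::new` creates a heap allocation that is freed when the box goes out of scope. An optimizing compiler may keep the value on the stack or in a register instead, since the program cannot observe the difference.

```rust
/// Hypothetical helper, for illustration only.
/// Semantically, `boxed` is a heap allocation that lives until the end of
/// this function. After optimization, the allocation may be elided entirely.
fn sum_boxed(a: i32, b: i32) -> i32 {
    let boxed = Box::new(a); // heap allocation in the operational semantics
    let total = *boxed + b; // read through the box
    total // `boxed` is dropped at the end of the function, freeing the allocation
}

fn main() {
    println!("{}", sum_boxed(2, 3));
}
```

Whether the allocation actually survives compilation depends on the compiler and optimization level, which is exactly why the description should be phrased in terms of the semantics.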

[^1]: A macro language is a language defined by its compilation to target languages. This is in contrast to languages defined directly and independently of target languages, for example with an operational semantics at source level.

austin362667 commented 1 month ago

I second that.

I think it certainly needs more elaboration on semantic-level concepts that won't change too much over time.

willcrichton commented 6 days ago

I'm of a few minds about this take. On the one hand, I agree, and Ch4 is written as it is because I agree. I specifically teach the MIR-level semantics of Rust because it's the level at which the borrow checker (and Miri) think about Rust.

On the other hand, the borrow checker is only one part of Rust. Rust is, actually, a macro language for assembly. The vast majority of Rust programs will be compiled to x86 or ARM or Wasm or whatever, and executed as such. The effects of undefined behavior will ultimately appear within these settings.

Additionally, as a matter of pedagogy, most readers will not be familiar with enough programming language theory to understand the concepts or vocabulary of operational semantics. They will, however, likely understand the idea of assembly, and the idea that there's a "high-level semantics" of a Rust program and an "actual semantics" depending on the compilation target. I worry that the points you're asking to include are too nuanced / too advanced for relatively little benefit to the average reader.

ia0 commented 6 days ago

Thanks for the explanation!

Thinking about it, I was mostly worried about the target audience writing unsafe Rust with such teaching material. But as it appears, this chapter is not meant for such an audience. I got confused because this chapter uses words like "safety", "unsafe", and "undefined behavior", which are usually used in the context of unsafe Rust. But this chapter does not have any unsafe Rust and doesn't promote usage of unsafe Rust either.

So my recommendation instead would be to use alternative wordings to avoid such confusion. Here are suggestions:

- "safety" could become "correctness" or "well-behavedness"
- "unsafe" could become "incorrect" or "bad"
- "undefined behavior" could become "incorrect behavior" or "bad behavior"

What do you think?

willcrichton commented 6 days ago

I do still want to use some of the relevant terminology. This chapter is based on the conscious decision that Rust learners can benefit from knowing something about undefined behavior even if they don't write a single line of unsafe code. (This idea is based on our human factors research: https://dl.acm.org/doi/10.1145/3622841)

ia0 commented 6 days ago

The usage of "unsafe" in that paper seems wrong too. The paper says:

Participants frequently struggled to construct a correct counterexample to an unsafe function. For example, consider the make_separator program, shown on the right, which returns a dangling pointer to the variable default.
fn make_separator(user_str: &str) -> &str {
    if user_str == "" {
        let default = "=".repeat(10);
        &default
    } else {
        user_str
    }
}

This function is not "unsafe". It is ill-typed. The paper seems to assume that something else was written:

fn make_separator(user_str: &str) -> &str {
    if user_str == "" {
        let default = "=".repeat(10);
        unsafe { std::mem::transmute(default.as_str()) }
    } else {
        user_str
    }
}

Or maybe it assumes the program is written in C with Rust syntax, which is not well defined.
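
For contrast, here is one possible well-typed variant. This is a sketch of my own, not something from the paper: the name `make_separator_cow` and the choice of `Cow` as the return type are assumptions. It returns either a borrow of the input or an owned `String`, so no reference outlives its allocation and the type checker accepts it.

```rust
use std::borrow::Cow;

/// Hypothetical well-typed variant of `make_separator`.
/// The default separator is returned as an owned `String`, so nothing
/// points at the local variable after the function returns.
fn make_separator_cow(user_str: &str) -> Cow<'_, str> {
    if user_str.is_empty() {
        Cow::Owned("=".repeat(10)) // the caller takes ownership of the allocation
    } else {
        Cow::Borrowed(user_str) // borrows from the caller's string
    }
}

fn main() {
    println!("{}", make_separator_cow(""));
    println!("{}", make_separator_cow("--"));
}
```

Returning a plain `String` would also work, at the cost of copying `user_str` in the non-empty case.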

I agree that Rust learners would benefit from understanding memory, such that they can understand the error messages of the type (and borrow) checker. But I don't think that "unsafe" or "undefined behavior" are the concepts they lack. The concepts they lack are:

- What is an allocation? A contiguous span of memory, either on the heap or on the stack.
- How long is an allocation valid? The answer differs between the heap and the stack.
- References must point to valid allocations.
- etc.
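
A minimal sketch of these concepts in safe Rust (the function name `stack_and_heap` is hypothetical). The commented-out part shows the kind of program the type checker rejects because a reference would outlive its allocation.

```rust
/// Hypothetical example: returns the values read from a stack allocation
/// and from a heap allocation.
fn stack_and_heap() -> (i32, i32) {
    // Stack allocation: `x` is valid until the end of this function.
    let x = 5;
    let r = &x; // fine: `x` outlives `r`

    // Heap allocation: owned by `b`, valid until its owner is dropped.
    let b = Box::new(10);
    let moved = b; // ownership moves; the heap allocation stays valid

    // A reference must point to a valid allocation. The lines below are
    // rejected at compile time (error E0597) if uncommented, because
    // `short` is dropped while `dangling` still refers to it:
    //
    // let dangling;
    // {
    //     let short = String::from("hi");
    //     dangling = &short;
    // }
    // println!("{dangling}");

    (*r, *moved)
}

fn main() {
    println!("{:?}", stack_and_heap());
}
```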

In C/C++, misunderstanding those concepts leads to undefined behavior. In Rust, misunderstanding those concepts leads to a type error.

cognitive-engineering-lab / rust-book

Rust is not a macro language for assembly #186