BrooksPatton / learning-rust

Learning Rust from the official Rust book
MIT License

`str` `&str` and `String` #2

Closed dannyfritz closed 6 years ago

dannyfritz commented 6 years ago

These took me a while to understand. But String is like your Java class type. &str is a reference to some UTF-8 characters in memory. And str is UTF-8 characters as a value.

String::from() is really useful to convert other types into the more flexible String. .into() is also very useful in general for type casting, but is confusing until you learn more about traits and Rust's typecasting inference system.

http://www.ameyalokare.com/rust/2017/10/12/rust-str-vs-String.html

BrooksPatton commented 6 years ago

Awesome, thanks Danny!

fschutt commented 6 years ago

@BrooksPatton I just wanted to comment on this... the article is correct, but doesn't really explain why and when you want references and when you want Strings. This is also slightly leaning into chapter 4 about borrowing, but I think the book misses the technical explanation of why the match temperature_symbol wasn't working. I almost expected you to run into this, so that was a bit mean from my side, but I was on my phone, sorry.

You said that you know C, right? So I assume that you know what a pointer is. A pointer is just a memory location like 0x0BADC0DE. A pointer is usually 32 bit (4 byte) or 64 bit (8 byte) wide, so copying one around is very cheap. A reference (in Rust) is nothing else than a pointer, with the guarantee that the pointer is valid (it points to an initialized value of the correct type) and is not 0x00000000 (a null pointer).
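If you want to check the size claim yourself, here is a minimal sketch using std::mem::size_of from the standard library. The helper name reference_size is made up for illustration.

```rust
use std::mem::size_of;

/// Size in bytes of a reference to an i32 (hypothetical helper for illustration).
fn reference_size() -> usize {
    size_of::<&i32>()
}

fn main() {
    // A reference to an i32 is one machine word: 4 bytes on 32-bit targets, 8 on 64-bit.
    println!("size of &i32:  {} bytes", reference_size());
    println!("size of usize: {} bytes", size_of::<usize>());

    // References are cheap to copy: both `r1` and `r2` stay usable afterwards.
    let a = 5_i32;
    let r1 = &a;
    let r2 = r1;
    println!("{} {}", r1, r2);
}
```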

Understanding str and &str

Now let's understand where the pointer points to. For this it's important to understand the difference between stack and heap memory (to understand why Rust complained that you can't return a &String from a function):

When a program starts, it just runs the assembly instructions from top to bottom. The compiler has to encode the size and memory layout of the stack at compile time, meaning that it has to know the sizes of all the types that it wants to put on the stack. For a pointer, this is known (32 or 64 bit). But a String can be any size: maybe it's empty, maybe it contains 10000 characters, and the size can change while the program runs. This isn't known until we actually run the program, so we can't put the characters themselves on the stack.

Here is an example program:

fn main() {
    let a = 5_i32; // must be an i32, because my_print_function takes a &i32
    my_print_function(&a);
}

fn my_print_function(what_to_print: &i32) {
    println!("{}", *what_to_print);
}

Note: the * operator "dereferences" the what_to_print variable so that we don't print the memory location, but the actual number.
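A small aside, as a sketch: with println!, the {} placeholder follows the reference on its own, so the * is optional in this particular case. To see the memory location you have to ask for it with {:p}. The helper name describe is made up for illustration.

```rust
/// Formats the number behind the reference twice: once with `*`, once without
/// (hypothetical helper for illustration).
fn describe(what_to_print: &i32) -> String {
    // `{}` on a reference prints the value it points to, with or without `*`.
    format!("{} {}", *what_to_print, what_to_print)
}

fn main() {
    let a = 5_i32;
    println!("{}", describe(&a)); // prints "5 5"

    // `{:p}` prints the memory location instead; the exact address varies per run.
    println!("{:p}", &a);
}
```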

When the program runs, the first thing it does is to "push" the number "5" onto the stack:

image

Now, the program encounters the function my_print_function. Each function has a so-called "stack frame". You already encountered this when you viewed a "backtrace". A backtrace contains all the names of stack frames (i.e. usually the function name), starting from main():

image

The final assembly code looks something like this (pseudo-assembly):

.main:
    push 5;                       ; compiler knows at compile time: variable has the address 0x101
    call my_print_function 0x101
    ret

.my_print_function @ref_to_i32:
    call println [ref_to_i32]     ; dereference the address and call println
    ret

Now that the function completes, ret is executed. It "unwinds" the stack, meaning that all the memory used up for the 5 is free again and can be reused by later function calls:

image

When the address of main is reached, the program exits. Now why am I telling you this? Well, a str (pronounced ess-tee-arr) that you write as a literal in your program, like "f", has a size that is known at compile time. Do not confuse it with String. So, the compiler does this:

image

Note: str is not an object, but a primitive, but it helps to think of it as an object.

In this case, 0x101 would be an str with a length of 1. In Rust, a reference to an str always carries the length with it, to protect against overflow. In C, strings are terminated with \0 instead, but this has caused lots of problems in the past, so Rust doesn't do that. Now f is the actual str, but we can't copy an str around. The compiler will still call my_print_function with 0x101 (plus the length), but what happens now: the println function takes the length of the string, reads that many bytes starting at 0x101 (in this case only 1 byte) and pushes them to the program's stdout.

This is what a &str is: a pointer to the start of an str primitive, together with its length. We can't copy the str around, for reasons you'll understand later, but we can copy the &str, since the compiler knows the size of it (two machine words: the address and the length). Note: only the pointer is copied. Pointers are "trivially copyable", while Strings are not.
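You can look at the two parts of a &str yourself. This is only a sketch: as_ptr() and len() are real methods on str, the helper name parts is made up. One detail worth checking is that the length lives in the reference, so a &str is two machine words wide.

```rust
use std::mem::size_of;

/// Returns the (address, length) pair behind a string slice
/// (hypothetical helper for illustration).
fn parts(s: &str) -> (*const u8, usize) {
    (s.as_ptr(), s.len())
}

fn main() {
    let f: &str = "f";
    let (address, length) = parts(f);
    // The address varies between runs and builds; the length is 1 byte.
    println!("address = {:p}, length = {}", address, length);

    // Address plus length: two machine words.
    println!("size of &str: {} bytes", size_of::<&str>());

    // Copying the reference copies only those two words, not the text.
    let copy = f;
    println!("{} {}", f, copy);
}
```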

Understanding String and &String

Now let's look at how the heap works and how String is laid out. As you may have noticed, we have a problem: We can't really work with strings of dynamic length, since we don't know how much memory they need. So we need "dynamic" memory. You might have heard about malloc and free in C; that's what they are for.

Suppose we want the user to input something and we don't know the length of the input at compile time. What we can do is use a String, which pushes the characters to a buffer that grows dynamically:

image

Here, the user has started to type "farenheit". The String is an object consisting of a pointer (to the data), a length (how much memory is taken up) and a capacity (how much memory is reserved). The capacity is there so that we don't need to ask for more memory on every character, but in blocks that grow with the string length (for example, first request 2 bytes, then 4 bytes, then 8, 16, 32 bytes). The length of the String is never larger than the capacity.
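To watch the length and the capacity while the buffer grows, you could run something like the following sketch. len() and capacity() are real String methods; the function name grow is made up, and the exact capacities you see are an implementation detail that may differ between Rust versions.

```rust
/// Pushes the characters one by one and records (length, capacity) after each push
/// (hypothetical helper for illustration).
fn grow(input: &str) -> Vec<(usize, usize)> {
    let mut buffer = String::new();
    let mut steps = Vec::new();
    for c in input.chars() {
        buffer.push(c);
        steps.push((buffer.len(), buffer.capacity()));
    }
    steps
}

fn main() {
    for (len, capacity) in grow("farenheit") {
        // The capacity jumps in blocks; the length grows by one byte per ASCII character.
        println!("len = {}, capacity = {}", len, capacity);
    }
}
```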

But now, what is 0x101? It is now a &String, the base address of where the String object starts. The difference between &str and String is simple: the memory layout is completely different. A String is resizable, an str is not. When you call .to_string(), it copies all the bytes from the str to the heap. Allocating on the heap is rather slow, so use it carefully (it doesn't matter for Hello World programs).
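A quick way to convince yourself that .to_string() really makes a copy is to compare the addresses of the bytes. This is a sketch; the helper name copied_to_new_location is made up for illustration.

```rust
/// Copies a string slice to the heap and reports whether the copy has the same
/// text but lives at a different address (hypothetical helper for illustration).
fn copied_to_new_location(s: &str) -> bool {
    let owned: String = s.to_string();
    owned == s && owned.as_ptr() != s.as_ptr()
}

fn main() {
    let literal: &str = "f";
    let mut owned: String = literal.to_string();
    // Two different addresses: the literal's bytes and the heap copy.
    println!("literal at {:p}, copy at {:p}", literal.as_ptr(), owned.as_ptr());

    // Only the String can grow; the literal stays as it is.
    owned.push_str("ahrenheit");
    println!("{} {}", literal, owned);
    println!("{}", copied_to_new_location(literal));
}
```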

Now what happens when my_print_function returns and the stack shrinks? Memory on the heap is not automatically freed like on the stack. In C, forgetting to free it is a problem called a "memory leak": you have to make sure to actually free the data again. In Rust, the String object has a so-called "destructor", a method that runs before the data on the stack is deleted and frees the heap memory. The destructor runs automatically when the String goes out of scope, and normally even when a panic unwinds the stack:

image
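You can't easily watch the destructor of a String, but you can write your own type with a destructor and see when it runs. In Rust this is the Drop trait. The sketch below uses a made-up type Noisy that writes its name into a shared log when it is destructed; it is an illustration of the mechanism, not of how String is implemented.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical type that records its name when its destructor runs.
struct Noisy {
    name: &'static str,
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for Noisy {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Creates two values in an inner scope and returns the order in which
/// their destructors ran.
fn drop_order() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        let _first = Noisy { name: "first", log: Rc::clone(&log) };
        let _second = Noisy { name: "second", log: Rc::clone(&log) };
        // Both destructors run at the closing brace, in reverse order of declaration.
    }
    let order = log.borrow().clone();
    order
}

fn main() {
    println!("{:?}", drop_order()); // prints ["second", "first"]
}
```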

So this is what a String and a &String are. Strings are not copied implicitly, because that would be a performance hit. Instead, you can call the .clone() method, which duplicates the full string. There's more to the story, like why you can't push to a String while holding a reference to it. But that's for another time.
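Here is a sketch of the difference between assigning a String (which moves it) and calling .clone() (which duplicates the heap data). The function name duplicate is made up for illustration.

```rust
/// Returns the original together with an independent copy
/// (hypothetical helper for illustration).
fn duplicate(original: String) -> (String, String) {
    let copy = original.clone(); // explicit, full copy of the heap data
    (original, copy)
}

fn main() {
    let a = String::from("celsius");
    let b = a; // this moves the String; `a` can no longer be used
    // println!("{}", a); // would not compile: `a` was moved into `b`

    let (mut first, second) = duplicate(b);
    first.push_str("!"); // changing one does not affect the other
    println!("{} {}", first, second); // prints "celsius! celsius"
}
```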

Why you can't return a &String or a &str from a function

This should be fairly obvious now: The stack unwinds in the reverse order, meaning first the local variables of my_print_function get destructed, then the local variables of the parent function (in this case main()) get destructed. So what happens when you return a pointer to a local variable to the parent function? At the time the control flow reaches the parent function, the pointer points to free (uninitialized) memory because the local variables of the sub-function have already been deleted. This is a huge problem in the C and C++ world.
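As a sketch of what this looks like in Rust: the commented-out function is the version the compiler rejects, and the one below it hands the String itself to the caller instead of a pointer to it. The function name read_symbol is made up for illustration.

```rust
// This version does not compile: `symbol` is a local variable, so the
// reference would point into a stack frame that no longer exists.
//
// fn read_symbol() -> &String {
//     let symbol = String::from("f");
//     &symbol
// }

/// Returns the String itself, so ownership moves to the caller and
/// nothing is left pointing at the old stack frame (hypothetical function).
fn read_symbol() -> String {
    let symbol = String::from("f");
    symbol
}

fn main() {
    let symbol = read_symbol();
    println!("{}", symbol); // prints "f"
}
```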

Now, what I just told you contained a big lie: strs that you type in your program, like match my_string { "f" => {}, ... } aren't stored on the stack. The "f" is instead stored in a third region, a read-only data section of the program (often called .rodata). The contents of the .exe file have to be stored somewhere, and the operating system has to load the .exe into memory to start executing it. That is where the compiler puts the "f". This is expressed using 'static (the static lifetime, read "tick static"). A reference to an str stored this way is known as a &'static str: you have a pointer into the program's read-only data. You cannot modify it, but the memory location is valid as long as the program is running. The memory cannot be destructed; it is always valid and never changes (until the program ends). This is why it is safe to return a &'static str, but not a &str or &String that points to a local variable of the function. In the last stream, Rust protected you from this; C wouldn't.
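A minimal sketch of a function that returns a &'static str. The function name unit_name and the unit names are made up for illustration; the point is only that every returned value is a literal that lives in the program itself.

```rust
/// Returning string literals is fine: their bytes are part of the program
/// and stay valid for as long as it runs (hypothetical function).
fn unit_name(symbol: char) -> &'static str {
    match symbol {
        'f' => "Fahrenheit",
        'c' => "Celsius",
        _ => "unknown",
    }
}

fn main() {
    let name = unit_name('f');
    println!("{}", name); // prints "Fahrenheit"
}
```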

Bending the rules

Rust allows you to bend the rules a bit. To close this extremely long comment I'll hint you at what the problem in your last stream was where you couldn't do the match temperature_symbol:

If you compare the memory layout of &str and String, you'll notice that they look similar. A &str is [pointer, length], while a String is a pointer, a length and a capacity. So what Rust allows you to do is to pass a &String in places where a &str is required. This is not a special rule; these "magic" conversions can be implemented on other types, too, but explaining auto-deref (also called deref coercion) is a bit too much for now.
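For function arguments, the conversion looks like this sketch. The function name byte_length is made up for illustration; it only asks for a &str, yet a &String is accepted too.

```rust
/// Takes a &str; callers may pass a &String as well (hypothetical function).
fn byte_length(text: &str) -> usize {
    text.len()
}

fn main() {
    let owned = String::from("celsius");
    let literal = "celsius";

    println!("{}", byte_length(&owned)); // the &String is converted to a &str; prints 7
    println!("{}", byte_length(literal)); // already a &str; prints 7
}
```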

I am not going to give you the solution though, if you've read this carefully you should now know how to match on a String when Rust expects a &str. But understanding the difference between the stack and the heap is extremely important. You will not like Rust if you don't have this mental model of memory.

Have a great day,

Felix