cda-group / arc

Programming Language for Continuous Deep Analytics
https://cda-group.github.io/arc/
44 stars 6 forks source link

RFC: Decide how to (initially) manage data #293

Closed segeljakt closed 3 years ago

segeljakt commented 3 years ago

This issue is about deciding how to manage data which could either be stored in persistent storage or in-memory. This is a complicated issue since there is a huge number of solutions, each with their own tradeoffs in expressiveness and efficiency.

Motivation

The issue of data management arises when trying to support non-primitive data types, such as strings, structs, enums, dynamic arrays #290, lists, trees, etc. Rust's builtin memory-management, that relies on ownership (linear types), mutable/immutable references and lifetimes, does not align with our goals since it focuses on efficiency over usability, and therefore comes at a steep cost in complexity. To support any data type which cannot be copied/stored on the stack (anything beyond an integer or float), automatic data management is needed.

Problem

There are 3 types of data in our system. 1) Data which needs to be stored in persistent state (since the data belongs to a variable which is in the scope of a blocking operation, or is just too big to be stored in memory) 2) Data which can be stored in memory/cache (since it belongs to a temporary intermediate value, or variable which is not in the scope of a blocking operation). 3) Data which must be sent over network (since it is used as argument in an emit operation, or parameter in a receive operation)

In many cases the three kinds overlap:

task Foo(): ~i32 -> ~i32 {
    {
        val x = 0; // Type-2
    }
    loop {
        var y = 0; // Type-1, Type-3
        on z => { //  Type-3
            y += z;
            emit y
        }
    }
}

Heap vs Stack

In addition, data can either be stored on stack or heap. Scalars can be stored on stack while dynamically sized data types (vectors etc) must generally be stored on heap.

Mutability vs Immutability

Variables may be declared as mutable (var) or immutable (val). This mutability, known as exterior mutability, allows the variable to be re-assigned after initialisation. Safe Rust only permits exterior mutability, but Arc-Script's type system also allows interior mutability. Interior mutability means the data stored by a variable can be updated behind the scenes when performing an operation on it, even if the variable is declared as immutable. A problem we keep in mind when supporting interior mutability is that sharing is no longer equivalent to copying. For example:

val x = "  foo  ";
val y = x;
val z = x.trim_spaces();

If the semantics copy x when it is used, then y != z, however if the semantics use sharing, then y == x.

Solution

In the solution space there is a lot of optimisation potential. I think we should however focus on the easiest approach first, which in my view is to start with a solution that covers all 3 cases. This should be possible to support in arctime (through Approach 6 described in the next section). To handle Type-3 we need to implement conversions between sendable and non-sendable types.

Background

This section goes through the ideas of different approaches for how to manage memory for dynamic arrays in the language.

Approach 1 (Ownership)

The naive implementation is to have something like the following:

struct DynamicArray<T> {
    concrete: Vec<T>,
}

Here, DynamicArray<T> is an abstract data type implemented on top of a Vec<T>. The properties of DynamicArray<T> is it is ephemeral (only a single version exists of each instance), is mutable, and is not shareable (unless we support passing it by reference &T/&mut T, but this complexity is something we want to avoid in arc-script). This is the most efficient solution in many cases, but also restrictive.

Approach 2 (CoW)

#[derive(Clone)]
struct DynamicArray<T> {
    concrete: Vec<T>,
}

To enable immutable-sharing, one approach is to make the DynamicArray<T> Clone-able. This means the array would be cloned just before mutating it ("Clone on Write"). This is very inefficient, but it is an option.

Approach 3 (Reference counting)

#[derive(Clone)]
struct DynamicArray<T> {
    concrete: Rc<Vec<T>>,
}

An alternative is to reference count the vector. This means DynamicArray<T> can be shared immutably without having to clone the entire structure, but may no longer be mutated.

Approach 4 (Reference counting + checked interior mutability)

#[derive(Clone)]
struct DynamicArray<T> {
    concrete: Rc<RefCell<Vec<T>>>,
}

To enable mutation it is possible to wrap the Vec<T> inside a RefCell. This means we can share the RefCell and mutate whatever is inside it (while checking for ownership violations at runtime).

Approach 5 (Reference counting + unchecked interior mutability)

#[derive(Clone)]
struct DynamicArray<T> {
    concrete: Rc<UnsafeCell<Vec<T>>>,
}

Arc-Script does not have first-class references. Therefore there will never be a point where we are holding onto two dynamically borrowed references of the same cell at the same time. I do not think there is a point in spending cycles on ownership checks for this reason. UnsafeCell solves this problem by disabling the checks (using raw pointers instead).

Approach 6 (Intra-component sharing with reference counting + inter-component copying + unchecked interior mutability)

In the bigger picture, arc-script programs consist of multiple tasks that map 1-1 to Kompact components. Components are executed by threads and can send/receive messages to/from other components. How indirection is managed now also depends on whether we want to be able to share data between components or not. Reference counting makes it possible to share data within a component (intra-component sharing), but not between components (inter-component sharing). Therefore, when passing data between two components in Approach 4 we must always make a deep clone of it ("Clone on Send").

struct UnsafeRc<T>(Rc<T>);
unsafe impl<T> Send for UnsafeRc<T> {}
#[derive(Clone)]
struct DynamicArray<T> {
    concrete: UnsafeRc<UnsafeCell<Vec<T>>>,
}

One issue is reference counted values are not thread safe and thus we cannot store them inside components. We might be able to ensure that a components state is not accessed concurrently. If it is the case that Rc is not used by multiple threads we could force the Rust compiler to think that Rc is thread safe. (Not sure how safe this is). It is also possible to just do the unsafe impl Send directly on the component. The reason why a component is Send is so it can be sent safely between threads. (From a discussion with Max), Kompact uses work-stealing where worker threads in thread pool take (move) components and execute them. When a component is re-scheduled we must ensure that all references to reference counted values are stored inside the component only. This ensures no references "escape" to other components which would not be thread safe. These invariants must be enforced by our code generator.

Approach 7 (Intra- & inter-component sharing with atomic reference counting + checked interior mutability)

#[derive(Clone)]
struct DynamicArray<T> {
    concrete: Arc<Mutex<Vec<T>>>,
}

Another option is to use atomic reference counting. This allows a value to be shared by multiple components, as long as those components are executed by threads in the same process. If we are going to support interior mutability in this context we must also replace RefCell/UnsafeCell with Mutex. Atomic reference counting can however be in the ballpark of 100x slower than reference counting.

Approach 8 (Garbage collection + unchecked interior mutability)

The pros of reference counting are that it 1) Runs in the foreground and therefore knowing when destructors are run becomes predictable. 2) Frees memory as soon as possible.

The cons of reference counting are that it: 1) Cannot deal with cyclic references 2) Space overhead (Need to store a u64 for each object) 3) Speed overhead (Need to increment/decrement counters) 4) Requires atomicity in multithreaded environments 5) Can block a thread for a potentially long time when deallocating large recursive structures such as trees.

To subvert the cons of reference counting, an option we have is to design and implement our own garbage collector for streaming.

#[derive(Clone)]
struct DynamicArray<T> {
    concrete: Gc<UnsafeCell<Vec<T>>>,
}

There are in addition many different kinds of garbage collectors which already exist in Rust. Some focus on ease-of-use while others focus on performance. Our advantage of having a compiler means we could generate all the boilerplate code, such as for example passing around the garbage collector implicitly as an argument to functions. I think the most interesting thing about a custom garbage collector is we can tailor it to support streaming workloads, instead of a general-purpose solution.

Approach 9 (Arena allocation)

struct DynamicArray<T> {
    concrete: Vec<T, Arena>
}

Arena allocation is another approach which assigns each object to an arena (region, zone, etc.). The arena is a chunk of memory storing a collection of objects which can be deallocated en masse.

The pros of arena allocation are that 1) Deallocation is an instant operation since the arena never needs to be deallocated. 2) Allocation is fast since memory can be reused. 3) The lifetime of all objects in the arena is the same, which allows for cyclic structures within the arena.

The cons of arena allocation are that 1) Objects cannot be deallocated individually, which means memory can run out. 2) Some arenas only support homogeneously typed objects. 3) Destructors are not run unless explicitly asked.

Type Conversions

If we stick to the approach which allows sharing values within but not between tasks, then we must have two different representations of values: one which can be operated upon, and one which can be sent. Types can be categorised as either

Types are represented as either "Operable" or "Sendable":

use std::cell::RefCell;
use std::rc::Rc;

For structs, the two representations would look something like this:

struct SendableStruct {
    foo: i32,
    bar: String,
    baz: Box<SendableStruct>,
}

struct OperableStruct {
    foo: i32,
    bar: Rc<RefCell<String>>,
    baz: Rc<RefCell<OperableStruct>>,
}

And enums like this:

enum SendableEnum {
    Foo(i32),
    Bar(String),
    Baz(Box<SendableEnum>),
}

enum OperableEnum {
    Foo(i32),
    Bar(Rc<RefCell<String>>),
    Baz(Rc<RefCell<OperableEnum>>),
}

Conversions between OperableStruct and SendableStruct would look like this:

impl From<Rc<RefCell<OperableStruct>>> for SendableStruct {
    fn from(from: Rc<RefCell<OperableStruct>>) -> Self {
        let from = &*from.borrow();
        let into = Self {
            foo: from.foo,
            bar: from.bar.borrow().clone(),
            baz: Box::new(SendableStruct::from(from.baz.clone())),
        };
        into
    }
}

impl From<SendableStruct> for OperableStruct {
    fn from(from: SendableStruct) -> Self {
        Self {
            foo: from.foo,
            bar: Rc::new(RefCell::new(from.bar)),
            baz: Rc::new(RefCell::new(OperableStruct::from(*from.baz))),
        }
    }
}

And conversion between OperableEnum and SendableEnum like this:

impl From<Rc<RefCell<OperableEnum>>> for SendableEnum {
    fn from(from: Rc<RefCell<OperableEnum>>) -> Self {
        match &*from.borrow() {
            OperableEnum::Foo(x) => Self::Foo(*x),
            OperableEnum::Bar(x) => Self::Bar(x.borrow().clone()),
            OperableEnum::Baz(x) => Self::Baz(Box::new(SendableEnum::from(x.clone()))),
        }
    }
}

impl From<SendableEnum> for OperableEnum {
    fn from(from: SendableEnum) -> Self {
        match from {
            SendableEnum::Foo(x) => Self::Foo(x),
            SendableEnum::Bar(x) => Self::Bar(Rc::new(RefCell::new(x))),
            SendableEnum::Baz(x) => Self::Baz(Rc::new(RefCell::new(OperableEnum::from(*x)))),
        }
    }
}
fn main() {
    let x = Rc::new(RefCell::new(OperableEnum::Foo(todo!())));
    let y: SendableEnum = x.into();

    let x = Rc::new(RefCell::new(OperableStruct {
        foo: todo!(),
        bar: todo!(),
        baz: todo!(),
    }));
    let y: SendableStruct = x.into();
}

Arcorn could derive this information as long as it is given sufficient information about what is copyable/boxed/unboxed. For example:

#[arcorn::rewrite]
struct OperableStruct {
    #[primitive]
    foo: i32,
    #[abstract = String]
    bar: Rc<RefCell<String>>,
    #[algebraic = OperableStruct]
    baz: Rc<RefCell<OperableStruct>>,
}
#[arcorn::rewrite]
enum OperableEnum {
    #[primitive]
    Foo(i32),
    #[abstract = String]
    Bar(Rc<RefCell<String>>),
    #[algebraic = OperableEnum]
    Baz(Rc<RefCell<OperableEnum>>),
}

We technically only need to generate conversion-code for types which are sent between tasks. In other words, types of ports. In this case we need to keep track of what types are inside and not inside of ports. For simplicity maybe it is best to generate conversion code for all types.

Variable Representation

Decide on where to place each Rc<RefCell<...>>.

frej commented 3 years ago

Is there anything that forces us to select a single representation? Shouldn't we have enough information to know when an object in memory can be safely modified in place; when it is not yet reachable from other components; when we know that we're not going to modify it anymore so we can hand it off to another component?

segeljakt commented 3 years ago

Good point 👍 I think we could also try implement our own garbage collector Gc<T> or reference counter Rc<T>.

segeljakt commented 3 years ago

Added notes about conversions. Let me if you have any opinions or would like clarification

segeljakt commented 3 years ago

This is addressed by #299 so I will close this issue for now.