MasonProtter / SumTypes.jl

An implementation of sum types in Julia
MIT License
103 stars 8 forks source link

Noob question: performance tradeoffs vs. the MLStyle approach to sum types #29

Closed lukemerrick closed 1 year ago

lukemerrick commented 1 year ago

Hi there! I'm really impressed with the technical depth of this package, and I'm keen on trying to deepen my understanding of how it's designed and why.

My question

Assuming my understanding of the SumTypes and MLStyle approaches to implementing sum types is correct (see below), my general question is "what are the performance tradeoffs between these approaches?"

Sub questions:

My (possibly faulty) understanding of how we can implement sum types in Julia

As far as I can tell, there are two main options to offering an algebraic datatype experience in Julia.

First there's SumTypes, which like Unityper defines a single struct which can serve to store and access one of several different "conceptual" types.

E.g.

# Let's look under the hood of SumTypes.@sum_type.
macroexpand(
    @__MODULE__,
    :(
        SumTypes.@sum_type Example begin
            Stringy(value::String)
            Symbolic(name::Symbol)
        end
    )
)

# We get this.
($(Expr(:toplevel, quote
    #= /home/luke/.julia/packages/SumTypes/qpNRC/src/sum_type.jl:260 =#
    struct Example{var"#N#", var"#M#", var"#FT#"}
        bits::(Tuple{Vararg{T, N}} where {N, T}){var"#N#", UInt8}
        ptrs::(Tuple{Vararg{T, N}} where {N, T}){var"#M#", Any}
        var"#tag#"::var"#FT#"
        1 + 1
    end
...  # <truncated>

The rest of the generated code is pretty involved, so I don't follow completely, but my understanding is that we're doing some pretty heavy lifting to somewhat reimpliment structs at a pretty low level (including unsafe memory viewing?). My guess is that we do this for performance reasons, but I don't really understand where the performance gains come from.

On the other hand, MLStyle seems to just be providing syntactic sugar around what I'm assuming will eventually be equivalent to isa checks on multiple different concrete implementations of an abstract type. In other words, still using Julia's type system to define an abstract type and several concrete subtypes (one for each of the possible types of the sum type).

# Let's look under the hood of MLStyle.@data.
macroexpand(
    @__MODULE__,
    :(
        MLStyle.@data Example begin 
            Stringy(value::String)
            Symbolic(name::Symbol)
        end
    )
)

quote
    abstract type Example end
    struct Stringy{} <: Example
        #= REPL[7]:5 =#
        value::String
    end
    nothing
    begin
        #= /home/luke/.julia/packages/MLStyle/SLOsr/src/Record.jl:95 =#
        #= REPL[7]:5 =#
        #= /home/luke/.julia/packages/MLStyle/SLOsr/src/Record.jl:96 =#
        function (MLStyle).pattern_uncall(t::Type{<:Stringy}, self::Function, type_params::Any, type_args::Any, args::Any)
            #= /home/luke/.julia/packages/MLStyle/SLOsr/src/Record.jl:96 =#
            #= /home/luke/.julia/packages/MLStyle/SLOsr/src/Record.jl:103 =#
            #= REPL[7]:5 =#
            #= /home/luke/.julia/packages/MLStyle/SLOsr/src/Record.jl:104 =#
            (MLStyle.Record._compile_record_pattern)(t, self, type_params, type_args, args)
        end
    end
    struct Symbolic{} <: Example
        #= REPL[7]:6 =#
        name::Symbol
    end
    nothing
    begin
        #= /home/luke/.julia/packages/MLStyle/SLOsr/src/Record.jl:95 =#
        #= REPL[7]:6 =#
        #= /home/luke/.julia/packages/MLStyle/SLOsr/src/Record.jl:96 =#
        function (MLStyle).pattern_uncall(t::Type{<:Symbolic}, self::Function, type_params::Any, type_args::Any, args::Any)
            #= /home/luke/.julia/packages/MLStyle/SLOsr/src/Record.jl:96 =#
            #= /home/luke/.julia/packages/MLStyle/SLOsr/src/Record.jl:103 =#
            #= REPL[7]:6 =#
            #= /home/luke/.julia/packages/MLStyle/SLOsr/src/Record.jl:104 =#
            (MLStyle.Record._compile_record_pattern)(t, self, type_params, type_args, args)
        end
    end
end

This seems like it offers fewer "sharp edges" with respect to matching the standard mental model of types in Julia, but I'm guessing it is trading off performance somewhere, and I'm hoping to better understand where!

MasonProtter commented 1 year ago

Hi @lukemerrick, sorry for the slow response. Yes, I think your summary / understanding is more or less accurate, though I don't really count MLStyle.jl's @adt types as real sum types. As you point out, they're actually just syntactic sugar for subtyping and union splitting.

Regarding your sub-questions

  • Are the structs defined by SumTypes fixed size no matter which conceptual type they represent, or does their generic typing afford flexibility there?
    • If so, does that trade off potentially higher memory usage?
    • If so, does it enable performance wins for things like an Array, since indexing is simpler?
    • If not, does that trade off performance for things like Arrays?

SumTypes storage size is fixed for every combination of parameters. So if you have Either{Int, Int}, that'll have one fixed size, but Either{Int, NTuple{100, Int}} will have a different size. We're mostly getting the best of both worlds here. Unlike something like MLStyle, a SumType has a fixed layout so it can be allocated inline in an array, and only take up however much space the largest variant needs (plus space for the tag), this causes very big performance wins in arrays.

  • What kinds of use cases would one of the two approaches offer substantially better performance than the other, and in which kind of use cases would the performance difference be small?

In many circumstances they should be rather similar, but any time you put a SumType in a container, and read from that container, I'd generally expect SumTypes to be at a significant advantage.

  • In general (beyond the world of sum types), how bad is it to use an abstract type vs. a union on concrete types vs. a single concrete type for things like Arrays?

Hard to say in general. It's usually rather slow to have boxed abstract storage in an array, but sometimes you don't care about that overhead.