JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Add a concept of memory-backed contiguous collection in Base #54581

Open jakobnissen opened 4 months ago

jakobnissen commented 4 months ago

I propose the creation of a new concrete (parametric) type in Base that corresponds to memory-backed contiguous arrays, and to implement Base methods that specialize on these properties in terms of this type.

Edit: I've made a proof-of-concept package here: https://github.com/BioJulia/MemoryViews.jl

Motivation

In various places in Base and elsewhere, Julia has methods that can operate on any contiguous memory region. Let's call this kind of data a dense array. Examples include:

This is currently handled inconsistently in Base:

There are several issues with this handling:

First, all the approaches above fail to cover some important types. I've attempted to address this ad hoc in #47693, #53178 and #54579. However, this ad-hoc patching is deeply unsatisfying, and I'm certain I have missed several types. In fact, I'm starting to think it's literally impossible to cover all permutations of views, reinterpreted arrays, codeunits, memory and vectors using a Union. The practical implication is that many function calls needlessly fail to hit the fast path, while at the same time the code is harder to reason about, because these ad-hoc implementations use complex big unions in their signatures and are inconsistent with each other.

The fact that this code is duplicated thrice in Base, and that all three places lacked important methods, suggests to me that there is a need for a better (internal) API for writing methods for dense arrays. That was also my experience when making those three PRs: "Surely, there must be a better way to do this". The main usability issue is that, in lieu of any API covering dense arrays, it's up to the implementer of every method to make sure they've covered non-obvious types like CodeUnits{UInt8, SubString{String}} and SubArray{UInt8, 1, Vector{UInt8}, Tuple{Base.OneTo{Int64}}, true}. Unsurprisingly, people don't do this correctly.

There are also some other, minor issues with the existing approaches: Namely, it causes unnecessary specialization, as there is no reason to compile two methods for e.g. String and Vector{UInt8} if they each just operate on a slice of bytes in memory. Also, it's more difficult to introspect and reason about methods whose signature includes a "big union".
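To make the coverage problem concrete, here is an illustrative sampler (the values and variable names are mine, not from the proposal) of contiguous byte collections that a Union-based fast path would have to enumerate one by one:

```julia
# Illustrative only: a few contiguous byte-backed values that share no
# convenient supertype, so a Union-based fast path must list them all.
s  = SubString("hello, world", 1:5)
cu = codeunits(s)                        # CodeUnits{UInt8, SubString{String}}
sv = view(Vector{UInt8}("hello"), 1:3)   # SubArray over a Vector{UInt8}
ra = reinterpret(UInt8, Int32[1, 2])     # ReinterpretArray over Vector{Int32}

# All are contiguous AbstractVector{UInt8}s, yet none is a Vector{UInt8}:
@assert all(x -> x isa AbstractVector{UInt8}, (cu, sv, ra))
@assert !any(x -> x isa Vector{UInt8}, (cu, sv, ra))
```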

Why not use DenseArray?

Because it doesn't cover the correct types:

Also, reinterpret may create dense arrays that are not DenseArrays - however, my proposed implementation doesn't handle this, either.
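A quick check of the reinterpret caveat above, using only Base (the variable name is mine):

```julia
# reinterpret of a Vector yields a contiguous, memory-backed array
# that nevertheless is not a DenseArray:
ra = reinterpret(UInt8, [1.0, 2.0])
@assert ra isa AbstractVector{UInt8}
@assert !(ra isa DenseArray)
```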

Proposal

I have cooked up a proof-of-concept package here: https://github.com/BioJulia/MemoryViews.jl; see it for more details.

I propose creating a new, internal Base type with the following layout:

# M is enforced to be :immutable or :mutable
struct MemView{T, M} <: DenseVector{T}
    ref::MemoryRef{T}
    len::Int
end

# Aliases for the two flavours, used in the constructors below
const MutableMemView{T} = MemView{T, :mutable}
const ImmutableMemView{T} = MemView{T, :immutable}

These types can be constructed from the various contiguous, dense arrays:

MemView(A::Union{Array{T}, Memory{T}}) where {T} = MutableMemView{T}(A.ref, length(A))

function MemView(s::String)
    ImmutableMemView{UInt8}(MemoryRef(unsafe_wrap(Memory{UInt8}, s)), ncodeunits(s))
end

MemView(s::SubString) = MemView(parent(s))[only(parentindices(s))]

[etc...]

Optimised methods can be implemented in terms of MemView, when it's not possible to write optimised methods generic to AbstractArray, and where Union{Vector, Memory} etc. is too restrictive:

function some_function_working_on_memory(mem::MemView{<:Union{UInt8, Int8}})
    # the optimised function here, maybe calling ccall or intrinsics
end

This would only need to be compiled once, and it is conceptually easier to understand than working with a huge union.

Further, my proposal implements a trait tentatively called MemKind, which is IsMemory if the type is semantically equivalent to its MemView. For example, codeunits are semantically equivalent to a MemView (both being dense arrays of bytes), whereas strings have a MemView method but are not semantically equivalent to an array of bytes. The purpose of MemKind is that, for any type implementing it, methods can immediately convert values to a MemView and then pass them on to the low-level implementation function.

The surface level API would be along the lines below, using a simple hypothetical find_first_zero function.

find_first_zero(x) = find_first_zero(MemKind(x), x) # dispatch using MemKind trait

function find_first_zero(::MemKind, x) # generic fallback
    for (k,v) in pairs(x)
        iszero(v) && return k
    end
    nothing
end

find_first_zero(::IsMemory{<:MemView{<:Union{UInt8, Int8}}}, x) = find_first_zero(MemView(x))
function find_first_zero(mem::MemView{UInt8})
    # call memchr here
end
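For reference, the elided memchr call could look roughly like the sketch below. It is written against Vector{UInt8} so it runs with plain Base, and the name find_first_zero_memchr is mine; in the proposal the argument would be a MemView{UInt8} instead.

```julia
# Hedged sketch of the memchr fast path (not the actual proposed code).
function find_first_zero_memchr(mem::Vector{UInt8})
    GC.@preserve mem begin
        p = pointer(mem)
        q = ccall(:memchr, Ptr{UInt8}, (Ptr{UInt8}, Cint, Csize_t),
                  p, 0, length(mem))
        # memchr returns NULL if no zero byte is found; otherwise
        # translate the pointer offset back to a 1-based index.
        q == C_NULL ? nothing : Int(q - p) + 1
    end
end

@assert find_first_zero_memchr(UInt8[1, 2, 0, 4]) == 3
@assert find_first_zero_memchr(UInt8[1, 2, 3]) === nothing
```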

This proposal has two important features:

  1. At a high level, dispatching on the MemKind trait is more accurate at selecting dense arrays than dispatching on big unions. It both correctly includes complex nested types such as views of codeunits of substrings, while also rejecting incorrect types, such as an incorrectly implemented DenseArray.
  2. At a low level, it is easier to reason about the behaviour (and safety) of low-level methods operating on memory, if it's implemented in terms of a single, concrete MemView type. It's also better for compile times and sysimg size if the methods for many different types all dispatch to the same single implementation.

Alternatives

  1. Instead of creating concrete types, the different parts in Base could more consistently use Base._checkcontiguous and isbitstype, perhaps extracted into a function is_memory_backed_contiguous. Base could then be reviewed for places where memory-backed contiguous types are used, and methods could be added to "funnel" all the compatible types into these methods. Importantly, any alternative approach should include an internal API, such that it makes it easy to write a method that will correctly dispatch if, and only if, the argument is a dense array - which is not the status quo.

  2. MemView could, instead of being backed by Memory, be backed by a simple pointer. In that case, MemView would be much like Random.UnsafeView. The novelty of this proposal would be in the MemKind trait, mostly. See more here.

oscardssmith commented 4 months ago

To me, a type seems like the wrong option, since there can be cases where a type meets or doesn't meet the requirements based on its parameters. As such, I think the alternative solution is probably the better answer.

jakobnissen commented 4 months ago

Maybe I misunderstand, but you could very well have a method that converts MyType{Float32} to MemView, but not MyType{Int32}, or what have you. The advantages of a single type still remain: fewer code specializations, easier reasoning about the core implementation due to restricted types, and the "for free" genericness of the function (i.e. the user doesn't need to remember to cover views of codeunits of substrings of String in their signature).
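To illustrate with a runnable sketch (MemView here is a minimal stand-in for the proposed type, and MyType is hypothetical):

```julia
# Minimal stand-in for the proposed MemView; MyType is hypothetical.
struct MemView{T}
    v::Vector{T}
end

struct MyType{T}
    data::Vector{T}
end

# Opt in for Float32 only; MyType{Int32} simply has no MemView method
# and therefore never dispatches to the fast path.
MemView(x::MyType{Float32}) = MemView{Float32}(x.data)

@assert MemView(MyType{Float32}(Float32[1, 2])) isa MemView{Float32}
@assert !hasmethod(MemView, Tuple{MyType{Int32}})
```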

Seelengrab commented 4 months ago

I think the idea/concept of a memory-backed, contiguous collection type is good in principle, but I'd rather see it implemented through a trait (or, if we ever get it, multiple abstract subtyping) than a bespoke struct that is converted to. My reasoning for this is that I don't think it's good to convert to/from such a type, effectively erasing the type information & invariants of the existing instances in that collection. In the proposed design, nothing stops me from converting e.g. a Vector{UInt64} to a MemoryView{UInt8} and then converting that to something entirely different - surely that's not intended, since the MemoryView is only supposed to be a (temporarily) different view of the existing valid data?

ericphanson commented 4 months ago

I like the proposed approach. I think N “convert generic object to a concrete tightly constrained one” methods followed by M “operate on concrete object” specialized implementations is a better pattern than M functions to “operate on generic objects while making sure to specially handle any of N cases”. There is some sense we are getting the “intermediate abstraction” thing of N*M -> N+M.

Additionally, it orthogonalizes concerns, which makes each function easier to test and to ensure the correctness of.

jakobnissen commented 4 months ago

Thanks for your feedback. I've made a proof of concept package here: https://github.com/jakobnissen/MemViews.jl. I had to work a little bit back and forth with it to get an API I liked using, but I must say I like how it turned out.

@Seelengrab to respond to your concerns:

I'd rather see it implemented through a trait

A major part of my motivation is to consolidate the concrete representations of multiple types, to make things easier to reason about, which is important when you want to do low-level stuff, including working with pointers or calling into C. This is not practically possible using an abstract interface, because in these circumstances, you do really need to know the concrete memory layout of your types.

My reasoning for this is that I don't think it's good to convert to/from such a type, effectively erasing the type information & invariants of the existing instances in that collection.

The idea is that the author of a new implementation explicitly calls MemView(x) if x can be operated on in terms of memory. It's completely opt-in: your type won't unintentionally dispatch to a mem view method just because you implement MemView(x::MyType). That is, if your type has invariants which mean it is not valid to represent it as memory, don't implement it.

In my proof-of-concept package, you can also opt in with the trait MemKind if you do want methods to be able to treat your type as semantically equivalent to a memory view. This is conceptually similar to implementing AsRef<[T]> in Rust. Types such as strings, which can be represented by memory views but are not semantically equivalent to their own memory, would not implement MemKind, but would implement the MemView constructor.
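The two opt-in levels might be sketched like this (simplified stand-ins written by me; in the actual package the trait is parameterized, e.g. IsMemory{T}):

```julia
# Simplified stand-ins for the proposed trait; the real design differs.
abstract type MemKind end
struct IsMemory <: MemKind end
struct NotMemory <: MemKind end

MemKind(::Type) = NotMemory()          # default: not semantically memory

struct MyBytes                         # hypothetical codeunits-like type:
    data::Vector{UInt8}                # it IS semantically its bytes
end
MemKind(::Type{MyBytes}) = IsMemory()

# A string-like type would get a MemView constructor but keep NotMemory,
# so generic code never silently treats it as a byte array.
@assert MemKind(MyBytes) isa IsMemory
@assert MemKind(String) isa NotMemory
```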

In the proposed design, nothing stops me from converting e.g. a Vector{UInt64} to a MemoryView{UInt8} and then converting that to something entirely different

You can't convert between element types (e.g. Vector{A} to MemView{B} unless A === B). This is not easy to do.

Seelengrab commented 4 months ago

You can't convert between element types (e.g. Vector{A} to MemView{B} unless A === B). This is not easy to do.

I'm not sure I understand the example in your OP with String then - how else other than with unsafe_wrap are you supposed to get that changed eltype? If you don't need to/want to change the element type, surely the existing ccall/convert infrastructure can do that conversion already. For the Julia-side of this, I'd again prefer the traits approach over forcing the use of unsafe_* to create differently typed views into the underlying memory, complicating lifetime analysis immensely. Concretely, I think the comment from your Proof-Of-Concept:

https://github.com/jakobnissen/MemViews.jl/blob/0fbcf578a85a9f370d8a2fdc64a46fedacea83be/src/construction.jl#L10-L12

must be answered with "yes", because the string is not necessarily rooted for the GC for the entire duration the MemoryView exists. I don't see how this can be done generically for other types either, so I don't see how MemoryView can be done safely at all.

nhz2 commented 4 months ago

Why is the use of unsafe_wrap in the following safe from GC? https://github.com/JuliaLang/julia/blob/4b8fde32b428b467933986989f70f04537eeb438/base/iobuffer.jl#L44 It seems very similar to https://github.com/jakobnissen/MemViews.jl/blob/0fbcf578a85a9f370d8a2fdc64a46fedacea83be/src/construction.jl#L10-L12

vtjnash commented 4 months ago

Memory can hold an internal reference to certain kinds of object, such as String

Seelengrab commented 4 months ago

Memory can hold an internal reference to certain kinds of object, such as String

So if I'm understanding correctly, this only works/is safe because it's special-cased for String? In other words, for other objects this is properly unsafe without keeping an actual reference to the object alive too?