To me, a type seems like the wrong option, since there can be cases where a type meets or doesn't meet the requirements based on its parameters. As such, I think the alternative solution is probably the better answer.
Maybe I misunderstand, but you could very well have a method that converts MyType{Float32} to MemoryView, but not MyType{Int32}, or what have you.
The advantages of a single type still remain: fewer code specializations, easier reasoning about the core implementation due to restricted types, and the "for free" genericness of the function (i.e. the user doesn't need to remember to cover views of codeunits of substrings of String in their signature).
I think the idea/concept of a memory-backed, contiguous collection type is good in principle, but I'd rather see it implemented through a trait (or, if we ever get it, multiple abstract subtyping) than a bespoke struct that is converted to. My reasoning for this is that I don't think it's good to convert to/from such a type, effectively erasing the type information & invariants of the existing instances in that collection. In the proposed design, nothing stops me from converting e.g. a Vector{UInt64} to a MemoryView{UInt8} and then converting that to something entirely different - surely that's not intended, since the MemoryView is only supposed to be a (temporarily) different view of the existing valid data?
I like the proposed approach. I think N "convert a generic object to a concrete, tightly constrained one" methods, followed by M "operate on the concrete object" specialized implementations, is a better pattern than M functions that "operate on generic objects while making sure to specially handle any of N cases". In some sense we get the "intermediate abstraction" benefit of turning N*M into N+M.
Additionally, it orthogonalizes concerns, which makes each function easier to test and to verify for correctness.
Thanks for your feedback. I've made a proof-of-concept package here: https://github.com/jakobnissen/MemViews.jl. I had to go back and forth with it a bit to get an API I liked using, but I must say I like how it turned out.
@Seelengrab to respond to your concerns:
I'd rather see it implemented through a trait
A major part of my motivation is to consolidate the concrete representations of multiple types, to make things easier to reason about, which is important when you want to do low-level stuff, including working with pointers or calling into C. This is not practically possible using an abstract interface, because in these circumstances, you do really need to know the concrete memory layout of your types.
My reasoning for this is that I don't think it's good to convert to/from such a type, effectively erasing the type information & invariants of the existing instances in that collection.
The idea is that the author of a new implementation explicitly calls MemView(x), if x can be operated on in terms of memory. It's completely opt-in; you won't have your type unintentionally dispatch to a mem view method just because you implement MemView(x::MyType). That is, if your type has invariants which mean it is not valid to represent it as memory, don't.
In my proof-of-concept package, you can also opt in with the trait MemKind if you do want methods to be able to treat your type as semantically equivalent to a memory view. This is conceptually similar to implementing AsRef<[T]> in Rust. Types such as strings, which can be represented by memory views but which are not semantically equivalent to their own memory, would not implement MemKind, but would implement the MemView constructor.
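To illustrate the two levels of opt-in, here is a sketch only: MemView, MemKind and IsMemory follow the naming in the proof-of-concept package, while MyBytes and the exact signatures are hypothetical.

```julia
# Hypothetical type that is semantically "just a chunk of bytes"
# (assumes the MemView/MemKind/IsMemory names from the linked package):
struct MyBytes
    mem::Memory{UInt8}
end

# Level 1: opt in to being *viewable* as memory.
MemView(x::MyBytes) = MemView(x.mem)

# Level 2: additionally declare that MyBytes is semantically equivalent to
# its memory view, so generic methods may treat it as such.
MemKind(::Type{MyBytes}) = IsMemory()

# String would implement only level 1: it can be viewed as bytes, but is not
# semantically a vector of bytes, so it does not opt in to MemKind.
```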
In the proposed design, nothing stops me from converting e.g. a Vector{UInt64} to a MemoryView{UInt8} and then converting that to something entirely different
You can't convert between element types (e.g. Vector{A} to MemView{B} unless A === B). This is not easy to do.
You can't convert between element types (e.g. Vector{A} to MemView{B} unless A === B). This is not easy to do.
I'm not sure I understand the example in your OP with String then - how else, other than with unsafe_wrap, are you supposed to get that changed eltype? If you don't need or want to change the element type, surely the existing ccall/convert infrastructure can do that conversion already. For the Julia side of this, I'd again prefer the traits approach over forcing the use of unsafe_* to create differently typed views into the underlying memory, complicating lifetime analysis immensely. Concretely, I think the comment from your proof of concept:

must be answered with "yes", because the string is not necessarily GC-rooted for the entire duration the MemoryView exists. I don't see how this can be done generically for other types either, so I don't see how MemoryView can be done safely at all.
Why is the use of unsafe_wrap in the following safe from GC? https://github.com/JuliaLang/julia/blob/4b8fde32b428b467933986989f70f04537eeb438/base/iobuffer.jl#L44
It seems very similar to https://github.com/jakobnissen/MemViews.jl/blob/0fbcf578a85a9f370d8a2fdc64a46fedacea83be/src/construction.jl#L10-L12
Memory can hold an internal reference to certain kinds of object, such as String
Memory can hold an internal reference to certain kinds of object, such as String
So if I'm understanding correctly, this only works/is safe because it's special-cased for String? In other words, for other objects this is properly unsafe without keeping an actual reference to the object alive too?
I propose the creation of a new concrete (parametric) type in Base that corresponds to memory-backed contiguous arrays, and to implement Base methods that specialize on these properties in terms of this type.
Edit: I've made a proof-of-concept package here: https://github.com/BioJulia/MemoryViews.jl
Motivation
In various places in Base and elsewhere, Julia has methods that can operate on any contiguous memory region. Let's call this type of data a dense array. Examples include:
- memchr, as in string/search.jl
- read!(::IO, A::AbstractArray), which calls unsafe_write.

This is currently handled inconsistently in Base:
- string/search.jl uses the internal ByteArray union type to cover dense arrays of bytes: Union{Array{UInt8},FastContiguousSubArray{UInt8,N,<:Array{UInt8}} where N}
- read dispatches on AbstractArray, and checks isbitstype(eltype(A)) && _checkcontiguous(Bool, A).

There are several issues with this handling:
First, all the approaches above fail to cover some important types. I've attempted to address this ad hoc in #47693, #53178 and #54579. However, this ad hoc patching is deeply unsatisfying, and I'm certain I have missed several types. In fact, I'm starting to think it's literally impossible to cover all permutations of views, reinterpret arrays, codeunits, memory and vectors using a Union. The practical implication is that lots of function calls needlessly fail to hit the fast path, while at the same time the code is harder to reason about, because these ad hoc implementations use complex big unions in their signatures and are inconsistent with each other.

The fact that this code is duplicated thrice in Base, and in all three places was lacking important methods, suggests to me that there is a need for a better (internal) API to write methods for dense arrays. That was also my experience when making those three PRs: "Surely, there must be a better way to do this". The main usability issue is that, in lieu of any API to cover dense arrays, it's up to the implementer of every method to make sure they've covered non-obvious types like CodeUnits{UInt8, SubString{String}} and SubArray{UInt8, 1, Vector{UInt8}, Tuple{Base.OneTo{Int64}}, true}. Unsurprisingly, people don't do this correctly.

There are also some other, minor issues with the existing approaches: namely, it causes unnecessary specialization, as there is no reason to compile two methods for e.g. String and Vector{UInt8} if they each just operate on a slice of bytes in memory. Also, it's more difficult to introspect and reason about methods whose signature includes a "big union".

Why not use DenseArray?
Because it doesn't cover the correct types:
- It does not cover SubArray, despite many subarrays being dense vectors.
- It is questionable that CodeUnits <: DenseVector (#53996). If this is fixed, then e.g. CodeUnits{T, String} will also not be covered by DenseArray.
- Package authors can make their types <: DenseArray without implementing methods like pointer, or indeed without even realizing that subtyping DenseArray requires that their new type must be memory-backed and contiguous.

Also, reinterpret may create dense arrays that are not DenseArrays - however, my proposed implementation doesn't handle this, either.

Proposal
I have cooked up a proof-of-concept package here: https://github.com/BioJulia/MemoryViews.jl, which you can look at for more details.
I propose creating a new, internal Base type with the following layout:
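The code block showing the layout is not reproduced here; as a rough sketch of what is meant (field names and the exact supertype are illustrative, loosely based on the linked package):

```julia
# Sketch of the proposed layout (illustrative, not the exact definition):
struct MemView{T} <: DenseVector{T}
    ref::MemoryRef{T}  # reference into a Memory{T}; keeps the memory alive
    len::Int           # number of elements viewed
end
# (A second type parameter could distinguish mutable and immutable views.)
```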
These types can be constructed from the various contiguous, dense arrays:
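For instance (illustrative calls, assuming a MemView constructor like the one in the linked package):

```julia
MemView([0x01, 0x02, 0x03])                # Vector{UInt8}
MemView(codeunits("hello"))                # CodeUnits{UInt8, String}
MemView(view([0x01, 0x02, 0x03], 1:2))     # contiguous SubArray
MemView(codeunits(SubString("hello", 2, 4)))  # codeunits of a SubString
```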
Optimised methods can be implemented in terms of MemView when it's not possible to write optimised methods generic to AbstractArray, and where Union{Vector, Memory} etc. is too restrictive. Such a method would only need to be compiled once, and is conceptually easier to understand than working with a huge union.
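As a sketch of what such a single low-level method could look like (the function name and body here are illustrative, not taken from the proposal):

```julia
# One low-level implementation covering every memory-backed byte container:
function _find_first_zero(mem::MemView{UInt8})
    for i in eachindex(mem)
        iszero(mem[i]) && return i
    end
    return nothing
end
```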
Further, my proposal implements a trait tentatively called MemKind, which is IsMemory if the type is semantically equivalent to its MemView. For example, codeunits are semantically equivalent to MemView (both being dense arrays of bytes), whereas strings have a MemView method but are not semantically equivalent to an array of bytes. The purpose of MemKind is that, for any type implementing it, methods can immediately convert the argument to a MemView and then pass it on to the low-level implementation function.

The surface-level API would be along the lines below, using a simple, hypothetical find_first_zero function:
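A sketch of what that could look like (the MemKind, IsMemory and MemView names follow the proposal; the exact signatures and method bodies are illustrative, and _find_first_zero is the low-level method sketched above):

```julia
# Types whose MemKind is IsMemory are converted to a MemView and funneled
# into the single low-level implementation:
find_first_zero(x) = find_first_zero(MemKind(typeof(x)), x)
find_first_zero(::IsMemory, x) = _find_first_zero(MemView(x))

# A type like String is not semantically "just memory", so it gets its own
# method, which still constructs a MemView explicitly:
find_first_zero(s::String) = _find_first_zero(MemView(s))
```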
This proposal has two important features:
- The MemKind trait is more accurate at selecting dense arrays than dispatching on big unions. It correctly includes complex nested types such as views of codeunits of substrings, while also rejecting incorrect types, such as an incorrectly implemented DenseArray.
- Fewer specializations are needed, since many different argument types funnel into the same MemView type. It's also better for compile times and sysimg size if the methods for many different types all dispatch to the same single implementation.
Alternatives
Instead of creating concrete types, the different parts of Base could more consistently use Base._checkcontiguous and isbitstype, perhaps extracted into a function is_memory_backed_contiguous. Base could then be reviewed for places where memory-backed contiguous types are used, and methods could be added to "funnel" all the compatible types into these methods. Importantly, any alternative approach should include an internal API, such that it makes it easy to write a method that will correctly dispatch if, and only if, the argument is a dense array - which is not the status quo.

MemView could, instead of being backed by Memory, be backed by a simple pointer. In that case, MemView would be much like Random.UnsafeView. The novelty of this proposal would be in the MemKind trait, mostly. See more here.
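For comparison, a sketch of what the pointer-backed alternative could look like (hypothetical; UnsafeMemView is not part of the proposal or the package):

```julia
# Like Random.UnsafeView, this carries no reference to the parent object,
# so the caller must keep that object alive (e.g. with GC.@preserve)
# while the view is in use.
struct UnsafeMemView{T} <: AbstractVector{T}
    ptr::Ptr{T}
    len::Int
end

Base.size(v::UnsafeMemView) = (v.len,)
Base.getindex(v::UnsafeMemView, i::Int) = unsafe_load(v.ptr, i)

data = [0x01, 0x00, 0x03]
GC.@preserve data begin
    v = UnsafeMemView{UInt8}(pointer(data), length(data))
    @assert v[2] == 0x00
end
```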