Closed ericphanson closed 3 years ago
(The workaround is to just call Arrow.ArrowTypes.registertype!(UUID, UUID) before deserializing. But I think the hidden statefulness is still confusing / problematic).
The UUID case is now fixed (defined by default in Arrow) and we've updated the docs to mention the need to call registertype!. I'm considering some larger changes to type serializing and such, so this might be something we make easier with that.
@quinnj what's the motivation behind https://github.com/JuliaData/Arrow.jl/blob/3ab2b18829c1656198a85759360389b6bbb22ab3/src/arraytypes/struct.jl#L86? Is it just to give the "convenience behavior" listed in the OP or is there a deeper reason? If it's just the former, I wonder if it's better just to remove it...I ran into another related issue just now.
I'm essentially implementing the following (which is also why I needed https://github.com/JuliaData/Arrow.jl/pull/150):
struct Foo ... end
struct _FooArrow ... end
Foo(::_FooArrow) = ...
Arrow.ArrowTypes.registertype!(Foo, _FooArrow)
Arrow.ArrowTypes.arrowconvert(::Type{_FooArrow}, f::Foo) = ...
the above in theory would allow me to have full control over Arrow <-> Julia conversion for my Foo type.
The problem is that Arrow.jl is automatically calling ArrowTypes.registertype!(_FooArrow, _FooArrow) on write even though I don't want it to :( As a caller I can't really think of a scenario where I would want auto-registration, but I could be missing something.
Even if we can't get rid of it in general, would it be possible to gate this behavior behind a flag passed to Arrow.write (autoregister=true)?
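To make the opt-in idea concrete, here is a minimal, self-contained sketch. It does not use Arrow.jl at all: `REGISTRY`, `register!`, `write_values`, and the `autoregister` keyword are hypothetical stand-ins, and the default of `false` is an assumption chosen to show the opt-in behavior.

```julia
# Hypothetical stand-in for a global type registry; this is NOT Arrow.jl's internals.
const REGISTRY = Dict{String,DataType}()

# Explicit registration: the caller decides which types are known.
register!(::Type{T}) where {T} = (REGISTRY[string(nameof(T))] = T; nothing)

# Hypothetical writer: it touches the registry only when asked to.
function write_values(values::Vector{T}; autoregister::Bool=false) where {T}
    autoregister && register!(T)
    # "Serialize" each value as a NamedTuple of its fields.
    fnames = fieldnames(T)
    return [NamedTuple{fnames}(map(n -> getfield(v, n), fnames)) for v in values]
end

struct Point
    x::Int
    y::Int
end

# Writing without the flag leaves the global registry untouched.
rows = write_values([Point(1, 2), Point(3, 4)])
registered_by_default = haskey(REGISTRY, "Point")

# Writing with the flag registers the element type as a side effect.
write_values([Point(5, 6)]; autoregister=true)
registered_on_request = haskey(REGISTRY, "Point")
```

With a flag like this, the hidden global mutation becomes something the caller asks for rather than something that happens implicitly on every write.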
Just to add another reason in favor of removing it, mutating the global registration dict at write-time seems like it could be an issue for concurrent writing from different threads (ref https://github.com/JuliaData/Arrow.jl/issues/90#issuecomment-797516022 for other thread safety issues). With manual registration, by contrast, the user could be sure to always register outside the threaded region of code.
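A rough sketch of the thread-safety point, again without Arrow.jl: `TYPE_REGISTRY`, `TYPE_REGISTRY_LOCK`, and `safe_register!` are hypothetical names. It shows two mitigations side by side: guarding mutation with a lock, and registering once before the threaded region so that threads only read.

```julia
# Hypothetical registry, not Arrow.jl's actual data structure.
const TYPE_REGISTRY = Dict{Symbol,DataType}()
const TYPE_REGISTRY_LOCK = ReentrantLock()

# Option 1: guard every mutation with a lock so concurrent writers cannot corrupt the Dict.
function safe_register!(name::Symbol, ::Type{T}) where {T}
    lock(TYPE_REGISTRY_LOCK) do
        TYPE_REGISTRY[name] = T
    end
    return nothing
end

struct Sample
    id::Int
end

# Option 2: register once, up front, then only read inside the threaded region.
safe_register!(:Sample, Sample)

results = Vector{Bool}(undef, 8)
Threads.@threads for i in 1:8
    # Read-only lookups; no mutation happens while threads are running.
    results[i] = TYPE_REGISTRY[:Sample] === Sample
end
```

Registering up front keeps the hot path lock-free, which is only possible when the library does not register types behind the caller's back at write time.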
Yeah, these are good points for removing the auto registering. The main reason for having it was convenience.
@jrevels, can you explain your use-case/example a bit more? What I don't quite follow is how _FooArrow will be supported? The 2nd argument to registertype! should be a native arrow type that your custom type converts to.
Hold up, don't mind me. I'm digging back through all the code and in the structs.jl file we know how to serialize a _FooArrow, so yeah, I think I understand the example better now.
Wait, backsies again. So the problem with not autoregistering is that without ArrowTypes.registertype!(_FooArrow, _FooArrow), we don't know how to deserialize the struct; it would just deserialize as a NamedTuple. Here's where my thinking is going, though I recognize the code itself doesn't currently reflect this vision:
- For StructTypes, we'd require users to call ArrowTypes.registertype! for custom types.
- ArrowTypes.registertype!(::Type{T}) where {T} = registertype!(T, T), which means the custom struct would be serialized as-is, and when deserializing, we'd just call (essentially) T(serialized_fields...).
- ArrowTypes.registertype!(T, @NamedTuple{field1::Int, field2::String}), where, e.g., you only want to serialize field1 and field2 of your custom type. This would then require a corresponding definition like: ArrowTypes.arrowconvert(::Type{@NamedTuple{field1::Int, field2::String}}, x::T) = (field1=x.field1, field2=x.field2), though I think we could provide some kind of auto-convert fallback, like: ArrowTypes.arrowconvert(::Type{T}, x) where {T <: NamedTuple} = (; nm=>getfield(x, nm) for nm in names(T))
- ArrowTypes.arrowconvert(::Type{T}, x::@NamedTuple{field1::Int, field2::String}) = T(x.field1, x.field2); this would allow "hooking" into deserialization, to fix cases like https://github.com/JuliaData/Arrow.jl/issues/135
> So the problem with not autoregistering, is that without ArrowTypes.registertype!(_FooArrow, _FooArrow), we don't know how to deserialize the struct, it would just deserialize as a NamedTuple.
Ah, but for me this is the desired behavior :) I want it to deserialize as NamedTuple unless I, the caller, tell it explicitly not to. Right now it feels like Arrow.jl is making the decision for me, and it's making the wrong one (AFAICT).
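The two positions can be illustrated with a small, self-contained sketch that does not touch Arrow.jl. `Record`, `RecordRow`, `to_row`, and `from_row` are hypothetical names. The generic fallback enumerates the NamedTuple's fields with `fieldnames`, and the third field of `Record` is an assumption added to show a field that is deliberately not serialized. The point is that the NamedTuple is a perfectly usable default, and rebuilding the custom type is a separate step the caller opts into.

```julia
struct Record
    field1::Int
    field2::String
    cache::Vector{Float64}   # not meant to be serialized
end

# The serialized shape: only field1 and field2.
const RecordRow = @NamedTuple{field1::Int, field2::String}

# Hypothetical generic fallback: copy the fields named by the NamedTuple type out of x.
function to_row(::Type{NT}, x) where {NT<:NamedTuple}
    return NT(map(nm -> getfield(x, nm), fieldnames(NT)))
end

# Hypothetical deserialization hook: rebuild the custom type from the row.
# Nothing forces the caller to define or use this; the row is usable on its own.
from_row(::Type{Record}, row::RecordRow) = Record(row.field1, row.field2, Float64[])

original = Record(1, "a", [0.5])
row = to_row(RecordRow, original)      # what a reader sees by default
restored = from_row(Record, row)       # only if the caller asks for it
```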
Reopening this issue as it seems like the discussion may lead to some action items :)
ref https://github.com/beacon-biosignals/Onda.jl/pull/68 for a motivating example.
my thoughts are very rough/not super well-considered yet, but off the top of my head, here are my big "wants" (some of these might already be possible w/ existing behavior):
- the registertype! mechanism to somehow be replaced/augmented by method dispatch (motivation here: https://github.com/beacon-biosignals/Onda.jl/pull/68/files#diff-9d1b70fd041b1dbbe08ff4096cf1c68daa131b7d249d2ba3101e9079e129f44cR505 ; if there's another way to do this that I'm not seeing with the current system, that'd be dope). The barriers here AFAICT are a) dynamic dispatch might be slower than the current Dict look up during deserialization, so we'd have to amortize it by doing it up front based on the present extension metadata (I think this should work?) and b) this mechanism would be less dynamic than registertype! currently is (I would be happy to make it less dynamic, but maybe there's a use case where the extra dynamism is useful?). If handling this starts to look like it requires @eval, we could consider a combined approach, e.g. keep the current *_MAPPING Dict but have it contain extension string => Julia function pairs, to hide Julia's "smart dispatch" behind a "dumb dispatch" tier. That way the mapping would be clear/resolvable without eval but callers could take advantage of Julia's dispatch for e.g. allowing full use of type parameters.
- separate functions for the two conversion directions: lower/raise, toarrow/fromarrow, etc. I wouldn't want to use a single function for both as, IME, doing so forces the caller to "do more than they intend to do" by overloading that function. Like, I want to be able to define a pre-serialization hook to go from A in Julia -> B in Arrow, without implying anything about the transformation behavior of A in Arrow -> B in Julia.
preserving some relevant convo from the Julia Slack (https://julialang.slack.com/archives/C674VR0HH/p1615846377461400?thread_ts=1615681758.430500&cid=C674VR0HH)
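A rough sketch of the combined approach and the split hooks described above. Everything here is hypothetical and independent of Arrow.jl: `Celsius`, `lower`, `raise`, `RAISE_MAPPING`, and `deserialize_column` are illustrative names only. The Dict maps an extension string to a plain Julia function (the "dumb" tier), that function is free to use dispatch and type parameters (the "smart" tier), and the lookup happens once per column rather than once per element.

```julia
# Everything here is hypothetical; none of these names are Arrow.jl API.
struct Celsius{T<:Real}
    degrees::T
end

# Serialization hook: Julia value -> plain value. Defining it implies nothing about the reverse.
lower(c::Celsius) = c.degrees

# Deserialization hook, defined separately; dispatch can use the type parameter.
raise(::Type{Celsius}, x::T) where {T<:Real} = Celsius{T}(x)

# "Dumb" tier: extension-name string => Julia function; no eval needed to resolve it.
const RAISE_MAPPING = Dict{String,Function}(
    "celsius" => x -> raise(Celsius, x),
)

# Resolve the function once per column from the metadata, then apply it to every element.
function deserialize_column(extension::String, column::AbstractVector)
    f = get(RAISE_MAPPING, extension, identity)
    return map(f, column)
end

stored = map(lower, [Celsius(20.5), Celsius(-3.0)])
restored = deserialize_column("celsius", stored)
untouched = deserialize_column("unknown", stored)   # unregistered extensions pass through
```

Because `lower` and `raise` are separate functions, a caller can define only the serialization direction and leave deserialization at its default.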
Closing now that #156 is merged/tagged
E.g.
In a new session:
I believe this is due to the write-time registration of types at
https://github.com/JuliaData/Arrow.jl/blob/3ab2b18829c1656198a85759360389b6bbb22ab3/src/arraytypes/struct.jl#L86