apache / arrow-julia

Official Julia implementation of Apache Arrow
https://arrow.apache.org/julia/
Other
285 stars 59 forks source link

reconsidering the current type registration/serialization mechanism (and its internal usage) #88

Closed ericphanson closed 3 years ago

ericphanson commented 3 years ago

E.g.

julia> using Arrow, UUIDs                           

julia> table = (;x = [uuid4() for _ = 1:5])                                                                               
(x = UUID[UUID("03fec02f-3d33-4cf8-88e0-1e152faccff7"), UUID("a789baa7-9059-4bc4-b0d9-2af1dca4bf88"), UUID("2a4ecfd7-049e-
468d-8fed-8f967b732ee0"), UUID("91b9cad6-b84c-4ee8-aeaf-7448778fb1d5"), UUID("39edead7-1670-430a-8e4d-fefe3615c2d0")],)   

julia> Arrow.write("uuids.arrow", table)                                                                                  
"uuids.arrow"                                                                                                             

julia> Arrow.Table("uuids.arrow")
Arrow.Table: (x = UUID[UUID("03fec02f-3d33-4cf8-88e0-1e152faccff7"), UUID("a789baa7-9059-4bc4-b0d9-2af1dca4bf88"), UUID("2
a4ecfd7-049e-468d-8fed-8f967b732ee0"), UUID("91b9cad6-b84c-4ee8-aeaf-7448778fb1d5"), UUID("39edead7-1670-430a-8e4d-fefe361
5c2d0")],)

In a new session:

julia> using Arrow, UUIDs

julia> Arrow.Table("uuids.arrow")
┌ Warning: unsupported ARROW:extension:name type: "JuliaLang.UUID"
└ @ Arrow.ArrowTypes ~/.julia/packages/Arrow/CyJ4L/src/arrowtypes.jl:141
Arrow.Table: (x = NamedTuple{(:value,),Tuple{UInt128}}[(value = 0x03fec02f3d334cf888e01e152faccff7,), (value = 0xa789baa790594bc4b0d92af1dca4bf88,), (value = 0x2a4ecfd7049e468d8fed8f967b732ee0,), (value = 0x91b9cad6b84c4ee8aeaf7448778fb1d5,), (value = 0x39edead71670430a8e4dfefe3615c2d0,)],)

I believe this is due to the write-time registration of types at

https://github.com/JuliaData/Arrow.jl/blob/3ab2b18829c1656198a85759360389b6bbb22ab3/src/arraytypes/struct.jl#L86

ericphanson commented 3 years ago

(The workaround is to just call Arrow.ArrowTypes.registertype!(UUID, UUID) before deserializing. But I think the hidden statefulness is still confusing / problematic).

quinnj commented 3 years ago

The UUID case is now fixed (defined by default in Arrow) and we've updated the docs to mention the need to call registertype!. I'm considering some larger changes to type serializing and such, so this might be something we make easier with that.

jrevels commented 3 years ago

@quinnj what's the motivation behind https://github.com/JuliaData/Arrow.jl/blob/3ab2b18829c1656198a85759360389b6bbb22ab3/src/arraytypes/struct.jl#L86? Is it just to give the "convenience behavior" listed in the OP or is there a deeper reason? If it's just the former, I wonder if it's better just to remove it...I ran into another related issue just now.

I'm essentially implementing the following (which is also why I needed https://github.com/JuliaData/Arrow.jl/pull/150):

struct Foo ... end

struct _FooArrow ... end

Foo(::_FooArrow) = ...

Arrow.ArrowTypes.registertype!(Foo, _FooArrow)

Arrow.ArrowTypes.arrowconvert(::Type{_FooArrow}, f::Foo) = ...

the above in theory would allow me to have full control over Arrow <-> Julia conversion for my Foo type.

The problem is that Arrow.jl is automatically calling ArrowTypes.registertype!(_FooArrow, _FooArrow) on write even though I don't want it to :( as a caller I can't really think of a scenario where I would want auto-registration, but I could be missing something.

Even if we can't get rid of it in general, would it be possible to gate this behavior behind a flag passed to Arrow.write (autoregister=true)?

ericphanson commented 3 years ago

Just to add another reason in favor of removing it, mutating the global registration dict at write-time seems like it could be an issue for concurrent writing from different threads (ref https://github.com/JuliaData/Arrow.jl/issues/90#issuecomment-797516022 for other thread safety issues). Whereas the user could be sure to always manually register outside the threaded region of code.

quinnj commented 3 years ago

Yeah, these are good points for removing the auto registering. The main reason for having it was convenience.

quinnj commented 3 years ago

@jrevels , can you explain your use-case/example a bit more? What I dont' quite follow is how _FooArrow will be supported? The 2nd argument to registertype! should be a native arrow type that your custom type converts to.

quinnj commented 3 years ago

Hold up, don't mind me. I'm digging back through all the code and in the structs.jl file we know how to serialize a _FooArrow, so yeah, I think I understand the example better now.

quinnj commented 3 years ago

Wait, backsies again. So the problem with not autoregistering, is that without ArrowTypes.registertype!(_FooArrow, _FooArrow), we don't know how to deserialize the struct, it would just deserialize as a NamedTuple. Here's where my thinking is going, though I recognize the code itself doesn't currently reflect this vision:

jrevels commented 3 years ago

So the problem with not autoregistering, is that without ArrowTypes.registertype!(_FooArrow, _FooArrow), we don't know how to deserialize the struct, it would just deserialize as a NamedTuple.

Ah, but for me this is the desired behavior :) I want it to deserialize as NamedTuple unless I, the caller, tell it explicitly not to. Right now it feels like Arrow.jl is making the decision for me, and it's making the wrong one (AFAICT).

Reopening this issue as it seems like the discussion may lead to some action items :)

jrevels commented 3 years ago

ref https://github.com/beacon-biosignals/Onda.jl/pull/68 for a motivating example.

my thoughts are very rough/not super well-considered yet, but off the top of my head, here are my big "wants" (some of these might already be possible w/ existing behavior):

jrevels commented 3 years ago

preserving some relevant convo from the Julia Slack (https://julialang.slack.com/archives/C674VR0HH/p1615846377461400?thread_ts=1615681758.430500&cid=C674VR0HH)

Screen Shot 2021-03-16 at 10 27 42 PM
quinnj commented 3 years ago

Closing now that #156 is merged/tagged