ExpandingMan / Arrow.jl

DEPRECATED in favor of [JuliaData/Arrow.jl](https://github.com/JuliaData/Arrow.jl)

Metadata and IPC part of the spec #4

Open davidanthoff opened 6 years ago

davidanthoff commented 6 years ago

I might just have missed it, but right now this package doesn't cover https://arrow.apache.org/docs/metadata.html and https://arrow.apache.org/docs/ipc.html, right?

Aren't those two parts the stuff you described as lacking in the arrow spec?

For example, to fully interop with the javascript side as described in https://github.com/apache/arrow/tree/master/js, wouldn't we require those parts as well?

ExpandingMan commented 6 years ago

You're right that I don't yet cover the first part of this, about schemas, nor have I worked on any inter-process communication stuff. I do, however, support the Arrow date-time data types (I don't have a section in the README on them yet).

I have to admit to being somewhat confused by this page. What I was referring to in the other thread was that these schemas don't seem to specify an explicit way of communicating all of the pointers to data, it mainly seems concerned with giving some sort of summary metadata (am I wrong?). For an example of what I'm talking about, the Feather metadata format seems to have been completely pulled out of thin air, and doesn't seem related to these pages much at all, but what it describes is exactly the sort of thing you'd need to actually pull data out of a buffer.

I'm pretty sure that the stuff that appears in the documents you linked can simply be appended, and that there's nothing about the existing structure of Arrow.jl that would preclude this type of use case.

davidanthoff commented 6 years ago

I think the RecordBatch and then the Buffer are essentially the pointers to the data? But yes, that whole writing seems not super clear...

I think the Feather format is essentially a different meta data format relative to the arrow metadata format?

ExpandingMan commented 6 years ago

Perhaps, but I don't find the metadata documentation clear at all.

Yes, it does seem that the Feather metadata is just its own thing. I remain confused as to whether this is because Arrow doesn't specify it or Feather is just being difficult.

davidanthoff commented 6 years ago

Maybe one can figure it out by looking at existing implementations? The Typescript one is probably easy to digest.

randyzwitch commented 6 years ago

I'm going to try and figure this out, as I need the Schema type for a work project.

randyzwitch commented 6 years ago

For the schema, from my reading it seems like you receive a pointer/size as your response, with the schema taking the shape of:

table Schema {

  /// endianness of the buffer
  /// it is Little Endian by default
  /// if endianness doesn't match the underlying system then the vectors need to be converted
  endianness: Endianness=Little;

  fields: [Field];
  // User-defined metadata
  custom_metadata: [ KeyValue ];
}

Unfortunately, right now I don't quite see how to use FlatBuffers.jl to read this, though I'm waiting for some help from internal (C++) folks who can confirm that the schema they are passing me is in fact the above.

ExpandingMan commented 6 years ago

For an example of how to get FlatBuffers.jl to read it, look at Feather.jl.

randyzwitch commented 6 years ago

@ExpandingMan Unfortunately, Feather.jl doesn't quite help. I was hoping I could get away with this:

mutable struct Schema
    endianess::String
    fields::AbstractVector
    custom_metadata::String
end

julia> FlatBuffers.readbuffer(schema, 1, Schema)
ERROR: MethodError: Cannot `convert` an object of type Type{Schema} to an object of type Array{UInt8,1}
This may have arisen from a call to the constructor Array{UInt8,1}(...),
since type constructors fall back to convert methods.
Stacktrace:
 [1] read(::Base.AbstractIOBuffer{SubArray{UInt8,1,Array{UInt8,1},Tuple{UnitRange{Int64}},true}}, ::Type{T} where T) at ./io.jl:528
 [2] readbuffer(::Array{UInt8,1}, ::Int64, ::Type{Schema}) at /Users/randyzwitch/.julia/v0.6/FlatBuffers/src/internals.jl:9
 [3] macro expansion at /Users/randyzwitch/.julia/v0.6/Atom/src/repl.jl:118 [inlined]
 [4] anonymous at ./<missing>:?

I couldn't.

ExpandingMan commented 6 years ago

I don't think that's the right function to call, you are probably looking for FlatBuffers.read. You'll probably need to check out the FlatBuffers.jl documentation and the original flatbuffers documentation linked within. It would probably take a bit of time to look through Feather.jl and really understand what it's doing, but it's a pretty complete example of what you could do with Arrow.jl and fortunately isn't very much code.
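
For reference, here is a minimal sketch of Julia types that follow the shape of the Schema table quoted above. It does not call FlatBuffers.jl at all, so it says nothing about which function actually reads the buffer. It only illustrates that `endianness` is an enum (declared as a `short` in Arrow's Schema.fbs) and that `custom_metadata` is a vector of key/value pairs rather than a `String`. The `Field` struct is a simplified placeholder (the real Arrow `Field` table has more members), and all type names are illustrative assumptions.

```julia
# Mirrors `enum Endianness:short { Little, Big }` from Arrow's Schema.fbs.
@enum Endianness::Int16 Little=0 Big=1

# Mirrors `table KeyValue { key: string; value: string; }`.
struct KeyValue
    key::String
    value::String
end

# Simplified placeholder (assumption): the real Field table also carries
# the logical type, child fields, dictionary encoding and its own metadata.
struct Field
    name::String
    nullable::Bool
end

# Same three members as the Schema table quoted in this thread.
struct Schema
    endianness::Endianness
    fields::Vector{Field}
    custom_metadata::Vector{KeyValue}
end

# Build one by hand to check that the shapes line up.
schema = Schema(Little, [Field("x", false)], [KeyValue("origin", "example")])
```

Whether FlatBuffers.jl can deserialize directly into types shaped like this depends on the FlatBuffers.jl version in use, so treat the layout as a starting point to check against its documentation.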

gabomgp commented 6 years ago

@randyzwitch I'm using a custom format based on the Arrow Stream IPC format as the payload in some REST services. The backend is Python and the frontend is Angular. We need to integrate Julia into some microservices, and I would really like to use Arrow as the IPC format between Julia and Python too. So, maybe you can point me in the right direction for reading and writing tables using the Arrow IPC format...?

randyzwitch commented 6 years ago

@gabomgp I haven't had time to build this in Julia. Most likely, I will be wrapping Arrow c_glib in the near term, then figuring out if it's worth the time for me to do it in native Julia.

The only references I've been working from are the Arrow documentation site and the Arrow developer email list.

ExpandingMan commented 6 years ago

Unfortunately when I wrote this package I had a perspective which was rather skewed by the goal of using the Feather format which, for some reason, seemingly makes a bare minimal effort to provide arrow compatible data (it uses a metadata format that seemingly has nothing to do with arrow).

A better name for this package likely would have been ArrowArrays.jl.

randyzwitch commented 6 years ago

I don't think it's the case that this package isn't really "Arrow", just that there has been some drift. I do hope to get to the point where I can provide the IPC code back into this package, just that I really need to show some progress soon and wrapping the 3 C++ functions that I need is probably the quickest way forward!

zhouyan commented 5 years ago

Is there any plan or interest in making this package fully compatible with the main Arrow project in the IPC area, in particular reaching at least feature parity with pyarrow? @ExpandingMan If you think it is a worthy goal, I would like to take a shot at it, unless some work is already under way.

It will take me some time, even months, before I can contribute back a workable PR, but in the end this is what I would like to achieve: support for Message/RecordBatch and a bare-minimum Table type that facilitates such IPC transfer, which may not be feature rich but is easily convertible to other data types such as a DataFrame for more user-friendly use cases.

The Feather format has its limitations; most of all, it doesn't really support all the data types Arrow supports. I have had some good experience using Arrow as an exchange format for data between R/Julia/Python/C++. However, for now I am limited to using only the data types supported by Feather.

zhouyan commented 5 years ago

https://github.com/zhouyan/Arrow.jl/tree/feature/ipc

I created a simple proof of concept for reading a record batch stream last night. Datetime types and List can both be added with only a little effort. The test case binary file RecordBatchStream.out is generated from C++. A lot of work needs to be done before it is at least a proper prototype, though.

ExpandingMan commented 5 years ago

I had started work on this in this branch. I made good progress, but I got derailed from it, for various reasons. I still intend to work on it, but I can't say when that might be.

zhouyan commented 5 years ago

How about I work on it and you review the results when you have time?

ExpandingMan commented 5 years ago

I'm not really interested in taking responsibility for maintenance of a full-blown C++ wrapper. However, if you create a full working C++ wrapper package, I would be glad to de-register this Arrow.jl so that you can have the name and be a "standard" Julia arrow package if you so wish. I'd even help with moving Feather.jl over to that package.

zhouyan commented 5 years ago

I am not interested in a C++ wrapper; I am interested in a pure Julia implementation. The reason I generated the test file with C++ is that I haven't implemented the write part yet. Besides, I think it is important to test cross-language compatibility, that is, the Julia implementation should be able to read record batches written by other implementations and vice versa.

ExpandingMan commented 5 years ago

The reason I'm hesitating to embrace this is that I sort of feel that this has to be rebuilt from the ground up, like you see in my new branch. If you were to do that, it'd make more sense for it to be your package. That'd be great, I'd be very happy if you or anyone else came along with a full working package, I'm just not sure what my involvement is at that point.

I'd say just go ahead and do what you wanted to do. Again, if you have a full working version, we can make sure that Arrow.jl is registered to that. I can probably help out to some extent, but business with other things combined with waning interest in the arrow spec has made it hard for me to get that motivated with this lately. That could change if I suddenly found myself with lots of use for it.

zhouyan commented 5 years ago

Sure, no problem. But I will likely need to borrow a lot of logic already in this package, maybe even some code, if I am going to start a new package from the ground up. Of course credit will be given where it's due, but I don't want to reinvent the wheel either. If that's OK with you, then I will be glad to start a new package with a full implementation of Arrow in mind and take responsibility for it once it's done.

ExpandingMan commented 5 years ago

Sure, take whatever you want. That's the reason it's open source with a permissive license.

By the way, I suggest you check out my new branch and try to stick closer to that than the old stuff. The new branch is a lot more well thought out.

zhouyan commented 5 years ago

Sure, thanks. I will probably get around to it sometime next month, after some work stuff is sorted out. Meanwhile I will study the new branch and the existing code to plan the structure and design.

davidanthoff commented 5 years ago

"de-register this Arrow.jl"

"if you have a full working version, we can make sure that Arrow.jl is registered to that"

I don't think these are options, even with the new package manager. I think there are two potential ways forward: a) a new package with a new name, or b) a PR that just changes the content of this package here.

ExpandingMan commented 5 years ago

I'm not sure whether it's possible, but please don't panic. I did not mean to imply that this was imminent by any means, if this does happen I'm sure it would be a good while before it does and I will make sure it is done carefully.

That said, this has piqued my interest a bit and I may go back to this this week.

If I can at least get a subset of it up and running, I think it would be far easier for anyone else to complete it than if I left it to its current state.

zhouyan commented 5 years ago

Let’s not worry much about the registry for now. It will be quite a while before I can even start working on it at full speed. When it’s done, we can assess whether it is suitable as a replacement or as a separate package, or whether it is worthless junk.

randyzwitch commented 5 years ago

When you get your code back @expandingman, I’ve got a real-world use case to test IPC on and would be happy to do so

ExpandingMan commented 5 years ago

I have to say that all this interest is piquing my interest. I'm at least going to get a working minimal prototype out of my new branch. I don't know if I'll finish it, but once I get the metadata reading working in a robust way there'll at least be a minimal working version. Stay tuned.

zhouyan commented 5 years ago

You can have a look at my branch, which reads the Metadata and the Message body into Primitive:

https://github.com/zhouyan/ArrowFork.jl/tree/feature/ipc

I haven’t worked on it since the first couple of commits a few nights ago. I got stuck with some work stuff and haven’t gotten around to the rest of it yet. After the talk of the new package, I am playing around with an alternative implementation design. It makes a small sacrifice in performance if one wants to access the ArrowVector directly, but it is easier to serialize and deserialize. The goal is to get Arrow data in and out of native Julia structures fast (e.g., often from one of the IPC formats). My main reasoning is that working with Arrow data may incur a cost on the scale of a memcpy to get it into native structures, but the overall performance should be better because of all the optimized code that has been written around the native structures. Essentially, this makes ArrowVector a midway structure between Julia and the IPC format.

ExpandingMan commented 5 years ago

Ideally the ArrowVectors should be views into data which are created by reading the metadata. This way, the overhead of creating them should be the same as the overhead of creating the metadata plus a few small allocations, accessing them should be basically free (there's usually very little to allocate in the way of metadata). That way you could also do no copy memory mapping. Again, my new branch is much cleaner.
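
The idea of building vectors as views from metadata can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from either branch: the `Buffer` struct follows the `offset`/`length` pair that Arrow's metadata uses to locate a buffer inside a message body, while `buffer_view` and the hand-built `body` are hypothetical names made up for the example.

```julia
# Mirrors `struct Buffer { offset: long; length: long; }` from the Arrow
# metadata: a byte offset into the message body plus a byte length.
struct Buffer
    offset::Int64
    length::Int64
end

# Hypothetical helper: expose the bytes described by `buf` as elements of
# type T without copying. Only a small wrapper object is allocated.
function buffer_view(body::AbstractVector{UInt8}, buf::Buffer, ::Type{T}) where {T}
    bytes = view(body, (buf.offset + 1):(buf.offset + buf.length))
    return reinterpret(T, bytes)
end

# Stand-in for a message body (it could equally be a memory-mapped file):
# two Int32 values stored at byte offset 8.
body = zeros(UInt8, 16)
body[9:16] .= reinterpret(UInt8, Int32[7, 11])

col = buffer_view(body, Buffer(8, 8), Int32)

# Writing through the view changes the underlying body, which shows that
# no copy was made.
col[2] = 13
```

Because `col` is just a reinterpreted view, creating it costs a couple of small allocations regardless of how large the body is, which is the property described above.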

zhouyan commented 5 years ago

Yeah, that’s the idea.

Buffers are only allocated when an ArrowVector is created from Julia types to be serialized (and even then not always). When reading the IPC format, it holds the metadata and a reference to the message body as a view. The memcpy cost is paid when it is converted to, say, Vector{T}. That cost can technically be avoided, but sometimes it is better to pay the price once instead of in a lot of small places. A simple example: for a nullable array, indexing into the array means bitwise operations on the null_bitmap, while working with Vector{Union{Missing,T}} is more convenient and sometimes yields better performance. Of course, for a non-nullable array, accessing the raw buffer is as fast as accessing a Vector{T}, minus the cost of the copy. However, I would like to avoid the distinction between, say, NullablePrimitive and Primitive, or List, etc., and have just one unified parametric type ArrowVector{T}, where the parameter T (= Int32, List{Int32}, …) affects how the buffers are accessed via method dispatch, while the member data are the same regardless of the type (similar to ArrayData in the reference C++ implementation). So we pay a little space cost on the stack for, say, Primitive types, but get a much cleaner structure for IPC.
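
To make the nullable-array trade-off concrete, here is a small sketch of both access patterns. It assumes the Arrow validity bitmap convention (least-significant-bit numbering, a set bit means the slot is valid); the function names `isvalid_at` and `materialize` are made up for this illustration and do not come from any existing package.

```julia
# Per-element access: every lookup does a shift and a mask on the bitmap.
# `i` is a 1-based Julia index; Arrow numbers bits from 0, LSB first.
function isvalid_at(bitmap::AbstractVector{UInt8}, i::Integer)
    byte, bit = divrem(i - 1, 8)
    return (bitmap[byte + 1] >> bit) & 0x01 == 0x01
end

# One-time conversion: pay the copy once and get a native
# Vector{Union{Missing,T}} that the rest of Julia handles efficiently.
function materialize(values::AbstractVector{T}, bitmap::AbstractVector{UInt8}) where {T}
    out = Vector{Union{Missing,T}}(undef, length(values))
    for i in eachindex(values)
        out[i] = isvalid_at(bitmap, i) ? values[i] : missing
    end
    return out
end

values = Int32[10, 20, 30, 40]
bitmap = UInt8[0b00001011]          # bits 0, 1 and 3 set: slots 1, 2 and 4 are valid

result = materialize(values, bitmap)
```

Here `result` has 10, 20 and 40 in slots 1, 2 and 4, and `missing` in slot 3. Whether the one-time copy wins over per-element bit operations depends on how often the data is accessed afterwards.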

ExpandingMan commented 5 years ago

I found that avoiding the distinction between e.g. Primitive and NullablePrimitive is more effort than it's worth. I also found that one really should avoid conflating the "logical" memory types with the "base" data types. This was one of my major mistakes the first time around that caused me to decide to re-implement the whole thing. You'll notice, for example, that my new List objects would now have element types of Vector{UInt8} rather than String. This is quite deliberate, even though it will require another layer of wrappers.

Of course, how you do it is entirely up to you, but I did learn a few lessons the first time around on this.

Anyway, are you on the Julia slack? It would probably be better to discuss these things on there.

zhouyan commented 5 years ago

Not yet, but I can join the Julia slack if someone sends an invite or link. Sure, I will probably learn the same lesson after a few tries. I come from working on the C++ implementation. At first I hated how it handled the type system, but over time I started to see some of the reasons behind it.

ExpandingMan commented 5 years ago

See here for slack invites if interested.

zhouyan commented 5 years ago

Thanks, just joined.