apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

How does arrow parse its ipc format? #13875

Closed liusitan closed 2 years ago

liusitan commented 2 years ago

Hi there, I am trying to hack on the Arrow IPC format, and I am confused about how Arrow differentiates between the different types in record batch buffers and parses them.


For example, say I store this data frame:

name     | age | balance
'jack'   | 12  | 100.23
'Jennie' | 24  | 2000.34

I believe 'jack' and 'Jennie' are stored in one buffer, 12 and 24 in another, and 100.23 and 2000.34 in a third.

How does the Arrow deserializer know how to parse each buffer?

liusitan commented 2 years ago

The reason I am hacking on the Arrow IPC format is that I am implementing a FUSE filesystem for vineyard, an immutable storage manager that uses a columnar format like Arrow's.

We decided to let our clients access vineyard objects by reading them in the Arrow IPC format. That means when a client wants to access objects stored in vineyard, the FUSE filesystem serializes the corresponding vineyard objects on the fly, stores the result in the FUSE process, and serves the serialized Arrow-formatted vineyard objects to the client. However, this approach may lead to heavy memory usage.

We are wondering whether it is possible to create a mapping from the Arrow IPC format back to the vineyard objects. In terms of the information stored in the vineyard objects' metadata it is entirely possible, especially for data frames, since we store those in units of columns as well.

Theoretically, if a user wants to read bytes 100 to 200 of an Arrow-formatted vineyard object, that is conceptually a range of data in the first column; my implementation could recover that conceptual representation from the byte range, fetch the data from vineyard, serialize it, and return exactly what the client asked for.

Practically, I have not found a way to precompute the sizes of each part of the serialized Arrow IPC format from the documentation so far. After compressing the question down, I arrived at the one above.
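For reference, pyarrow can at least report the total size of a serialized record batch message ahead of time; a minimal sketch (this gives the whole message size including metadata and padding, not the per-buffer breakdown I am after):

import pyarrow as pa

batch = pa.record_batch(
    [pa.array(["jack", "Jennie"]),
     pa.array([12, 24], type=pa.int32()),
     pa.array([100.23, 2000.34])],
    names=["name", "age", "balance"],
)

# Total size of the serialized RecordBatch message,
# including the flatbuffer metadata and padding.
print(pa.ipc.get_record_batch_size(batch))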

zeroshade commented 2 years ago

Hey @liusitan, the first message in an Arrow IPC stream should always be a Schema message, which describes the actual schema of the following record batch messages. So anything processing an IPC RecordBatch message should already have the expected schema for that record batch.
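If it helps, you can observe this with pyarrow by walking the raw messages of a stream; a minimal sketch:

import io
import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2])], names=["x"])
sink = io.BytesIO()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)

# Iterate over the raw IPC messages: the Schema message always comes first.
reader = pa.ipc.MessageReader.open_stream(io.BytesIO(sink.getvalue()))
print(reader.read_next_message().type)  # 'schema'
print(reader.read_next_message().type)  # 'record batch'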

Then you have the nodes field of the flatbuffer message, which is a flattened version of the logical schema. So for example, if you had the following schema:

col1: Struct<a: Int32, b: List<item: Int64>, c: Float64>
col2: Utf8

That Nodes list would be:

FieldNode 0: Struct name='col1'
FieldNode 1: Int32 name='a'
FieldNode 2: List name='b'
FieldNode 3: Int64 name='item'
FieldNode 4: Float64 name='c'
FieldNode 5: Utf8 name='col2'
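The flattening is a pre-order, depth-first traversal (each parent followed by its children). A small pyarrow sketch that reproduces the list above from the schema:

import pyarrow as pa

def flatten(field):
    # Pre-order depth-first walk: the parent first, then each child
    # subtree, matching the order of the FieldNode entries.
    yield field
    for i in range(field.type.num_fields):
        yield from flatten(field.type.field(i))

schema = pa.schema([
    ("col1", pa.struct([("a", pa.int32()),
                        ("b", pa.list_(pa.int64())),  # list value field is named 'item'
                        ("c", pa.float64())])),
    ("col2", pa.utf8()),
])

for i, f in enumerate(f for top in schema for f in flatten(top)):
    print(f"FieldNode {i}: {f.type} name={f.name!r}")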

Each of those FieldNode objects contains the appropriate metadata describing the length of the array and the number of nulls. The RecordBatch message also has the buffers list, which gives the offset and length of each buffer sent in the message body. Thus, for the example described above we'd expect the buffers to correspond to:

buffer 0: field 0 validity bitmap
buffer 1: field 1 validity bitmap
buffer 2: field 1 values buffer (ie: a buffer of int32 values)
buffer 3: field 2 validity bitmap
buffer 4: field 2 offsets (ie: the offsets buffer for the List array, a buffer of int32 values)
buffer 5: field 3 validity bitmap
buffer 6: field 3 values (ie: a buffer of int64 values)
buffer 7: field 4 validity bitmap
buffer 8: field 4 values (ie: a buffer of float64 values)
buffer 9: field 5 validity bitmap
buffer 10: field 5 offsets (ie: the offsets for col2, the utf8 array, a buffer of int32 values)
buffer 11: field 5 data (ie: the data buffer for the utf8 column, all of the string values)
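You can see the same flattened buffer order from pyarrow's Array.buffers(), which walks the children in that same pre-order; a sketch (note that pyarrow returns None where a buffer is absent, e.g. the validity bitmap of an array with no nulls, where the IPC message would carry a zero-length entry instead):

import pyarrow as pa

col1 = pa.array(
    [{"a": 1, "b": [1, 2], "c": 3.0}],
    type=pa.struct([("a", pa.int32()),
                    ("b", pa.list_(pa.int64())),
                    ("c", pa.float64())]),
)
col2 = pa.array(["x"])

# 12 entries, lining up with buffers 0-11 described above.
for i, buf in enumerate(col1.buffers() + col2.buffers()):
    print(f"buffer {i}: {'<absent>' if buf is None else f'{buf.size} bytes'}")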

For your provided example:

name     | age | balance
'jack'   | 12  | 100.23
'Jennie' | 24  | 2000.34

The first message in the stream would be the schema:

name: Utf8
age: Int32
balance: Float64

the nodes in the RecordBatch message (the second message of the stream) would contain:

FieldNode 0: Utf8 name='name', length=2, nulls=0
FieldNode 1: Int32 name='age', length=2, nulls=0
FieldNode 2: Float64 name='balance', length=2, nulls=0

So now you can parse the buffers because you know the types:

buffer 0: validity bitmap for name column (1 byte, though writers may emit a zero-length buffer here since there are no nulls)
buffer 1: offsets for name column (should be the int32 values [0, 4, 10])
buffer 2: data buffer for name column (should be 'jackJennie')
buffer 3: validity bitmap for age column (same note as above)
buffer 4: values for age column (should be the int32 values [12, 24], ie: 8 bytes)
buffer 5: validity bitmap for balance column
buffer 6: values for balance column (should be the bytes for [100.23, 2000.34], ie: 16 bytes)
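You can double-check those buffers with pyarrow (a sketch; since there are no nulls, pyarrow hands back None for the validity bitmaps):

import pyarrow as pa

batch = pa.record_batch(
    [pa.array(["jack", "Jennie"]),          # utf8
     pa.array([12, 24], type=pa.int32()),   # int32
     pa.array([100.23, 2000.34])],          # float64
    names=["name", "age", "balance"],
)

# Per column: validity bitmap, then offsets (variable-length types
# only), then the values/data buffer.
name_bufs = batch.column(0).buffers()
print(memoryview(name_bufs[1]).cast("i").tolist())                  # [0, 4, 10]
print(bytes(name_bufs[2]))                                          # b'jackJennie'
print(memoryview(batch.column(1).buffers()[1]).cast("i").tolist())  # [12, 24]
print(memoryview(batch.column(2).buffers()[1]).cast("d").tolist())  # [100.23, 2000.34]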

Hope that helps

liusitan commented 2 years ago

Hi @zeroshade, thanks for the reply. I am confused about the FieldNode entries you wrote down, because according to the fbs file on the main branch, FieldNode does not define a type. Are you referring to something else?

FieldNode 0: Utf8 name='name', length=2, nulls=0
FieldNode 1: Int32 name='age', length=2, nulls=0
FieldNode 2: Float64 name='balance', length=2, nulls=0

zeroshade commented 2 years ago

@liusitan the FieldNode's type comes from the Schema that we already know from the first message (I included the type in my example to help illustrate which field each node corresponds to).

liusitan commented 2 years ago

Thanks for the lightning-fast response! Got it, I am looking at Schema.fbs now.


I think you are saying that the Field entries in the Schema include the type info. Looking at the Field table definition in Schema.fbs, I can use its type field to determine the type of the field node, right?

I notice the comment on that field says "This is the type of the decoded value if the field is dictionary encoded." What does dictionary encoded mean? Is the example below dictionary encoded?

name     | age | balance
'jack'   | 12  | 100.23
'Jennie' | 24  | 2000.34

zeroshade commented 2 years ago

I can use its type field to determine the type of the field node, right?

Correct!

I notice the comment says "This is the type of the decoded value if the field is dictionary encoded." What does dictionary encoded mean? Is the example below dictionary encoded...

Dictionary encoding is a specific type of Arrow array that is ideal for large arrays with low cardinality. You can find a full description of dictionary-encoded arrays here in the Arrow docs.
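And no, your example table is not dictionary encoded: a plain utf8 column stays a regular string array unless you (or the writer) explicitly encode it. A quick pyarrow illustration:

import pyarrow as pa

arr = pa.array(["jack", "Jennie", "jack", "jack", "Jennie"])
dict_arr = arr.dictionary_encode()

print(dict_arr.type)        # dictionary<values=string, indices=int32, ordered=0>
print(dict_arr.indices)     # [0, 1, 0, 0, 1] -- small integer codes
print(dict_arr.dictionary)  # ["jack", "Jennie"] -- the unique values

In the IPC stream, a dictionary-encoded column adds DictionaryBatch messages carrying the unique values, while the RecordBatch buffers hold only the integer indices.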

liusitan commented 2 years ago

Got it, thank you!