[Format] Passing column statistics through Arrow C data interface

ianmcook commented 11 months ago

Describe the enhancement requested

Is there any standard or convention for passing column statistics through the C data interface?

For example, say there is a module that reads a Parquet file into memory in Arrow format then passes the data arrays to another module through the C data interface. If the Parquet file metadata includes Parquet column statistics such as distinct_count, max, min, and null_count, can the sending module pass those statistics through the C data interface, to allow the receiving module to use the statistics to perform computations more efficiently?

Component(s)

Format

zeroshade commented 11 months ago

There's no current standard convention, there's a few ways such could be sent though such as via a struct array with one row and fields for each of those or various other configurations.

Passing such statistics would be beyond the scope of the current C Data interface IMHO

ianmcook commented 11 months ago

cc @Tishj @pdet

There is a related discussion at https://github.com/duckdb/duckdb/discussions/4636#discussioncomment-7496580

pdet commented 11 months ago

There's no current standard convention, there's a few ways such could be sent though such as via a struct array with one row and fields for each of those or various other configurations.

Passing such statistics would be beyond the scope of the current C Data interface IMHO

I would argue that it makes sense to include statistical information in the C data interface. Efficient execution of complex queries requires statistical information, and I believe that most Arrow Producers possess this information. Therefore, it should be somehow passed. An alternative I can think of for the C-Data interface is to wrap the statistical information in top-level objects (e.g., ArrowRecordBatch in Python), but that approach is quite cumbersome and would necessitate specific implementations for every client API.

pdet commented 6 months ago

Just to provide a clearer example of how statistics can be useful for query optimization.

In DuckDB, join ordering is currently determined using heuristics based on table cardinalities with future plans to enhance this with sample statistics.

Not only join ordering is affected by statistics but even the choice of the probe side in a hash join, will be determined based on the expected cardinalities.

One example of a query that is affected by join ordering is Q 21 of tpch. However, the plan for it is too big to share it easily in a GitHub Discussion.

To give a simpler example of how cardinalities affect this, I've created two tables.

t - Has 10^8 rows, ranging from 0 to 10^8
t_2 - Has 10 rows, ranging from 0 to 10

My example query is a simple inner join of these two tables and we calculate the sum of t.i.

SELECT SUM(t.i) from t inner join t_2 on (t_2.k = t.i)

Because the optimizer doesn't have any information of statistics from the Arrow side, it will basically pick the probe side depending on what's presented in the query.

SELECT SUM(t.i) from t_2 inner join t on (t_2.k = t.i)

This would result in a slightly different plan, yet with significant differences in performance.

As depicted in the screenshot of executing both queries, choosing the incorrect probe side for this query already results in a performance difference of an order of magnitude. For more complex queries, the variations in execution time could be not only larger but also more difficult to trace.

For reference, the code I used for this example:

import duckdb
import time
import statistics

con = duckdb.connect()

# Create table with 10^8
con.execute("CREATE TABLE t as SELECT * FROM RANGE(0, 100000000) tbl(i)")
# Create Table with 10
con.execute("CREATE TABLE t_2 as SELECT * FROM RANGE(0, 10) tbl(k)")

query_slow = '''
SELECT SUM(t.i) from t inner join t_2 on (t_2.k = t.i);
'''

query_fast = '''
SELECT SUM(t.i) from t_2 inner join t on (t_2.k = t.i);'''

t = con.execute("FROM t").fetch_arrow_table()
t_2 = con.execute("FROM t_2").fetch_arrow_table()

con_2 = duckdb.connect()

print("DuckDB Arrow - Query Slow")

print(con_2.execute("EXPLAIN " + query_slow).fetchall()[0][1])

execution_times = []

for _ in range(5):
    start_time = time.time()
    con_2.execute(query_slow)
    end_time = time.time()

    execution_times.append(end_time - start_time)

median_time = statistics.median(execution_times)

print(median_time)

print("DuckDB Arrow - Query Fast")

print(con_2.execute("EXPLAIN " + query_fast).fetchall()[0][1])

execution_times = []

for _ in range(5):
    start_time = time.time()
    con_2.execute(query_fast)
    end_time = time.time()

    execution_times.append(end_time - start_time)

median_time = statistics.median(execution_times)

print(median_time)

kou commented 6 months ago

I'm considering some approaches for this use case. This is not completed yet but share my idea so far. Feedback is appreciated.

ADBC uses the following schema to return statistics:

https://github.com/apache/arrow-adbc/blob/3f80831d12b6e5a78a4321f67e28d652951241cf/adbc.h#L1739-L1778

It's designed for returning statistics of a database.

We can simplify this schema because we can just return statistics of a record batch. For example:

Field Name	Field Type	Comments
column_name	utf8	(1)
statistic_key	int16 not null	(2)
statistic_value	`VALUE_SCHEMA` not null
statistic_is_approximate	bool not null	(3)

If null, then the statistic applies to the entire table.
A dictionary-encoded statistic name (although we do not use the Arrow dictionary type). Values in [0, 1024) are reserved for ADBC. Other values are for implementation-specific statistics. For the definitions of predefined statistic types, see adbc-table-statistics. To get driver-specific statistic names, use AdbcConnectionGetStatisticNames().
If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

Field Name	Field Type
int64	int64
uint64	uint64
float64	float64
binary	binary

TODO: How to represent statistic key? Should we use ADBC style? (Assigning an ID for each statistic key and using it.)

If we represent statistics as a record batch, we can pass statistics through Arrow C data interface. This may be a reasonable approach.

If we use this approach, we need to do the followings:

Define a schema as a specification
Add statistics related APIs to Apache Arrow C++ and other implementation because we need two more implementations for specification change
- https://arrow.apache.org/docs/format/Changing.html#at-least-two-reference-implementations
- This is not a format change but it's better that we should follow the rule
- We can work on this for Apache Arrow C++ before we propose a specification because statistics will be useful for general propose
Apache Arrow C++: Add support for importing statistics from Apache Parquet C++
...

TODO: Consider statistics related API for Apache Arrow C++.

zeroshade commented 6 months ago

I'd be curious what others think of this approach as opposed to actually making a format change to include statistics alongside the record batches in the API. Particular in the case of a stream of batches.

I'm not against it, I just don't know if others would be opposed to needing an entirely separate record batch being sent containing the statistics

ianmcook commented 6 months ago

I'd be curious what others think of this approach as opposed to actually making a format change to include statistics alongside the record batches in the API

I think the top priority should be to avoid breaking ABI compatibility. I suspect that most users of the C data interface will not want to pass statistics. We should avoid doing anything that would cause disruption for those users.

kou commented 6 months ago

Ah, I should have written an approach that changes the current Arrow C data interface. I'll write it later.

mapleFU commented 6 months ago

If the Parquet file metadata includes Parquet column statistics such as distinct_count, max, min, and null_count, can the sending module pass those statistics through the C data interface, to allow the receiving module to use the statistics to perform computations more efficiently?

This proposal is great. Just a un-related issue, Parquet DistinctCount might merely used currently. And now Dataset Scanner might parsing the min-max / null as Expression

mapleFU commented 6 months ago

https://github.com/apache/arrow/issues/38837#issuecomment-2074371230

Is {u}int64/float64/string the only supported types? What would the min-max schema being when facing the different types?

lidavidm commented 6 months ago

At least for ADBC, the idea was that other types can be encoded in those choices. (Decimals can be represented as bytes, dates can be represented as int64, etc.)

kou commented 6 months ago

Some approaches that are based the C Data interface https://arrow.apache.org/docs/format/CDataInterface.html :

(1) Add `get_statistics` callback to `ArrowArray`

For example:

struct ArrowArray {
  // Array data description
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;

  // Callback to return statistics of this ArrowArray
  struct ArrowArray *(*get_statistics)(struct ArrowArray*);
  // Release callback
  void (*release)(struct ArrowArray*);
  // Opaque producer-specific data
  void* private_data;
};

This uses a struct ArrowArray to represent statistics like https://github.com/apache/arrow/issues/38837#issuecomment-2074371230 but we can define struct ArrowStatistics or something instead.

Note that this is a backward incompatible change. struct ArrowArray doesn't have version information nor spaces for extension. We can't do this without breaking backward compatibility.

(2) Add statistics to `ArrowSchema::metadata`

https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata

If we choose this approach, we will preserve some metadata key such as ARROW:XXX like we did for IPC format: https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata

Here are some ideas how to put statistics into ArrowSchema::metadata:

Use struct ArrowArray* (pointer) as ARROW:statistics metadata value
Use multiple metadata to represent statistics

Here is an example for the 2. approach:

{
  "ARROW:statistics:column1:max": 2.9,
  "ARROW:statistics:column1:max:approximate": true,
  "ARROW:statistics:column2:average_byte_width": 29.9
}

TODO:

How to encode each value (2.9, true and 29.9) to raw byte data? We can use only raw byte data for a value of ArrowSchema::metadata.
Can we support same name columns with this approach?
This isn't space effective because we have many duplicated texts such as ARROW:statistics:.

Note that this is a (roughly) backward compatible change. I think that most users don't use ARROW:XXX as metadata key.

This may not work with the C stream interface https://arrow.apache.org/docs/format/CStreamInterface.html . Because it shares one struct ArrowSchema with multiple struct ArrowArray. Each struct ArrowArray will have different statistics.

lidavidm commented 6 months ago

Do consumers want per-batch metadata anyways? I would assume in the context of say DuckDB is that they'd like to get statistics for the whole stream up front, without reading any data, and use that to inform their query plan.

kou commented 6 months ago

Ah, then we should not mix statistics and ArrowArray. ArrowArray::get_statistics() may be too late.

lidavidm commented 6 months ago

This is just off the cuff, but maybe we could somehow signal to an ArrowArrayStream that a next call to get_next should instead return a batch of Arrow-encoded statistics data. That wouldn't break ABI, so long as we come up with a good way to differentiate the call. Then consumers like DuckDB could fetch the stats up front (if available) and the schema would be known to them (if we standardize on a single schema for this).

kou commented 6 months ago

Hmm. It may be better that we provide a separated API to get statistics like https://github.com/apache/arrow/issues/38837#issuecomment-2074371230 approach. It may be confused that ArrowArrayStream::get_next() returns statistics or data.

lidavidm commented 5 months ago

We talked about this a little, but what about approach (2) from Kou's comment above, but for now only defining table-level statistics (primarily row count)? AIUI, row count is the important statistic to have for @pdet's use case, and it is simple to define. We can wait and see on more complicated or column-level statistics.

Also for the ArrowDeviceArray, there is some room for extension:

https://github.com/apache/arrow/blob/14c54bbfb7d9305e79a2c2d016c34a655773e5cb/cpp/src/arrow/c/abi.h#L134-L135

Could that be useful here? (Unfortunately there isn't any room for extension on ArrowDeviceArrayStream.)

kou commented 5 months ago

Thanks for sharing our talked idea.

I took a look at the DuckDB implementation. It seems that DucDB uses only column-level statistics:

duckdb::TableFunction::statistics returns the statistics of a specified column:

https://github.com/duckdb/duckdb/blob/main/src/include/duckdb/function/table_function.hpp#L253-L255 https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/function/table_function.hpp#L188-L189

duckdb::BaseStatistics doesn't have row count. It has distinct count, have NULL and have non-NULL:

https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/base_statistics.hpp#L38-L146

It seems that a numeric/string column can have min/max statistics:

https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/numeric_stats.hpp#L22-L31 https://github.com/duckdb/duckdb/blob/670cd341249e266de384e0341f200f4864b41b27/src/include/duckdb/storage/statistics/string_stats.hpp#L23-L36

(A string column can have more statistics such as have Unicode and max length.)

Hmm. It seems that column-level statistics is also needed for real word use cases.

pdet commented 5 months ago

Hey guys,

Thank you very much for starting the design of Arrow statistics! That's exciting!

We are currently interested in up-front full-column statistics. Specially:

Count Distinct (Approximate)
Cardinality of the table
Min-Max

As a clarification, we also utilize row-group min-max for filtering optimizations in DuckDB tables, but these cannot benefit Arrow. In Arrow, we either pushdown filters to an Arrow Scanner or create a filter node on top of the scanner, and we do not utilize Mix-Max of chunks for filter optimization.

For a more detailed picture of what we hold for statistics you can also look in our statistics folder.

But I think that approx count distinct, cardinality, and min-max are enough for a first iteration.

lidavidm commented 5 months ago

Table cardinality would be table level right? But of course the others are column level. Hmm. We didn't leave ourselves room in ArrowDeviceArrayStream...

And just to be clear "up front" means at the start of the stream, not per-batch, right?

pdet commented 5 months ago

Table cardinality would be table level right? But of course the others are column level. Hmm. We didn't leave ourselves room in ArrowDeviceArrayStream..

Exactly!

And just to be clear "up front" means at the start of the stream, not per-batch, right?

Yes, at the start of the stream.

zeroshade commented 5 months ago

@lidavidm technically the C Device structs are still marked as Experimental on the documentation and haven't been adopted by much of the ecosystem yet (as we're still adding more tooling in libarrow and the pycapsule definitions for using them) so adding room in ArrowDeviceArrayStream should still be viable without breaking people.

Either by adding members or an entire separate callback function?

lidavidm commented 5 months ago

I think it would be interesting to add an extra callback to get the statistics, yeah.

lidavidm commented 5 months ago

It's hard though, because arguably it's kind of outside the core intent of the C Data Interface. But on the other hand, they kind of need to be here to have any hope of being supported more broadly.

mapleFU commented 5 months ago

Count Distinct (Approximate)

Hmmm I don't know should we represent this "Approximate"

kou commented 5 months ago

It seems that DuckDB uses HyperLogLog for computing distinct count: https://github.com/duckdb/duckdb/blob/d26007417b7770860ae78278c898d2ecf13f08fd/src/include/duckdb/storage/statistics/distinct_statistics.hpp#L25-L26

It may be the reason why "Approximate" is included here.

mapleFU commented 5 months ago

Nice, I mean should "distinct count" ( or ndv ) be "estimated" in our data interface? And if we add a "estimated", should we add an "exact" ndv or just "estimated" is ok here?

In (1) https://github.com/apache/arrow/issues/38837#issuecomment-2088101530 , would adding a key here be heavy?

lidavidm commented 5 months ago

The ADBC encoding allows the producer to mark statistics as exact/approximate, fwiw

In (1) https://github.com/apache/arrow/issues/38837#issuecomment-2088101530 , would adding a key here be heavy?

See https://github.com/apache/arrow/issues/38837#issuecomment-2093202774

pdet commented 5 months ago

Exactly,

In general, systems don't provide exact distinct statistics because these are rather expensive to calculate. Hence, they do some approximate strategies. In DuckDB's case, it is a HyperLogLog implementation

kou commented 5 months ago

Thanks for sharing more information.

Here is my new idea:

It's based on the "(2) Add statistics to ArrowSchema::metadata in https://github.com/apache/arrow/issues/38837#issuecomment-2088101530 .

It puts all statistics to the metadata in the top-level ArrowSchema. But I noticed that we don't need to do it. We can put statistics for each child (column) to child ArrowSchema::metadata.

If we have a record batch that has int32 column1 and string column2, we have the following ArrowSchema:

ArrowSchema {
  .format = "+siu",
  .children = {
    ArrowSchema {
      .name = "column1",
      .format = "i",
    },
    ArrowSchema {
      .name = "column2",
      .format = "u",
    },
  },
}

We can put a ArrowArray* for statistics to each child ArrowSchema::metadata instead of putting all statistics to the top-level Arrow::Schema::metadata:

ArrowSchema {
  .format = "+siu",
  .metadata = {
    "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
  },
  .children = {
    ArrowSchema {
      .name = "column1",
      .format = "i",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
    ArrowSchema {
      .name = "column2",
      .format = "u",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
  },
}

ArrowArray* for statistics can use simpler schema than https://github.com/apache/arrow/issues/38837#issuecomment-2074371230 :

Field Name	Field Type	Comments
key	int16 not null	(1)
value	`VALUE_SCHEMA` not null
is_approximate	bool not null	(2)

A dictionary-encoded statistic name (although we do not use the Arrow dictionary type). Values in [0, 1024) are reserved for ADBC. Other values are for implementation-specific statistics. For the definitions of predefined statistic types, see adbc-table-statistics. To get driver-specific statistic names, use AdbcConnectionGetStatisticNames().
If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

Field Name	Field Type
int64	int64
uint64	uint64
float64	float64
binary	binary

lidavidm commented 5 months ago

Since it's now per-column, maybe we can just let the type of the underlying array be one of the VALUE_SCHEMA fields?

kou commented 5 months ago

I think that we can't do it. Because statistics of a column may have min/max, count distinct, max byte width and so on. For a string column, we can use binary for min/max but can't use binary for count distinct or max byte width. We need to use int64/uint64 for them. So I think that we need multiple types for value.

lidavidm commented 5 months ago

Right, we still need multiple types. But for min/max statistics, we can include the actual type as one of the possible types, right? (In other words, a sort of VALUE_SCHEMA<T>.)

kou commented 5 months ago

Ah, you meant that we can add int32 to VALUE_SCHEMA when column uses int32, right? Yes. We can do it.

But it may be difficult to use. If we may use different schema for each column, we also need to provide an ArrowSchema for each column via ArrowSchema::metadata. (ARROW:statistics_schema?)

lidavidm commented 5 months ago

Hmm, the caller should know the schema still so long as we always put the dynamically typed column (so to speak) at a fixed column index

kou commented 5 months ago

Ah, you're right.

kou commented 5 months ago

Here is the latest idea. Feedback is welcome.

It's based on https://github.com/apache/arrow/issues/38837#issuecomment-2096990408 and https://github.com/apache/arrow/issues/38837#issuecomment-2097062873 .

If we have a record batch that has int32 column1 and string column2, we have the following ArrowSchema. Note that metadata has "ARROW:statistics" => ArrowArray*.

ArrowSchema {
  .format = "+siu",
  .metadata = {
    "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
  },
  .children = {
    ArrowSchema {
      .name = "column1",
      .format = "i",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
    ArrowSchema {
      .name = "column2",
      .format = "u",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
  },
}

ArrowArray* for statistics use the following schema:

Field Name	Field Type	Comments
key	int16 not null	(1)
value	`VALUE_SCHEMA` not null
is_approximate	bool not null	(2)

A dictionary-encoded statistic name (although we do not use the Arrow dictionary type). Values in [0, 1024) are reserved for Apache Arrow. The values should be aligned with ADBC. Other values are for implementation-specific statistics. For the definitions of predefined statistic types, see adbc-table-statistics.

TODO: Should we provide a feature to get driver-specific statistic names. ADBC has AdbcConnectionGetStatisticNames()? Or should we use string instead of int16?
If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

Field Name	Field Type	Comments
int64	int64
uint64	uint64
float64	float64
value	The same type of the `ArrowSchema` that is belonged to.	(3)

If the ArrowSchema's type is string, this type is also string.

TODO: Is value good name?

TODO: Should we embed VALUE_SCHEMA to the statistics schema something like the following?

Field Name	Field Type	Comments
key	int16 not null	(1)
value_int64	int64	(4)
value_uint64	uint64	(4)
value_float64	float64	(4)
value	The same type of the `ArrowSchema` that is belonged to.	(3) (4)
is_approximate	bool not null	(2)

One of them is "not null".

kou commented 5 months ago

TODO: Should we provide a feature to get driver-specific statistic names. ADBC has AdbcConnectionGetStatisticNames()? Or should we use string instead of int16?

FYI: Here is the reason why ADBC chose dictionary encoding:

https://github.com/apache/arrow-adbc/issues/685#issuecomment-1593400443

Do we encode the statistic names as strings, or requiring dictionary encoding, or specifying an enumeration? (I would prefer dictionary encoding, but this complicates implementation a bit. The benefit is that if we specify some fixed dictionary values, we can save space on the common values and avoid lots of string comparisons while still allowing self-describing extensibility by vendors)

lidavidm commented 5 months ago

Since there's no available "side channel" here, string names probably make more sense

lidavidm commented 5 months ago

TODO: Should we embed VALUE_SCHEMA to the statistics schema something like the following?

That would essentially be a sparse union? Though I guess if you assume the caller knows the right type for a particular kind of statistic you can save a bit on the encoding, and presumably there aren't enough different statistics for the extra allocated space to matter (as compared to a dense union)

lidavidm commented 5 months ago

Just to be clear, when we say

    "ARROW:statistics" => ArrowArray*,

this means the address of the ArrowArray will be encoded (as a base 10 string?) in the metadata?

kou commented 5 months ago

Since there's no available "side channel" here, string names probably make more sense

OK. Let's use string for statistic key.

TODO: Should we embed VALUE_SCHEMA to the statistics schema something like the following?

That would essentially be a sparse union?

Yes.

Though I guess if you assume the caller knows the right type for a particular kind of statistic you can save a bit on the encoding, and presumably there aren't enough different statistics for the extra allocated space to matter (as compared to a dense union)

This idea is not for space efficient. I thought this may be easier to use. Union may be a bit complicated. But users need to know which column is used for each statistics key as you mentioned. (e.g. distinct_count uses value_uint64 and max_value uses value.) This may be harder to use for implementation specific statistics.

Let's use union like ADBC does.

But we have a problem for the union approach:

TODO: Is value good name?

value.value to refer VALUE_SCHEMA's value from the top-level record batch ({key, value, is_approximate}) may be a bit strange...

Just to be clear, when we say
    "ARROW:statistics" => ArrowArray*,
this means the address of the ArrowArray will be encoded (as a base 10 string?) in the metadata?

Yes. I should have mentioned it explicitly.

lidavidm commented 5 months ago

If the caller doesn't import the statistics array, how will it get released?

kou commented 5 months ago

How about applying the "Member allocation" semantics?

https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation

Therefore, the consumer MUST not try to interfere with the producer’s handling of these members’ lifetime. The only way the consumer influences data lifetime is by calling the base structure’s release callback.

The current ImportRecordBatch() doesn't have an option that doesn't call the statistics array's release(). So we need to add more APIs for this use case (import member array not base array) something like ImportMemberRecordBatch().

Or we can add Schema::statistics() and ImportSchema() import ARROW:statistics automatically.

lidavidm commented 5 months ago

Ah, that works. I suppose we could also just make release a no-op for the child.

kou commented 5 months ago

It's a good idea!

lidavidm commented 5 months ago

@ianmcook @zeroshade how does the sketch sound?

kou commented 5 months ago

Updated version. Feedback is welcome. I'll share this idea to the dev@arrow.apache.org mailing list too.

It's based on https://github.com/apache/arrow/issues/38837#issuecomment-2108891730 .

If we have a record batch that has int32 column1 and string column2, we have the following ArrowSchema. Note that metadata has "ARROW:statistics" => ArrowArray*. ArrowArray* is a base 10 string of the address of an ArrowArray because we can use only string for metadata value. You can't release the statistics ArrowArray*. (Its release is a no-op function.) It follows https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation semantics. (The base ArrowSchema owns statistics ArrowArray*.)

ArrowSchema {
  .format = "+siu",
  .metadata = {
    "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
  },
  .children = {
    ArrowSchema {
      .name = "column1",
      .format = "i",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
    ArrowSchema {
      .name = "column2",
      .format = "u",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
  },
}

ArrowArray* for statistics use the following schema:

Field Name	Field Type	Comments
key	string not null	(1)
value	`VALUE_SCHEMA` not null
is_approximate	bool not null	(2)

We'll provide pre-defined keys such as max, min, byte_width and distinct_count but users can use application specific keys too.
If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

Field Name	Field Type	Comments
int64	int64
uint64	uint64
float64	float64
value	The same type of the `ArrowSchema` that is belonged to.	(3)

If the ArrowSchema's type is string, this type is also string.

TODO: Is value good name?

lidavidm commented 5 months ago

One nit:

Its release is NULL

It probably shouldn't be NULL since either (1) caller may expect it to be non-NULL and try to call it or (2) caller may assume NULL means that the array was already released

It would be safer to have a release() that just does nothing.

kou commented 5 months ago

Ah, sorry. You're right. It should be a no-op release(). I'll fix the comment.

apache / arrow