apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0

Schema DSL for testing #566

Open bkietz opened 1 month ago

bkietz commented 1 month ago

Arrow C++ includes factories for constructing schemas, types, fields, and metadata, which keep the construction of even deeply nested structures expressive:

schema({
  field("some_col", int32(), key_value_metadata({
    {"some_key_field", "some_value_field"},
  })),
}, key_value_metadata({{"some_key", "some_value"}})),

It should be straightforward to write equivalent factories which build a nanoarrow::UniqueSchema.
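
For illustration, here is a minimal sketch of what such factories could look like. It is deliberately independent of nanoarrow so that it stands alone: the `ArrowSchema` struct is copied from the Arrow C Data Interface specification, while `Field`, `field`, `schema`, `Build` and `ReleaseSchema` are hypothetical names invented for this example. A real version would presumably return a `nanoarrow::UniqueSchema` and use nanoarrow's own schema helpers; metadata and dictionaries are left out here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// ABI struct, copied from the Arrow C Data Interface specification.
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

// Hypothetical declarative description of one schema node.
struct Field {
  std::string format;  // Arrow format string, e.g. "i" (int32) or "+s" (struct)
  std::string name;
  std::vector<Field> children;
  int64_t flags = 2;  // ARROW_FLAG_NULLABLE
};

// Hypothetical factories loosely mirroring the Arrow C++ spelling.
Field field(std::string name, std::string format, std::vector<Field> children = {}) {
  Field f;
  f.format = std::move(format);
  f.name = std::move(name);
  f.children = std::move(children);
  return f;
}
Field schema(std::vector<Field> children) { return field("", "+s", std::move(children)); }

// Heap copy of a string; freed with delete[] in ReleaseSchema.
static char* CopyString(const std::string& s) {
  char* out = new char[s.size() + 1];
  std::memcpy(out, s.c_str(), s.size() + 1);
  return out;
}

// Release callback: releases and frees children, then marks the struct released.
static void ReleaseSchema(ArrowSchema* s) {
  for (int64_t i = 0; i < s->n_children; ++i) {
    if (s->children[i]->release != nullptr) s->children[i]->release(s->children[i]);
    delete s->children[i];
  }
  delete[] s->children;
  delete[] s->format;
  delete[] s->name;
  s->release = nullptr;
}

// Materialize the declarative description into a caller-owned ArrowSchema.
void Build(const Field& f, ArrowSchema* out) {
  assert(out != nullptr);
  out->format = CopyString(f.format);
  out->name = CopyString(f.name);
  out->metadata = nullptr;  // metadata is omitted in this sketch
  out->flags = f.flags;
  out->n_children = static_cast<int64_t>(f.children.size());
  out->children = f.children.empty() ? nullptr : new ArrowSchema*[f.children.size()];
  for (size_t i = 0; i < f.children.size(); ++i) {
    out->children[i] = new ArrowSchema;
    Build(f.children[i], out->children[i]);
  }
  out->dictionary = nullptr;
  out->release = &ReleaseSchema;
  out->private_data = nullptr;
}
```

With these definitions, `Build(schema({field("some_col", "i")}), &out)` fills `out` with a struct schema holding one int32 child, and `out.release(&out)` frees it again.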

bkietz commented 1 month ago

This should include a schema equality utility too

paleolimbot commented 1 month ago

We could certainly replicate Arrow C++'s syntax here, although I am hesitant to add scope to nanoarrow or make it seem like we are trying to replace anything about Arrow C++.

> This should include a schema equality utility too

We have a few places that do something like this. For integration testing we have one that is slow (and somewhat specific to the types of schemas that show up in the integration testing) but generates a nice diff:

https://github.com/apache/arrow-nanoarrow/blob/2040e74add0a3c8a36877bce35c7dc43c27ba0e4/src/nanoarrow/integration/c_data_integration.cc#L151-L162

...and in Python we have one (that should almost certainly be written in C) that performs the check but doesn't generate very useful output on failure:

https://github.com/apache/arrow-nanoarrow/blob/2040e74add0a3c8a36877bce35c7dc43c27ba0e4/python/src/nanoarrow/_schema.pyx#L349-L402

Both of those are pretty specific to exactly what we needed them for.
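
As a rough illustration of a middle ground (a plain structural check that still says where two schemas diverge), here is a sketch written directly against the C Data Interface struct. `SchemaDiff` and `Str` are hypothetical names, not existing nanoarrow functions, and the sketch does not compare metadata, which would need the binary metadata layout to be decoded first.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// ABI struct, copied from the Arrow C Data Interface specification.
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

// Treat a null C string as empty so that null and "" names compare equal.
static std::string Str(const char* s) { return s != nullptr ? s : ""; }

// Hypothetical helper: returns "" when the schemas match, otherwise a
// description of the first mismatch found. Metadata is NOT compared.
std::string SchemaDiff(const ArrowSchema* a, const ArrowSchema* b,
                       const std::string& path = "$") {
  assert(a != nullptr && b != nullptr);
  if (Str(a->format) != Str(b->format)) {
    return path + ".format: '" + Str(a->format) + "' != '" + Str(b->format) + "'";
  }
  if (Str(a->name) != Str(b->name)) {
    return path + ".name: '" + Str(a->name) + "' != '" + Str(b->name) + "'";
  }
  if (a->flags != b->flags) {
    return path + ".flags: " + std::to_string(a->flags) + " != " + std::to_string(b->flags);
  }
  if (a->n_children != b->n_children) {
    return path + ".n_children: " + std::to_string(a->n_children) + " != " +
           std::to_string(b->n_children);
  }
  // Recurse into children, extending the path so the report is locatable.
  for (int64_t i = 0; i < a->n_children; ++i) {
    std::string d = SchemaDiff(a->children[i], b->children[i],
                               path + ".children[" + std::to_string(i) + "]");
    if (!d.empty()) return d;
  }
  if ((a->dictionary == nullptr) != (b->dictionary == nullptr)) {
    return path + ".dictionary: present on only one side";
  }
  if (a->dictionary != nullptr) {
    return SchemaDiff(a->dictionary, b->dictionary, path + ".dictionary");
  }
  return "";
}
```

Stopping at the first mismatch keeps the function short; collecting every difference into a list would give output closer to the integration-testing diff, at the cost of more code.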

paleolimbot commented 1 month ago

I sent this to you offline as well, but I'll post it here too! For generating integration test JSON we faced a situation similar to serializing IPC schemas, and went with a helper function plus a lambda to generate the full range of data types:

https://github.com/apache/arrow-nanoarrow/blob/2040e74add0a3c8a36877bce35c7dc43c27ba0e4/src/nanoarrow/testing/testing_test.cc#L496-L704

A similar example using Arrow C++ that would be nice to replace:

https://github.com/apache/arrow-nanoarrow/blob/2040e74add0a3c8a36877bce35c7dc43c27ba0e4/src/nanoarrow/ipc/decoder_test.cc#L671-L716

bkietz commented 1 month ago

> I am hesitant to add scope to nanoarrow

If we keep it minimal and closely aligned with the ABI, 100-200 lines would suffice for:

  using namespace nanoarrow::testing::dsl;

  // declare a schema (default format is +s)
  UniqueSchema s = schema{
    // we can make the arguments look kwarg-like
    children{
      {"i", "my int field's name"},
      {"i", dictionary{{"u"}}, "my dictionary field's name",
       metadata{
           "some_key=some_value",
           "some_key2=some_value2",
       },
       ARROW_FLAG_NULLABLE},
    }
  };
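
One piece of such a DSL that is easy to pin down is the `metadata{...}` argument, since `ArrowSchema.metadata` has a fixed binary layout in the C Data Interface: a native-endian int32 pair count, then for each pair an int32 key length, the key bytes, an int32 value length, and the value bytes. A minimal sketch of the encoding step follows. `EncodeMetadata` and `AppendInt32` are hypothetical names, and splitting each string on its first `=` is an assumption about how the DSL would read its arguments.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Append a native-endian int32 to the buffer.
static void AppendInt32(std::string* out, int32_t v) {
  char buf[sizeof(v)];
  std::memcpy(buf, &v, sizeof(v));
  out->append(buf, sizeof(v));
}

// Hypothetical helper: encode "key=value" strings into the C Data Interface
// metadata layout. Each string is split on its FIRST '=' (an assumption),
// so values may themselves contain '='.
std::string EncodeMetadata(const std::vector<std::string>& pairs) {
  assert(pairs.size() <= static_cast<size_t>(INT32_MAX));
  std::string out;
  AppendInt32(&out, static_cast<int32_t>(pairs.size()));
  for (const std::string& kv : pairs) {
    const size_t eq = kv.find('=');
    if (eq == std::string::npos) {
      throw std::invalid_argument("expected key=value, got: " + kv);
    }
    const std::string key = kv.substr(0, eq);
    const std::string value = kv.substr(eq + 1);
    AppendInt32(&out, static_cast<int32_t>(key.size()));
    out += key;
    AppendInt32(&out, static_cast<int32_t>(value.size()));
    out += value;
  }
  return out;
}
```

The returned buffer would have to outlive the schema (for example by being stored in its private data), because `ArrowSchema.metadata` is only a pointer into it.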

paleolimbot commented 1 month ago

I like the idea of putting it in testing (it can move if it becomes popular). Replacing the usage in the Testing JSON generator would probably get you all the unit tests for free!

paleolimbot commented 1 month ago

In searching for Array equality utilities, I found that ADBC's validation utility also has a way to create schemas using nanoarrow for use in testing!

https://github.com/apache/arrow-adbc/blob/36f0cd32af2e3f75b12d4397d1ed9b6ecbc1acce/c/validation/adbc_validation_util.h#L252-L434