elixir-explorer / adbc

Apache Arrow ADBC bindings for Elixir
https://arrow.apache.org/adbc/
Apache License 2.0

Handle bad tag in get_info/get_objects #14

Closed (josevalim closed 1 year ago)

cocoa-xu commented 1 year ago

I noticed that the error message, size_object: bad tag for 0x0, was printed by the Erlang runtime. That means

ERL_NIF_TERM ret = array_stream->make_resource(env);

didn't return a valid Erlang NIF term.

cocoa-xu commented 1 year ago

arrow_array_to_nif_term seems problematic. It may have some out-of-bounds memory writes.

cocoa-xu commented 1 year ago

Actually, it's because we don't support some formats yet.

But I'm confused by the type ids in the ADBC docs:

A sparse_union<ints: int32, floats: float32> with type ids 4, 5 has format string +us:4,5; its two children have names ints and floats, and format strings i and f respectively.

I'm not sure why, in this case, the type id is 4 for int32 and 5 for float32.
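For reference, in Arrow's union format strings the comma-separated numbers after the colon are listed in the same order as the union's children, so each type id maps to a child by position and the ids themselves need not start at 0. A minimal Python sketch of that reading follows; the helper name parse_union_format is hypothetical and is not part of this library.

```python
def parse_union_format(fmt):
    """Split an Arrow union format string such as "+us:4,5".

    Returns (mode, id_to_child), where id_to_child maps each type id
    to the position of the child it selects.
    parse_union_format is a hypothetical helper for illustration only.
    """
    if not fmt.startswith(("+us:", "+ud:")):
        raise ValueError(f"not a union format string: {fmt!r}")
    # The third character distinguishes sparse ("s") from dense ("d") unions.
    mode = "sparse" if fmt[2] == "s" else "dense"
    # Type ids appear in child order: the n-th id selects the n-th child.
    type_ids = [int(part) for part in fmt[4:].split(",")]
    return mode, {type_id: index for index, type_id in enumerate(type_ids)}


sparse_example = parse_union_format("+us:4,5")
dense_example = parse_union_format("+ud:0,1,2,3,4,5")
print(sparse_example)
print(dense_example)
```

Under this reading, type id 4 selects child 0 (ints) and type id 5 selects child 1 (floats) in the quoted example.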

More confusingly, this is the doc for Connection.get_info/{1,2}:

  @doc """
  Get metadata about the database/driver.

  The result is an Arrow dataset with the following schema:

  Field Name                  | Field Type
  ----------------------------|------------------------
  info_name                   | uint32 not null
  info_value                  | INFO_SCHEMA

  `INFO_SCHEMA` is a dense union with members:

  Field Name (Type Code)      | Field Type
  ----------------------------|------------------------
  string_value (0)            | utf8
  bool_value (1)              | bool
  int64_value (2)             | int64
  int32_bitmask (3)           | int32
  string_list (4)             | list<utf8>
  int32_to_int32_list_map (5) | map<int32, list<int32>>

  Each metadatum is identified by an integer code. The recognized
  codes are defined as constants. Codes [0, 10_000) are reserved
  for ADBC usage. Drivers/vendors will ignore requests for
  unrecognized codes (the row will be omitted from the result).
  """

The root object is a struct with 2 items, info_name and info_value. The info_value's format is +ud:0,1,2,3,4,5.

I don't know whether these type ids are the relative positions of the members in the union (which doesn't fit the official example, where ids 0 to 3 are missing) or whether each id denotes a specific type (which doesn't fit either, because int32 has id 4 in the official example but 3 for the int32_bitmask field here).

If the driver can choose them arbitrarily for each field, how do we use this information? Or should we just return a map for union types? For example:

%{
  info_name: [0, 1],
  info_value: %{
    # atom keys, so the map is valid Elixir
    string_value: "foo",
    bool_value: true,
    int64_value: 0,
    int32_bitmask: 0,
    string_list: ["bar"],
    # map<int32, list<int32>>: each value is a list
    int32_to_int32_list_map: %{
      0 => [42]
    }
  }
}
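As one possible answer to "how do we use this information": in Arrow's dense union layout every row carries a type id and an offset. The type id selects a child (through the id list in the format string), and the offset indexes into that child's values, so each row has exactly one active member. The Python sketch below illustrates that decoding under stated assumptions: decode_dense_union is a hypothetical helper, the buffers are plain lists, and the sample data is made up.

```python
def decode_dense_union(type_ids, offsets, id_to_child, children):
    """Decode a dense union into one single-key dict per row.

    type_ids    -- per-row type ids (the union's types buffer)
    offsets     -- per-row offsets into the selected child
    id_to_child -- type id -> child position, from the format string
    children    -- list of (name, values) pairs in schema order
    Hypothetical helper; real buffers would come from the Arrow array.
    """
    rows = []
    for type_id, offset in zip(type_ids, offsets):
        name, values = children[id_to_child[type_id]]
        rows.append({name: values[offset]})
    return rows


# Made-up sample shaped like INFO_SCHEMA (+ud:0,1,2,3,4,5).
children = [
    ("string_value", ["foo"]),
    ("bool_value", [True]),
    ("int64_value", [0]),
    ("int32_bitmask", []),
    ("string_list", [["bar"]]),
    ("int32_to_int32_list_map", []),
]
id_to_child = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
type_ids = [0, 2, 4, 1]  # row 0 is a string, row 1 an int64, ...
offsets = [0, 0, 0, 0]   # each row is the first value of its child

rows = decode_dense_union(type_ids, offsets, id_to_child, children)
print(rows)
```

With this shape, each result row is tagged with the name of its active member, rather than one map holding every member at once.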