cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.85k stars 3.77k forks source link

sql: PG and CRDB arrays are not fully compatible #36815

Open andy-kimball opened 5 years ago

andy-kimball commented 5 years ago

PG supports multi-dimensional arrays where all data can be stored in a flat memory area. The array must have a single element type. CRDB allows arrays to hold nested arrays as elements. There are currently subtle behavioral differences between the two, such as (I'm sure there are others as well; this is not exhaustive list):

CRDB:
root@:26257/defaultdb> select pg_typeof((array[array[1,2], array[3,4]]));
  pg_typeof
+-----------+
  int[][]

root@:26257/defaultdb> select (array[array[1,2], array[3,4]])[1];
  array
+-------+
  {1,2}

PG:
template1=# select pg_typeof((array[array[1,2], array[3,4]]));
 pg_typeof
-----------
 integer[]

template1=# select (array[array[1,2], array[3,4]])[1];
 array
-------

See the below email thread between myself and Raphael for more details.

On Thu, Apr 11, 2019 at 4:12 PM Andy Kimball <andyk@cockroachlabs.com> wrote:
I'm still unsure of how to proceed on the array stuff. I came across these examples:

CRDB:
root@:26257/defaultdb> select pg_typeof((array[array[1,2], array[3,4]]));
  pg_typeof
+-----------+
  int[][]

root@:26257/defaultdb> select (array[array[1,2], array[3,4]])[1];
  array
+-------+
  {1,2}

PG:
template1=# select pg_typeof((array[array[1,2], array[3,4]]));
 pg_typeof
-----------
 integer[]

template1=# select (array[array[1,2], array[3,4]])[1];
 array
-------

Yep, PG returns NULL if you dereference one level of a multi-dimensional array. You only get non-null if you specify all dimensions. It does appear that PG mostly flattens the arrays in their representation, with a bit of sugar to make them appear "sort of" multidimensional (but not really).

Anyway, it seems we already have different behavior than PG with multi-dimensional arrays, as those two examples show. I'm not quite sure how to proceed. Should I try to match PG behavior? Or should I try to match CRDB behavior? Matching PG behavior seems problematic, since our DArray implementation and array functions already seem to assume arrays can be nested arbitrarily, rather than always being flattened into a single dimension.

And here's another mystifying PG example:

template1=# select (array[array[1, 2]::int2vector, array[3,4]::int2vector, array[5,6]::int2vector, array[7,8]::int2vector]);
  array
---------
 1 0 1 0

On Fri, Apr 12, 2019 at 2:28 AM Raphael 'kena' Poss <knz@cockroachlabs.com> wrote:
first of all, in this discussion we're missing a nuance.

Type systems (and type theory) distinguish between nested arrays and multi-dimensional arrays. 
The difference is subtle because some programming languages have decided (for good reason) to make them equivalent in many cases, but here we need to care about it because it's relevant.

To understand the nuance it's useful to model an array type as a combination of:

- a memory layout - i.e. how array values are positioned in memory
- metadata that describe the array type - e.g. information about the element type, dimensions, etc.

The simplest array type is a 1-dimensional array with a "flat" memory layout. The address of the array is the first element in the array.

Now, here's the nuance:

- nested arrays are hierarchies of 1-dimensional arrays, where the element type of the outer arrays is an array. These are also called "Iliffe vectors" (https://en.wikipedia.org/wiki/Iliffe_vector)

- multidimensional arrays put all the values in a single flat memory layout, and use a descriptor to indicate the size of the array in each dimension.

Some interesting distinguishing properties:

- iliffe vectors / nested arrays naturally support triangular matrixes or similar things where each row has a different dimensionality. Heck it's even conceptually possible to have different element types in different dimensions! Whereas true multi-dimensional arrays do not support these things.

- true multidimensional arrays support the "reshape" operator in constant time (e.g. changing the array from size 2x3 to 3x2), whereas nested arrays do not

And now here's the crux:

- internally, postgres supports true multidimensional arrays, and refuses to construct nested arrays / iliffe vectors. There's just one element type (the "cell" in the array) and alongside a dimension vector.

- cockroachdb does not (currently) support true multidimensional arrays, and instead stores arrays in RAM as iliffe vectors with some assertions during planning to refuse constructing an iliffe vector of heterogeneous dimensionality.

Why is this relevant?

- it explains why the OID of arrays in postgres is independent of dimensionality. 
- it makes no sense in postgres to index a multidimensional array partially, to retrieve a "row" at a time -- the indexing operator must return a scalar-like value. This works in cockroachdb because each array "level" is a scalar-like value (a reference to an array at the next level).

On to your questions:

Op 12-04-19 om 01:12 schreef Andy Kimball:
I'm still unsure of how to proceed on the array stuff. I came across these examples:

PG:
template1=# select pg_typeof((array[array[1,2], array[3,4]]));
 pg_typeof
-----------
 integer[]
This one is simple. pg first computes the OID for the type, *then* shows the type name. See explanation above.
We can emulate this behavior by 
1) ensuring this "flattening" behavior for OIDs, as we discussed yesterday

2) modifying `pg_typeof()` to do the same as pg.

template1=# select (array[array[1,2], array[3,4]])[1];
 array
-------

This is a separate concern -- it's about the behavior of the indexing operator. See above. 
I agree we should do the same, and I'd like to thank you for spotting this inconsistency, but it seems to be unrelated to your patchset.

Jira issue: CRDB-4483

knz commented 5 years ago

@mjibson that kind of compatibility transition is related to the discussion we had earlier this week about dates etc. I think the challenge is clearly up your alley.

rafiss commented 1 year ago

See also https://github.com/cockroachdb/cockroach/issues/32552 - in most other places, CRDB bans multi-dimensional arrays.