I've been thinking about how YouTube could store all their user comments on CockroachDB. On YouTube you can post a comment against a video. You can also reply to an existing comment about a video. So the comment space can be modeled as a table where every row represents a video, with a column containing an array of all the comments related to that video. In addition, comments can reference other comments, so the comment type needs to be a tuple like <id, other-comment-id, comment>
The array type should support fast append and update operations, to allow quick commenting and editing of comments.
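A minimal sketch of the array-per-video layout described above, in Postgres-style syntax; the comment_entry type and video_comments table are invented for illustration, and support for composite-type arrays in CockroachDB is not assumed:

```sql
-- Hypothetical schema: one row per video, all comments in a single array column.
CREATE TYPE comment_entry AS (id INT, reply_to INT, body TEXT);

CREATE TABLE video_comments (
    video_id INT PRIMARY KEY,
    comments comment_entry[]   -- every comment for the video lives in one array value
);

-- Appending a comment rewrites the whole array value, which is where the
-- contention and size concerns discussed below come from.
UPDATE video_comments
   SET comments = array_append(comments, ROW(42, NULL, 'nice video')::comment_entry)
 WHERE video_id = 1;
```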
@vivekmenezes Your method would work for small comment sets, but you would quickly run into contention on that array for videos where people comment quickly, and into size problems with large numbers of comments. However, I think your suggestion of update operations is a great one. Having array operations built in would avoid having to read the value and process it locally in a transaction before sending the new array back to the server. Using update commands could create much shorter transactions, which would obviously be good for performance.
One of the reasons I put in a real-world example is so that an architecture where each element of the array maps to a different key-value pair in the underlying KV store can be part of the consideration here. It will still mean a lot of contention reading the entire array, but perhaps we need to support a read with a timestamp a few milliseconds in the past anyway, or perhaps a read up to a certain array index, to reduce contention.
Alternatively, propose a different schema for the domain: a table where each individual comment is stored in its own row with the comment id as the primary key, plus a column for the video id with a secondary index on it. The comments for a video are then read via that secondary index.
A key design consideration is the ability to read all the comments off a single node (few nodes)
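For comparison, a minimal sketch of that normalized alternative; the table and index names are invented, and the index is created as a separate statement so the same SQL works in Postgres as well:

```sql
-- One row per comment; comments for a video are fetched via the index on video_id.
CREATE TABLE comments (
    comment_id INT PRIMARY KEY,
    video_id   INT NOT NULL,
    reply_to   INT,        -- optional reference to another comment
    body       TEXT
);

CREATE INDEX comments_by_video ON comments (video_id);

-- Read every comment for one video.
SELECT comment_id, reply_to, body FROM comments WHERE video_id = 1;
```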
WWPGD (What would Postgres do) seems like a pretty good starting point for db-related questions (it has served SQLite well, although SQLite is very different from Cockroach).
I suspect that inline arrays should probably not go over a "modest" size. I'm not saying it should be forbidden but probably discouraged and not optimized for.
At the very least there comes a point where returning the whole array becomes not very useful and that probably means that you need to filter, and once you filter you should probably be using a separate table.
That's my 2 cents at least.
We might also consider the Cassandra model of collections - list, set, and map. Sure, sometimes people actually do want mathematical arrays, matrices, and multi-dimensional arrays (ala Postgres), but sometimes they just want simple lists or are trying to simulate a set using an array, or similarly storing keyword/value pairs in two-dimensional arrays or parallel linear arrays.
The map collection type is nice since it lets you store and query arbitrary semi-structured data, typically keyword/value pairs much more naturally.
Cassandra has convenient operations to operate on lists, sets, and maps, such as append and prepend for list, removing list elements, removing all occurrences of a value from a list or set, removing from a map by keyword, etc. Doing all this with raw arrays would be... not so fun.
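For a rough comparison, here is what a couple of those list-style operations look like with plain Postgres array functions today; the users table and phone_numbers column are hypothetical:

```sql
-- Append a value to an array column.
UPDATE users SET phone_numbers = array_append(phone_numbers, '555-1234')
 WHERE user_id = 1;

-- array_remove drops all occurrences of the value; there is no built-in
-- "remove the element at position i" function, so positional edits need
-- slice-and-concatenate gymnastics.
UPDATE users SET phone_numbers = array_remove(phone_numbers, '555-1234')
 WHERE user_id = 1;
```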
And I would encourage the same philosophy that Cassandra adopted for lists, sets, and maps: "Collections are meant for storing/denormalizing relatively small amount of data. They work well for things like “the phone numbers of a given user”, “labels applied to an email”, etc. But when items are expected to grow unbounded (“all the messages sent by a given user”, “events registered by a sensor”, ...), then collections are not appropriate anymore and a specific table (with clustering columns) should be used."
It all depends on the use case.
See: https://cassandra.apache.org/doc/cql3/CQL.html#collections
The Cassandra collection model looks really nice. It also appears similar to what I was recommending. I have quoted some of what I deem as the important parts below:
That limit seems pretty high but I wouldn't be surprised if it is considered bad practice to get anywhere near that. Personally I would be thinking that embedded collections would be suited for "tens of elements" at most.
Again, there are multiple right answers, just sharing what I'm thinking :)
+1 for the Cassandra collection model. Arrays are very cool: they reduce the number of queries and checks needed to get data, in both one-to-many and many-to-many cases. In many cases we emulate a set with two queries (even when we have an array), and maps eliminate two-column tables and unnecessary relations. All of this inches us closer to the embedding model (like in MongoDB) and increases both convenience (fewer queries) and performance.
please implement
Any updates on this @petermattis ?
@amangoeliitb ARRAY is on our roadmap, but not scheduled for the near term. Do you mind explaining what your use case is? Perhaps there is a suitable alternative using the existing SQL functionality.
@petermattis One of the best use cases for ARRAY and JSON types would be to enable the implementation of a simple row-level ACL system. They could also be a more performant alternative to JOINs in several cases, especially in a distributed setup.
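As a rough illustration of the row-level ACL idea (assuming Postgres-style TEXT[] support; the documents table and read_acl column are invented):

```sql
-- Each row carries the list of principals allowed to read it.
CREATE TABLE documents (
    doc_id   INT PRIMARY KEY,
    body     TEXT,
    read_acl TEXT[]   -- e.g. '{alice,group:engineering}'
);

-- Return only the rows the current principal may see.
SELECT doc_id, body FROM documents WHERE 'alice' = ANY (read_acl);
```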
I have seen quite a few array-related PRs being merged lately and was wondering how far we are from basic support of the array type. Thanks!
+1 re: @kulshekhar's comment "could also be more performant alternative to joins in several cases"
What specific functionality do you need for arrays? Postgres exposes a lot of syntax and functions for arrays, so we've been limiting our work to what's needed to support Hibernate ORM.
@cuongdo These are the most important, from my perspective. Additional nice-to-have functionality would be:
Honestly, it would already be useful to be able to simply store and retrieve an array.
Then comes the list @kulshekhar posted, and I agree: being able to index the array and check whether an array contains a given value are probably the most desired features after that.
@jordanlewis @nvanbenschoten could you guys update this issue?
We're currently in a position where ARRAYs can be used in system tables, but not much else. The initial support for them was added for ORM compatibility, and all other functionality built around them was added later for the same reason. This was not because ORMs liberally create tables with ARRAY columns (in fact, I don't think we've ever run into that issue), but because system tables like information_schema and pg_catalog both contain ARRAY columns. Because these are both implemented as virtual schemas, we've been able to support these situations without addressing full ARRAY support.
The big blockers to being able to mark this issue as complete are fleshing out the KV array encoding and adding DistSQL support. Without these, arrays are not actually useful for user-level tables. Neither of these issues has been prioritized for the 1.0 release, so unless others feel differently, I think this should realistically be moved to a post-1.0 milestone, ideally to Q2 2017.
Another key functionality @bdarnell brought up is indexing of ARRAY columns.
Hi @MehSha, @kulshekhar, @tlvenn, @jhartman86, @amangoeliitb, @kajmagnus, @beldpro-ci, @thearchitect, @yinhm, @nickjackson, @arypurnomoz, @mbertschler, @sudhakar, @tlvenn, @bkleef, @domnulnopcea, @iamcarbon, @muesli, @heiskr, @arduanov, @daliwali, @manigandham since you expressed interest in this issue, I was wondering if you had any insight you could share on how you plan to use arrays within Cockroach!
I'm interested in a couple of main questions:
1. How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays: by indexing, via unnest, or something else?
2. How large do you expect your arrays to become (either in elements or in data)?
3. A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing, or GIN indexing? If the former, I'm interested in what kind of use cases you have for that.
4. What's the appeal of using arrays for you, vs. using a normalized table?
Thanks so much!
TL;DR I use arrays to contain "tags" which I almost exclusively query to see if some value is in the array of tags.
How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays, by indexing, via unnest, or something else?
Most of my array usage is checking membership. For example, I have short arrays of small data (numbers or small strings) and I query whether x is a member of that array. Occasionally I will return the whole set of array elements in a query.
My second most common use case is a small set of data, such as email addresses, which I will almost always return in its entirety.
I almost never access by index.
How large do you expect your arrays to become (either in elements or in data)?
Very small. Most are 0-2 elements; a couple get to around 20. Again, these are integers or short strings.
A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing? Or GIN indexing? If the former, I'm interested in what kind of use-cases you have for that.
I occasionally use a GIN index to find rows that contain a certain element in an array (see the sketch after this comment).
What's the appeal of using arrays for you, vs. using a normalized table?
I expect that they will be stored closer to the row itself for quick checking as well as the ease of indexing and querying.
Hope that helps explain my use case. Feel free to ask for clarification or anything else.
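For reference, a small Postgres-style sketch of the tag-membership pattern described in this comment; the items table, tags column, and index name are invented:

```sql
CREATE TABLE items (
    item_id INT PRIMARY KEY,
    tags    TEXT[]   -- short array of small strings
);

-- GIN index so membership checks don't have to scan every row.
CREATE INDEX items_tags_gin ON items USING GIN (tags);

-- "Does this row carry the 'urgent' tag?" -- containment can use the GIN index.
SELECT item_id FROM items WHERE tags @> ARRAY['urgent'];
```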
How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays, by indexing, via unnest, or something else?
I mainly use these fields with queries. When I have to use the data from an array field, I normally parse it in the application.
How large do you expect your arrays to become (either in elements or in data)?
I have applications with array field sizes ranging up to 20,000 elements.
A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing? Or GIN indexing? If the former, I'm interested in what kind of use-cases you have for that.
GIN indexing does it for me
What's the appeal of using arrays for you, vs. using a normalized table?
The main use case is application-level ACL - arrays allow row-level permissions to be managed in a flexible and performant manner. They would also work better than joins in a distributed setup, I imagine.
I'll join in, why not? :)
How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays, by indexing, via unnest, or something else?
Querying using ANY.
How large do you expect your arrays to become (either in elements or in data)?
Not unbounded, rarely more than a list of a dozen IDs or so.
A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing? Or GIN indexing? If the former, I'm interested in what kind of use-cases you have for that.
I'm not at the scale where that matters... yet ;)
What's the appeal of using arrays for you, vs. using a normalized table?
I'm not a 3rd normal form queen, just as long as I don't have duplicate data. I'm actually transitioning from NoSQL, so even just having a schema is a step in the traditional route. In these tables I'm storing versions of things, and the relationships are part of the versioning, so having to join over the different versions in a different table... it'd be like (parent_kind, parent_id, parent_version_id, child_kind... etc) Not a pleasant query to do every time I want to read, which is way way more common than write.
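A tiny sketch of the pattern just described, assuming a Postgres-style INT[] column; the things table and related_ids column are made up for illustration:

```sql
-- Rows keep their related IDs inline in a small array instead of a link table.
CREATE TABLE things (
    thing_id    INT PRIMARY KEY,
    related_ids INT[]   -- usually a dozen IDs at most
);

-- "Does this row reference thing 42?"
SELECT thing_id FROM things WHERE 42 = ANY (related_ids);
```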
I would say though that jsonb support is the bigger blocker at this point.
I would say though that jsonb support is the bigger blocker at this point
I would agree with this. jsonb could also be used for arrays.
How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays, by indexing, via unnest, or something else?
unnest, plus the array operators array_cat, array_append, array_remove, array_length, etc. I wouldn't be replacing entire arrays (see the sketch after this comment).
How large do you expect your arrays to become (either in elements or in data)?
Arbitrarily large, perhaps up to the row size limit.
A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing? Or GIN indexing? If the former, I'm interested in what kind of use-cases you have for that.
I haven't had a use case that would benefit from indexing array columns yet.
What's the appeal of using arrays for you, vs. using a normalized table?
Better read performance and fewer joins, plus more flexible data modelling.
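A hedged sketch of the unnest and array-operator usage mentioned in this comment; the posts table and labels column are hypothetical:

```sql
CREATE TABLE posts (
    post_id INT PRIMARY KEY,
    labels  TEXT[]
);

-- Expand the array into rows, e.g. to count label usage across posts.
SELECT label, count(*)
  FROM posts, unnest(labels) AS label
 GROUP BY label;

-- Modify arrays in place rather than replacing the whole value.
UPDATE posts SET labels = array_append(labels, 'archived') WHERE post_id = 1;
UPDATE posts SET labels = array_remove(labels, 'draft')    WHERE post_id = 1;
SELECT array_length(labels, 1) FROM posts WHERE post_id = 1;
```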
Thanks so much all of you for your input! It's much appreciated.
I haven't had a use case that would benefit from indexing array columns yet.
A simple, and probably common, scenario would be a standard MIME email that has been broken out into parts so it's indexed/searchable.
Such a domain model would (usually) contain a small number of TO and CC headers representing a (typically) small number of recipients.
In this case, it's quite convenient to be able to have a native array type in a column that is indexed so you can search for a specific email address. In Postgres, this is easily accomplished with a text[] column, a GIN index, and array operators while querying (which GIN indexes support; see https://www.postgresql.org/docs/9.6/static/indexes-types.html).
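A small sketch of that scenario in Postgres syntax; the email_parts table and index name are invented for illustration:

```sql
CREATE TABLE email_parts (
    message_id UUID PRIMARY KEY,
    to_addrs   TEXT[],   -- small number of TO recipients
    cc_addrs   TEXT[],   -- small number of CC recipients
    body       TEXT
);

CREATE INDEX email_to_gin ON email_parts USING GIN (to_addrs);

-- Find every message addressed to a given recipient; the @> containment
-- operator can use the GIN index.
SELECT message_id FROM email_parts WHERE to_addrs @> ARRAY['alice@example.com'];
```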
Going to close this since the only outstanding issue is not blocking array usage and has its own issue.
Going to close this since the only outstanding issue is not blocking array usage and has its own issue.
@justinj It wasn't immediately obvious to me when checking this issue's updates that the feature was actually implemented and is part of 1.1 (currently in beta). Just noting that for anyone else looking here as well.
= ANY(array) like for IN expressions #13593