I've been thinking about how YouTube could store all their user comments on CockroachDB. On YouTube you can post a comment against a video. You can also reply to an existing comment about a video. So the comment space can be modeled as a table where every row represents a video, with a column containing an array of all the comments related to that video. In addition, comments can reference other comments, so the comment type needs to be a tuple like <id, other-comment-id, comment>
The array type should support fast append and update operations, to allow quick commenting and editing of comments.
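A minimal sketch of the array-per-video layout described above, in Postgres-style syntax; the comment_entry type and video_comments table are invented for illustration, and support for composite-type arrays in CockroachDB is not assumed:

```sql
-- Hypothetical schema: one row per video, all comments in a single array column.
CREATE TYPE comment_entry AS (id INT, reply_to INT, body TEXT);

CREATE TABLE video_comments (
    video_id INT PRIMARY KEY,
    comments comment_entry[]   -- every comment for the video lives in one array value
);

-- Appending a comment rewrites the whole array value, which is where the
-- contention and size concerns discussed below come from.
UPDATE video_comments
   SET comments = array_append(comments, ROW(42, NULL, 'nice video')::comment_entry)
 WHERE video_id = 1;
```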
@vivekmenezes Your method would work for small comment sets, but you would quickly run into contention on that array for videos where people comment quickly, and into size problems with large numbers of comments. However, I think your suggestion of update operations is a great one. Having array operations built in would avoid having to read the value and process it locally in a transaction before sending the new array back to the server. Using update commands could create much shorter transactions, which would obviously be good for performance.
One of the reasons I put in a real-world example is so that an architecture where each element of the array maps to a different key-value pair in the underlying KV store can be part of the consideration here. It will still mean a lot of contention reading the entire array, but perhaps we need to support a read with a timestamp a few milliseconds in the past anyway, or perhaps a read up to a certain array index, to reduce contention.
Alternatively, propose a different schema for the domain: a table where each individual comment is stored in its own row with the comment id as the primary key, plus a column for the video id with a secondary index on it. The comments for a video are then read via that secondary index.
A key design consideration is the ability to read all the comments off a single node (few nodes)
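For comparison, a minimal sketch of that normalized alternative; the table and index names are invented, and the index is created as a separate statement so the same SQL works in Postgres as well:

```sql
-- One row per comment; comments for a video are fetched via the index on video_id.
CREATE TABLE comments (
    comment_id INT PRIMARY KEY,
    video_id   INT NOT NULL,
    reply_to   INT,        -- optional reference to another comment
    body       TEXT
);

CREATE INDEX comments_by_video ON comments (video_id);

-- Read every comment for one video.
SELECT comment_id, reply_to, body FROM comments WHERE video_id = 1;
```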
WWPGD (What would Postgres do) seems like a pretty good starting point for db-related questions (it has served SQLite well, although SQLite is very different from Cockroach).
I suspect that inline arrays should probably not go over a "modest" size. I'm not saying it should be forbidden but probably discouraged and not optimized for.
At the very least there comes a point where returning the whole array becomes not very useful and that probably means that you need to filter, and once you filter you should probably be using a separate table.
That's my 2 cents at least.
We might also consider the Cassandra model of collections - list, set, and map. Sure, sometimes people actually do want mathematical arrays, matrices, and multi-dimensional arrays (ala Postgres), but sometimes they just want simple lists or are trying to simulate a set using an array, or similarly storing keyword/value pairs in two-dimensional arrays or parallel linear arrays.
The map collection type is nice since it lets you store and query arbitrary semi-structured data, typically keyword/value pairs much more naturally.
Cassandra has convenient operations to operate on lists, sets, and maps, such as append and prepend for list, removing list elements, removing all occurrences of a value from a list or set, removing from a map by keyword, etc. Doing all this with raw arrays would be... not so fun.
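For a rough comparison, here is what a couple of those list-style operations look like with plain Postgres array functions today; the users table and phone_numbers column are hypothetical:

```sql
-- Append a value to an array column.
UPDATE users SET phone_numbers = array_append(phone_numbers, '555-1234')
 WHERE user_id = 1;

-- array_remove drops all occurrences of the value; there is no built-in
-- "remove the element at position i" function, so positional edits need
-- slice-and-concatenate gymnastics.
UPDATE users SET phone_numbers = array_remove(phone_numbers, '555-1234')
 WHERE user_id = 1;
```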
And I would encourage the same philosophy that Cassandra adopted for lists, sets, and maps: "Collections are meant for storing/denormalizing relatively small amount of data. They work well for things like “the phone numbers of a given user”, “labels applied to an email”, etc. But when items are expected to grow unbounded (“all the messages sent by a given user”, “events registered by a sensor”, ...), then collections are not appropriate anymore and a specific table (with clustering columns) should be used."
It all depends on the use case.
See: https://cassandra.apache.org/doc/cql3/CQL.html#collections
The Cassandra collection model looks really nice. It also appears similar to what I was recommending. I have quoted some of what I deem as the important parts below:
That limit seems pretty high but I wouldn't be surprised if it is considered bad practice to get anywhere near that. Personally I would be thinking that embedded collections would be suited for "tens of elements" at most.
Again, there are multiple right answers, just sharing what I'm thinking :)
+1 for the Cassandra collection model. Arrays are very cool: they reduce the number of queries and checks needed to get data, in both one-to-many and many-to-many cases. In many cases we emulate a set with two queries (even when we have an array), and maps eliminate two-column tables and unnecessary relations. All of this inches us closer to the embedding model (like in MongoDB) and increases both convenience (fewer queries) and performance.
please implement
Any updates on this @petermattis ?
@amangoeliitb ARRAY is on our roadmap, but not scheduled for the near term. Do you mind explaining what your use case is? Perhaps there is a suitable alternative using the existing SQL functionality.
@petermattis One of the best use cases for ARRAY and JSON types would be to enable the implementation of a simple row-level ACL system. They could also be a more performant alternative to JOINs in several cases, especially in a distributed setup.
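As a rough illustration of the row-level ACL idea (assuming Postgres-style TEXT[] support; the documents table and read_acl column are invented):

```sql
-- Each row carries the list of principals allowed to read it.
CREATE TABLE documents (
    doc_id   INT PRIMARY KEY,
    body     TEXT,
    read_acl TEXT[]   -- e.g. '{alice,group:engineering}'
);

-- Return only the rows the current principal may see.
SELECT doc_id, body FROM documents WHERE 'alice' = ANY (read_acl);
```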
I have seen quite a few array-related PRs being merged lately and was wondering how far we are from basic support of the array type. Thanks!
+1 re: @kulshekhar's comment "could also be more performant alternative to joins in several cases"
What specific functionality do you need for arrays? Postgres exposes a lot of syntax and functions for arrays, so we've been limiting our work to what's needed to support Hibernate ORM.
@cuongdo These are the most important, from my perspective. Additional nice-to-have functionality would be:
Honestly, it would already be useful to be able to simply store and retrieve an array.
Then comes the list @kulshekhar posted, and I agree: being able to index the array and check whether an array contains a given value are probably the most desired features after that.
@jordanlewis @nvanbenschoten could you guys update this issue?
We're currently in a position where ARRAYs can be used in system tables, but not much else. The initial support for them was added for ORM compatibility, and all other functionality built around them was added later for the same reason. This was not because ORMs liberally create tables with ARRAY columns (in fact, I don't think we've ever run into that issue), but because system tables like information_schema and pg_catalog both contain ARRAY columns. Because these are both implemented as virtual schemas, we've been able to support these situations without addressing full ARRAY support.
The big blockers to being able to mark this issue as complete are fleshing out the KV array encoding and adding DistSQL support. Without these, arrays are not actually useful for user-level tables. Neither of these issues has been prioritized for the 1.0 release, so unless others feel differently, I think this should realistically be moved to a post-1.0 milestone, ideally to Q2 2017.
Another key functionality @bdarnell brought up is indexing of ARRAY columns.
Hi @MehSha, @kulshekhar, @tlvenn, @jhartman86, @amangoeliitb, @kajmagnus, @beldpro-ci, @thearchitect, @yinhm, @nickjackson, @arypurnomoz, @mbertschler, @sudhakar, @tlvenn, @bkleef, @domnulnopcea, @iamcarbon, @muesli, @heiskr, @arduanov, @daliwali, @manigandham since you expressed interest in this issue, I was wondering if you had any insight you could share on how you plan to use arrays within Cockroach!
I'm interested in a couple of main questions:
1. How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays: by indexing, via unnest, or something else?
2. How large do you expect your arrays to become (either in elements or in data)?
3. A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing, or GIN indexing? If the former, I'm interested in what kind of use cases you have for that.
4. What's the appeal of using arrays for you, vs. using a normalized table?
Thanks so much!
TL;DR I use arrays to contain "tags" which I almost exclusively query to see if some value is in the array of tags.
How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays, by indexing, via unnest, or something else?
Most of my array usage is checking membership. For example, I have short arrays of small data (numbers or small strings) and I query whether x is a member of that array. Occasionally I will return the whole set of array elements in a query.
My second most common use case is a small set of data, such as email addresses, which I will almost always return in its entirety.
I almost never access by index.
How large do you expect your arrays to become (either in elements or in data)?
Very small. Most are 0-2 elements; a couple get to around 20. Again, these are integers or short strings.
A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing? Or GIN indexing? If the former, I'm interested in what kind of use-cases you have for that.
I occasionally use a GIN index to find rows that contain a certain element in an array (see the sketch after this comment).
What's the appeal of using arrays for you, vs. using a normalized table?
I expect that they will be stored closer to the row itself for quick checking as well as the ease of indexing and querying.
Hope that helps explain my use case. Feel free to ask for clarification or anything else.
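For reference, a small Postgres-style sketch of the tag-membership pattern described in this comment; the items table, tags column, and index name are invented:

```sql
CREATE TABLE items (
    item_id INT PRIMARY KEY,
    tags    TEXT[]   -- short array of small strings
);

-- GIN index so membership checks don't have to scan every row.
CREATE INDEX items_tags_gin ON items USING GIN (tags);

-- "Does this row carry the 'urgent' tag?" -- containment can use the GIN index.
SELECT item_id FROM items WHERE tags @> ARRAY['urgent'];
```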
How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays, by indexing, via unnest, or something else?
I mainly use these fields with queries. When I have to use the data from an array field, I normally parse it in the application.
How large do you expect your arrays to become (either in elements or in data)?
I have applications with array field sizes ranging up to 20,000 elements.
A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing? Or GIN indexing? If the former, I'm interested in what kind of use-cases you have for that.
GIN indexing does it for me
What's the appeal of using arrays for you, vs. using a normalized table?
The main use case is application-level ACL - arrays allow row-level permissions to be managed in a flexible and performant manner. They would also work better than joins in a distributed setup, I imagine.
I'll join in, why not? :)
How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays, by indexing, via unnest, or something else?
Querying using ANY.
How large do you expect your arrays to become (either in elements or in data)?
Not unbounded, rarely more than a list of a dozen IDs or so.
A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing? Or GIN indexing? If the former, I'm interested in what kind of use-cases you have for that.
I'm not at the scale where that matters... yet ;)
What's the appeal of using arrays for you, vs. using a normalized table?
I'm not a 3rd normal form queen, just as long as I don't have duplicate data. I'm actually transitioning from NoSQL, so even just having a schema is a step in the traditional route. In these tables I'm storing versions of things, and the relationships are part of the versioning, so having to join over the different versions in a different table... it'd be like (parent_kind, parent_id, parent_version_id, child_kind... etc) Not a pleasant query to do every time I want to read, which is way way more common than write.
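A tiny sketch of the pattern just described, assuming a Postgres-style INT[] column; the things table and related_ids column are made up for illustration:

```sql
-- Rows keep their related IDs inline in a small array instead of a link table.
CREATE TABLE things (
    thing_id    INT PRIMARY KEY,
    related_ids INT[]   -- usually a dozen IDs at most
);

-- "Does this row reference thing 42?"
SELECT thing_id FROM things WHERE 42 = ANY (related_ids);
```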
I would say though that jsonb support is the bigger blocker at this point.
I would say though that jsonb support is the bigger blocker at this point
I would agree with this. jsonb could also be used for arrays.
How do you plan to (or how do you currently with Postgres) handle accessing the data inside the arrays, by indexing, via unnest, or something else?
unnest, plus the array operators array_cat, array_append, array_remove, array_length, etc. I wouldn't be replacing entire arrays (see the sketch after this comment).
How large do you expect your arrays to become (either in elements or in data)?
Arbitrarily large, perhaps up to the row size limit.
A couple of you mentioned indexing - are you referring to normal (non-GIN) indexing? Or GIN indexing? If the former, I'm interested in what kind of use-cases you have for that.
I haven't had a use case that would benefit from indexing array columns yet.
What's the appeal of using arrays for you, vs. using a normalized table?
Better read performance and fewer joins, plus more flexible data modelling.
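A hedged sketch of the unnest and array-operator usage mentioned in this comment; the posts table and labels column are hypothetical:

```sql
CREATE TABLE posts (
    post_id INT PRIMARY KEY,
    labels  TEXT[]
);

-- Expand the array into rows, e.g. to count label usage across posts.
SELECT label, count(*)
  FROM posts, unnest(labels) AS label
 GROUP BY label;

-- Modify arrays in place rather than replacing the whole value.
UPDATE posts SET labels = array_append(labels, 'archived') WHERE post_id = 1;
UPDATE posts SET labels = array_remove(labels, 'draft')    WHERE post_id = 1;
SELECT array_length(labels, 1) FROM posts WHERE post_id = 1;
```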
Thanks so much all of you for your input! It's much appreciated.
I haven't had a use case that would benefit from indexing array columns yet.
A simple, and probably common, scenario would be a standard MIME email that has been broken out into parts so it's indexed/searchable.
Such a domain model would (usually) contain a small number of TO and CC headers representing a (typically) small number of recipients.
In this case, it's quite convenient to be able to have a native array type in a column that is indexed so you can search for a specific email address. In Postgres, this is easily accomplished with a text[] column, a GIN index, and array operators while querying (which GIN indexes support; see https://www.postgresql.org/docs/9.6/static/indexes-types.html).
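A small sketch of that scenario in Postgres syntax; the email_parts table and index name are invented for illustration:

```sql
CREATE TABLE email_parts (
    message_id UUID PRIMARY KEY,
    to_addrs   TEXT[],   -- small number of TO recipients
    cc_addrs   TEXT[],   -- small number of CC recipients
    body       TEXT
);

CREATE INDEX email_to_gin ON email_parts USING GIN (to_addrs);

-- Find every message addressed to a given recipient; the @> containment
-- operator can use the GIN index.
SELECT message_id FROM email_parts WHERE to_addrs @> ARRAY['alice@example.com'];
```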
Going to close this since the only outstanding issue is not blocking array usage and has its own issue.
Going to close this since the only outstanding issue is not blocking array usage and has its own issue.
@justinj It wasn't immediately obvious to me when checking this issue's updates that the feature was actually implemented and is part of 1.1 (currently in beta). Just noting that for anyone else looking here as well.
= ANY(array) like for IN expressions #13593