FeatureBaseDB / featurebase

A crazy fast analytical database, built on bitmaps. Perfect for ML applications. Learn more at: http://docs.featurebase.com/. Start a Docker instance: https://hub.docker.com/r/featurebasedb/featurebase
https://www.featurebase.com
Apache License 2.0
2.53k stars 231 forks source link

Row/Col Terminology #262

Closed travisturner closed 7 years ago

travisturner commented 7 years ago

Creating this issue as a place to discuss and decide upon the proposed Row/Col terminology change.

For each of the query pairs below, the first query represents the current PQL, and the second query represents the proposed change to Row/Col terminology.

Labels of type <key_> below are to be decided.

Bitmap

Bitmap(id=1, frame="f")
{"results":[{"attrs":{},"bits":[200]}]}
Row(id=1, frame="f")
{"results":[{"attrs":{},"<key1>":[200]}]}

SetBit

SetBit(id=1, frame="f", profileID=200)
{"results":[true]}
Set(row=1, frame="f", col=200)
{"results":[true]}

Intersect/Union

Intersect(Bitmap(id=1, frame="f"), Bitmap(id=2, frame="f"))
{"results":[{"attrs":{},"bits":[200]}]}
Intersect(Row(id=1, frame="f"), Row(id=2, frame="f"))
{"results":[{"attrs":{},"<key2>":[200]}]}

TopN

TopN(Bitmap(id=1, frame="f"), frame="b.n", n=10)
{"results":[[{"key":1,"count":1},{"key":2,"count":1},...]]}
TopN(Row(id=1, frame="f"), frame="b.n", n=10)
{"results":[[{"key":1,"count":1},{"key":2,"count":1},...]]}

New Functions

Col(id=200, frame="f")
{"results":[{"attrs":{},"<key3>":[1]}]}

Intersect(Col(id=200, frame="f"), Col(id=201, frame="f"))
{"results":[{"attrs":{},"<key4>":[1]}]}
yuce commented 7 years ago

IMO Profile (although specific to Umbel) conveys the purpose better than Col (which tells about the layout of the data). Would it make sense to use something similar to Item (instead of Col/Profile) and Tag (or Property) (instead of Row/Bitmap`) ?

jaffee commented 7 years ago

Let me throw a wrench into the mix:

Bitmap:

Get(y=1, frame='f') 
{"results":[{"attrs":{},"row":[200]}]}

Setbit:

Set(x=200, y=1, frame='f')
{"results":[true]}

Intersect:

Intersect(Get(y=1, frame="f"), Get(y=2, frame="f"))
{"results":[{"attrs":{},"row":[200]}]}

TopN:

TopN(Get(y=1, frame="f"), frame="b.n", n=10)
{"results":[[{"y":1,"count":1},{"y":2,"count":1},...]]}

New:

Get(x=200, frame="f")
{"results":[{"attrs":{},"column":[1]}]}

Intersect(Get(x=200, frame="f"), Get(x=201, frame="f"))
{"results":[{"attrs":{},"column":[1]}]}

Every bit in pilosa has an x value and y value. (X value increases out to the right, Y value increases as you move down (or up... it's arbitrary)) When you are calling Bitmap(id=5), you're really saying "Get me all set bits that have y value equal to 5." The result is the description of a row in the matrix. A column query can still use Get, but with x=5 which will return a bunch of different y values.

In the return values, we could say xs and ys instead of row and column.

benbjohnson commented 7 years ago

@yuce IMO I think it's reasonable to convey the layout of the data for a database.

yuce commented 7 years ago

@benbjohnson items_attrs I don't remember working with any database which I had to refer to the layout of data in a query. It would be great if you could give some examples.

For me, what Set(item=5, tag=10) does is more clear than Set(col=10, row=5). Why would I have to know whether tags (or properties, or attributes) are stored in rows or columns (the same applies to using x and y too. What do they refer to?) Apart from making understanding queries harder, it makes making errors easy (Imagine someone wrote Set(col=10, row=5) in their code, would you be able to easily tell that the query is correct?).

I've checked the Pilosa docs I have to see whether I'm missing something and in "Pilosa - Capabilitlies v0.5.pdf" I saw that Items/Attributes (or Tags) were already used (I've attached a screenshot).

Having said that, maybe coming up with an SQL-like query language would be beneficial (maybe in a future version?) As far as I see all databases which allow complex queries try to employ an SQL-like syntax. A few quick ideas are below:

Bitmap

Bitmap(id=1, frame="f")

# select items with tag 1
select item from f where tag = 1

# select tags of item 5
select tag from f where item = 5

SetBit

SetBit(id=1, frame="f", profileID=200)

update f set tag = 1 where item = 200

Intersect/Union

Intersect(Bitmap(id=1, frame="f"), Bitmap(id=2, frame="f"))

select item from f where tag = 1 intersect select item from f where tag = 2

yuce commented 7 years ago

To clarify, my point is not the terms item and tag are the best; we could use e.g., object, thing, or even profile to denote individual records; and attribute or property to denote features of a record. Just that item and tag happens to be short and sufficiently informative (I think).

benbjohnson commented 7 years ago

@yuce SQL requires you to think of the tabular format regardless of the actual verbiage used in SQL. Relational databases layer a schema over the format so you refer to columns by name but conceptually I don't think it's different than referencing "col1", "col2", etc.

I think my main issue is that the distinction of items & attributes is arbitrary in Pilosa. Right now we view a Profile as having an attribute but the relationship between rows and columns might not be "ownership". For example, you could track users in a system who have sent a message to another user. In that case, both rows and columns would represent users. One is not necessarily an "item" and one an "attribute".

Another issue that comes up from the "item" / "attribute" naming scheme is that we already use the term "attributes" to describe the map of fields associated.

I find the term "tag" confusing as an option because I feel like it's typically used for a list of string tags on objects similar to hashtags.

yuce commented 7 years ago

@benbjohnson (I don't want to prolong the SQL discussion, and I respect your opinion. But widespread use of SQL-like query languages show that it is not constrained to the table metaphor. I think there is a good reason that there is no (widely used) relational database which forces the user to use col1, col2, ... instead of name, birthdate, ...)

I think your example makes my point much clearer. Assuming we have the following in our database:

Set(row=5, frame="emailed", col=10)

What happened? Did 5 sent an email to 10 or vice versa? Or both? How about this:

Set(col=5, frame="emailed", row=10)

I am not very clever and stayed late last night. It takes a few seconds for me to forget which is which.

Even if the row and column store the same kind of data (e.g., user ID) they are clearly not equivalent(1). We store different things in the row and column, namely the subject (or record, item, entity, etc.) and the detail (or feature, property, tag)(2). The problem is deciding whether the row is the subject or the feature, or vice versa in a straightforward and natural way.

How about the following? Isn't the purpose much clearer since it labels the record (or subject) and feature (or detail)?

Set(item=5, frame="emailed", tag=10)

Ah, 5 sent an email to 10. Maybe 10 replied back:

Set(item=10, frame="emailed", tag=5)

It's easy to remember that item is the subject and tag is the object. Changing order of parameters doesn't affect readability as much:

Set(tag=10, frame="emailed", item=5)

Since in English the sentence begins with the subject, I would guess most people would start the query with item instead of tag. So it's easy and natural to have a convention about the order of params.

Tags reminds me a set of things, probably strings (but doesn't have to be). Let's assume John is from the US, was born in 1988-02-01 and he emailed Jane; John item (or entity) would have the following tags:

Jane is also from the US, works in the finance industry and has 2 children:

A query to get who lives in the US? becomes: which entities have the country:US tag?

Like I mentioned in another comment, my point is not arbitrarily replacing col and row with item and tag, but clearly conveying the difference between the whole (record, subject, ...) and parts of the whole (detail, feature, ...); we could just use other words.


(*1) Well, we could treat row and column represent exactly the same thing, so having (5, 10) = true relation would mean: 5 sent an email to 10 and also 10 sent an email to 5; but that could easily be achieved by adding (10, 5) = true when we treat row and column are different things.

(*2) The only thing we would store in Pilosa that I can think of which doesn't have the subject/detail relation is a matrix of booleans to be used in some mathematical computation.

travisturner commented 7 years ago

While I can see the value of using terms that guide the direction, and thus usage, of the relationship—as in object has traits—the underlying data structure does not necessarily map to a specific direction. Instead, the direction determines the kinds of queries Pilosa can efficiently perform. So if you put object on the x-axis and trait on the y-axis (like Umbel's use case), you can answer questions like:

Intersect(Get(trait-10), Get(trait-20))

returns: list of objects sharing traits 10 and 20

But what if your use case required that you return a list of traits common to objects 1 and 2? In that case you would need to put object on the y-axis and trait on the x-axis. If we had enforced a direction in the terminology and decided that the x-axis would always represent objects, then you as a user are now going to have to refer to object as trait and visa versa just to support the query that you want.

Considering this, if we did want to support something like an object/trait schema, I would see that implemented as an abstraction layer on top of a more generic data structure like row/col or x/y. This would allow the implementation to determine whether to store its object on the x or y axis. With that said, I don't think that's an abstraction that we want to take on right now.

As for @jaffee's suggestion of using x and y, I'm not strongly drawn to that, but I am interested in the way it lends itself to supporting an additional axis (i.e. the z-axis) without having to change much. This would be a possible future roadmap item where the query

Set(x=1, y=2, z=3, frame="f")

would effectively store:

Set(x=1, y=2, frame="f::plane-xy")
Set(x=1, y=3, frame="f::plane-xz")
Set(x=2, y=3, frame="f::plane-yz")

I haven't put a lot of thought into this, and I don't really want to get sidetracked too much by this right now, but it makes for a stronger x/yargument (i.e. what would a third dimension after row/col be called? perhaps page or sheet?)

Regarding SQL, I don't think SQL necessarily lends itself well to pilosa-type queries, and this is another example where I would rather leave that up to someone interested in writing a SQL abstraction layer on top of the Pilosa API. There are several cases I can imagine where SQL would get particularly confusing or cumbersum:

To use @yuce's example:

# select items with tag 1
select item from f where tag = 1;

how would you specify that you want to return items containing both tags 1 and 2?

select item from f where tag = 1 and tag = 2;

seems really confusing. I would wonder why tag = 1 and tag = 2 doesn't evalute to false.

Bitwise-operation queries that cross frames become confusing if you try to think of them like a JOIN:

Intersect(Row(1 frame="f"), Row(2, frame="f"), Row(300, frame="g"))
select cols from f, g where f.row[1] = true and f.row[2] = true and g.row[300] = true;

It's not clear to me which frame cols should apply to there, which makes sense because cols aren't specific to a single frame in pilosa the way columns are specific to a table in a relational database.

With all that said, I'm still in favor of row/col, but as I mentioned above, I'm intrigued a bit by x/y. Still very open to other suggestions, but so far I haven't heard anything overwhelmingly convincing.

jaffee commented 7 years ago

@travisturner I agree x/y isn't great, however something that came out of me going through that mental exercise are the keys for the results. In your original example, key1 and key2 would be row, and key3 and key4 would be col (or column).

So a Row query returns a row, and a Col query returns a column. Intersect, Union, or Difference operating on rows or columns returns a row or a column respectively.

What could be simpler?

yuce commented 7 years ago

@travisturner The example you've given illustrates the perils of using the structure of the underlying data, instead of easy-to-understand relationship labels. Let's suppose Pilosa only supports row/cols or x/ys and we've modeled the database by putting the objects on the x-axis and traits on the y-axis. Later it turned out that that decision yielded poor performance and we had to inverse the axis. How do we do that? We have to change all code so rows become cols and vice versa AND create a migration.

If we had labeled the data appropriately, like Set(object=5, trait=10) we wouldn't have to change anything in the code; we would probably get away with running a migration which swapped the row and col; and change a setting in the database which would flip object and trait.

Having more than 2 dimensions is a great idea, and it strengthens the idea that we should be using labels instead of axis names. What axis names we would use if we supported 10 dimensions?

benbjohnson commented 7 years ago

Another option that would solve this whole discussion of semantics is to simply require that frames must be explicitly created with a schema and the end user can define the names of the dimensions themselves. Then Umbel could still use Set(frame=d, profile=10, bit=20) because profile and bit were declared as dimensions in t he schema.

As nice as it is to write willy-nilly to frames and have them automatically created, I think that explicitly creating databases and frames is a better direction long term. For example, if someone accidentally types the wrong db or frame in a Set() query then they'll create a new db/frame. It also gives us a place to set options like the time quantum instead of trying to determine it at write-time.

travisturner commented 7 years ago

Closing this ticket since the schema work that @benbjohnson is working on removes the need for row/col decision.