Closed travisturner closed 7 years ago
IMO Profile
(although specific to Umbel) conveys the purpose better than Col
(which tells about the layout of the data). Would it make sense to use something similar to Item
(instead of Col
/Profile
) and Tag
(or Property
) (instead of Row
/Bitmap`) ?
Let me throw a wrench into the mix:
Bitmap:
Get(y=1, frame='f')
{"results":[{"attrs":{},"row":[200]}]}
Setbit:
Set(x=200, y=1, frame='f')
{"results":[true]}
Intersect:
Intersect(Get(y=1, frame="f"), Get(y=2, frame="f"))
{"results":[{"attrs":{},"row":[200]}]}
TopN:
TopN(Get(y=1, frame="f"), frame="b.n", n=10)
{"results":[[{"y":1,"count":1},{"y":2,"count":1},...]]}
New:
Get(x=200, frame="f")
{"results":[{"attrs":{},"column":[1]}]}
Intersect(Get(x=200, frame="f"), Get(x=201, frame="f"))
{"results":[{"attrs":{},"column":[1]}]}
Every bit in pilosa has an x value and y value. (X value increases out to the right, Y value increases as you move down (or up... it's arbitrary)) When you are calling Bitmap(id=5)
, you're really saying "Get me all set bits that have y value equal to 5." The result is the description of a row in the matrix. A column query can still use Get
, but with x=5
which will return a bunch of different y values.
In the return values, we could say xs
and ys
instead of row and column.
@yuce IMO I think it's reasonable to convey the layout of the data for a database.
@benbjohnson I don't remember working with any database which I had to refer to the layout of data in a query. It would be great if you could give some examples.
For me, what Set(item=5, tag=10)
does is more clear than Set(col=10, row=5)
. Why would I have to know whether tags (or properties, or attributes) are stored in rows or columns (the same applies to using x
and y
too. What do they refer to?) Apart from making understanding queries harder, it makes making errors easy (Imagine someone wrote Set(col=10, row=5)
in their code, would you be able to easily tell that the query is correct?).
I've checked the Pilosa docs I have to see whether I'm missing something and in "Pilosa - Capabilitlies v0.5.pdf" I saw that Items/Attributes (or Tags) were already used (I've attached a screenshot).
Having said that, maybe coming up with an SQL-like query language would be beneficial (maybe in a future version?) As far as I see all databases which allow complex queries try to employ an SQL-like syntax. A few quick ideas are below:
Bitmap(id=1, frame="f")
# select items with tag 1
select item from f where tag = 1
# select tags of item 5
select tag from f where item = 5
SetBit(id=1, frame="f", profileID=200)
update f set tag = 1 where item = 200
Intersect(Bitmap(id=1, frame="f"), Bitmap(id=2, frame="f"))
select item from f where tag = 1 intersect select item from f where tag = 2
To clarify, my point is not the terms item and tag are the best; we could use e.g., object, thing, or even profile to denote individual records; and attribute or property to denote features of a record. Just that item and tag happens to be short and sufficiently informative (I think).
@yuce SQL requires you to think of the tabular format regardless of the actual verbiage used in SQL. Relational databases layer a schema over the format so you refer to columns by name but conceptually I don't think it's different than referencing "col1"
, "col2"
, etc.
I think my main issue is that the distinction of items & attributes is arbitrary in Pilosa. Right now we view a Profile as having an attribute but the relationship between rows and columns might not be "ownership". For example, you could track users in a system who have sent a message to another user. In that case, both rows and columns would represent users. One is not necessarily an "item" and one an "attribute".
Another issue that comes up from the "item" / "attribute" naming scheme is that we already use the term "attributes" to describe the map of fields associated.
I find the term "tag" confusing as an option because I feel like it's typically used for a list of string tags on objects similar to hashtags.
@benbjohnson (I don't want to prolong the SQL discussion, and I respect your opinion. But widespread use of SQL-like query languages show that it is not constrained to the table metaphor. I think there is a good reason that there is no (widely used) relational database which forces the user to use col1, col2, ...
instead of name, birthdate, ...
)
I think your example makes my point much clearer. Assuming we have the following in our database:
Set(row=5, frame="emailed", col=10)
What happened? Did 5
sent an email to 10
or vice versa? Or both? How about this:
Set(col=5, frame="emailed", row=10)
I am not very clever and stayed late last night. It takes a few seconds for me to forget which is which.
Even if the row and column store the same kind of data (e.g., user ID) they are clearly not equivalent(1). We store different things in the row and column, namely the subject (or record, item, entity, etc.) and the detail (or feature, property, tag)(2). The problem is deciding whether the row is the subject or the feature, or vice versa in a straightforward and natural way.
How about the following? Isn't the purpose much clearer since it labels the record (or subject) and feature (or detail)?
Set(item=5, frame="emailed", tag=10)
Ah, 5
sent an email to 10
. Maybe 10
replied back:
Set(item=10, frame="emailed", tag=5)
It's easy to remember that item
is the subject and tag
is the object. Changing order of parameters doesn't affect readability as much:
Set(tag=10, frame="emailed", item=5)
Since in English the sentence begins with the subject, I would guess most people would start the query with item
instead of tag
. So it's easy and natural to have a convention about the order of params.
Tags reminds me a set of things, probably strings (but doesn't have to be). Let's assume John is from the US, was born in 1988-02-01 and he emailed Jane; John
item (or entity) would have the following tags:
country:US
,birthdate:1988-02-01
,emailed:Jane
.Jane is also from the US, works in the finance industry and has 2 children:
country:US
,industry:finance
,children:2
.A query to get who lives in the US? becomes: which entities have the country:US
tag?
Like I mentioned in another comment, my point is not arbitrarily replacing col
and row
with item
and tag
, but clearly conveying the difference between the whole (record, subject, ...) and parts of the whole (detail, feature, ...); we could just use other words.
(*1) Well, we could treat row and column represent exactly the same thing, so having (5, 10) = true
relation would mean: 5 sent an email to 10 and also 10 sent an email to 5; but that could easily be achieved by adding (10, 5) = true
when we treat row and column are different things.
(*2) The only thing we would store in Pilosa that I can think of which doesn't have the subject/detail relation is a matrix of booleans to be used in some mathematical computation.
While I can see the value of using terms that guide the direction, and thus usage, of the relationship—as in object has traits
—the underlying data structure does not necessarily map to a specific direction. Instead, the direction determines the kinds of queries Pilosa can efficiently perform. So if you put object
on the x-axis and trait
on the y-axis (like Umbel's use case), you can answer questions like:
Intersect(Get(trait-10), Get(trait-20))
returns: list of objects sharing traits 10 and 20
But what if your use case required that you return a list of traits common to objects 1 and 2? In that case you would need to put object
on the y-axis and trait
on the x-axis. If we had enforced a direction in the terminology and decided that the x-axis would always represent objects
, then you as a user are now going to have to refer to object
as trait
and visa versa just to support the query that you want.
Considering this, if we did want to support something like an object
/trait
schema,
I would see that implemented as an abstraction layer on top of a more generic data structure like row
/col
or x
/y
. This would allow the implementation to determine whether to store its object
on the x
or y
axis. With that said, I don't think that's an abstraction that we want to take on right now.
As for @jaffee's suggestion of using x
and y
, I'm not strongly drawn to that, but I am interested in the way it lends itself to supporting an additional axis (i.e. the z-axis) without having to change much. This would be a possible future roadmap item where the query
Set(x=1, y=2, z=3, frame="f")
would effectively store:
Set(x=1, y=2, frame="f::plane-xy")
Set(x=1, y=3, frame="f::plane-xz")
Set(x=2, y=3, frame="f::plane-yz")
I haven't put a lot of thought into this, and I don't really want to get sidetracked too much by this right now, but it makes for a stronger x
/y
argument (i.e. what would a third dimension after row
/col
be called? perhaps page
or sheet
?)
Regarding SQL, I don't think SQL necessarily lends itself well to pilosa-type queries, and this is another example where I would rather leave that up to someone interested in writing a SQL abstraction layer on top of the Pilosa API. There are several cases I can imagine where SQL would get particularly confusing or cumbersum:
To use @yuce's example:
# select items with tag 1
select item from f where tag = 1;
how would you specify that you want to return items containing both tags 1 and 2?
select item from f where tag = 1 and tag = 2;
seems really confusing. I would wonder why tag = 1 and tag = 2
doesn't evalute to false
.
Bitwise-operation queries that cross frames become confusing if you try to think of them like a JOIN:
Intersect(Row(1 frame="f"), Row(2, frame="f"), Row(300, frame="g"))
select cols from f, g where f.row[1] = true and f.row[2] = true and g.row[300] = true;
It's not clear to me which frame cols
should apply to there, which makes sense because cols
aren't specific to a single frame
in pilosa the way columns are specific to a table in a relational database.
With all that said, I'm still in favor of row
/col
, but as I mentioned above, I'm intrigued a bit by x
/y
. Still very open to other suggestions, but so far I haven't heard anything overwhelmingly convincing.
@travisturner I agree x/y isn't great, however something that came out of me going through that mental exercise are the keys for the results. In your original example, key1
and key2
would be row
, and key3
and key4
would be col
(or column
).
So a Row
query returns a row
, and a Col
query returns a column
. Intersect, Union, or Difference operating on rows or columns returns a row
or a column
respectively.
What could be simpler?
@travisturner The example you've given illustrates the perils of using the structure of the underlying data, instead of easy-to-understand relationship labels. Let's suppose Pilosa only supports row/cols or x/ys and we've modeled the database by putting the objects on the x-axis and traits on the y-axis. Later it turned out that that decision yielded poor performance and we had to inverse the axis. How do we do that? We have to change all code so rows become cols and vice versa AND create a migration.
If we had labeled the data appropriately, like Set(object=5, trait=10)
we wouldn't have to change anything in the code; we would probably get away with running a migration which swapped the row and col; and change a setting in the database which would flip object and trait.
Having more than 2 dimensions is a great idea, and it strengthens the idea that we should be using labels instead of axis names. What axis names we would use if we supported 10 dimensions?
Another option that would solve this whole discussion of semantics is to simply require that frames must be explicitly created with a schema and the end user can define the names of the dimensions themselves. Then Umbel could still use Set(frame=d, profile=10, bit=20)
because profile
and bit
were declared as dimensions in t he schema.
As nice as it is to write willy-nilly to frames and have them automatically created, I think that explicitly creating databases and frames is a better direction long term. For example, if someone accidentally types the wrong db or frame in a Set()
query then they'll create a new db/frame. It also gives us a place to set options like the time quantum instead of trying to determine it at write-time.
Closing this ticket since the schema work that @benbjohnson is working on removes the need for row/col decision.
Creating this issue as a place to discuss and decide upon the proposed Row/Col terminology change.
For each of the query pairs below, the first query represents the current PQL, and the second query represents the proposed change to Row/Col terminology.
Labels of type
<key_>
below are to be decided.Bitmap
SetBit
Intersect/Union
TopN
New Functions