rethinking `:*` - Githubissues

sasa1977 commented 7 years ago

@sebastian @obrok @cristianberneanu

I have stumbled upon an interesting subtle issue. The symptom is that Tableau thinks that integer uid columns are textual. After some investigation, it turned out that the reason is straightforward. Tableau will issue select * from table. We'll expand this into select all columns, and then replace uid with :* as an optimization step.

The problem is that by doing this, we're possibly changing the return type. If the uid column is e.g. integer, we'll return a textual result. Even the query features will incorrectly specify that the result is textual. I'm pretty certain that this is the reason why Tableau thinks that an integer uid column is in fact textual.

While I could likely invent some simple hack to fix the immediate issue, this raises a larger question. By returning :* we're implicitly changing the resulting type. The issue is not only related to uid columns in the select list. For example, result of select int_column from bar can be an integer, or it can be textual (:*), or even mixed. I believe this might cause issues with various clients.

Even for textual columns the result might be misleading if * is a normal possible value in some column. What does * in the result mean then? A masked value, or a valid one?

Given all this, I wonder if using NULL would be a better choice instead of :* to mask values? Thoughts?

obrok commented 7 years ago

What we want from * is similar to NULL in that it is a member of all types. The problem is that we're (in the output) mixing "*" and :* because there's no SQL syntax for :*.

However just using NULL has the same issue as "*" in that it might be a valid value with some meaning in the domain. Maybe it should be configurable per connection?

sasa1977 commented 7 years ago

Yeah, the problem is that what we want AFAIK doesn't exist in SQL (except for NULL) :-) Therefore, I think NULL is our best bet, because if nothing else, it is inline with type guarantees.

However just using NULL has the same issue as "*" in that it might be a valid value with some meaning in the domain. Maybe it should be configurable per connection?

This is a valid point, and a viable option, but I don't like it, because we now have two semantics (:* and NULL).

Maybe a better solution would be to support functions which can be used to distinguish between a NULL value which is a result of masking and a valid NULL:

SELECT foo, is_masked(foo) from bar
-- NULL, true
-- NULL, false
-- 42, false

SELECT foo, bar, is_masked() from bar
-- NULL, NULL, true
-- NULL, NULL, false
-- NULL, 42, false
-- 42, NULL, false

sebastian commented 7 years ago

On channels where we can provide more context (i.e. web interface and API) it would be good to explicitly mark values that are anonymized away. Then we could render them distinctly.

Does using NULL as a placeholder solve the issue you see in Tableau? From what I gather it does the SELECT *-query in order to discover the types? If so it seems NULL doesn't help?

sasa1977 commented 7 years ago

NULL could help, because NULL can be of any type, whereas * can only be string. However, I already fixed the Tableau issue in a slightly different way.

Regardless, I still think that * is a hack, more than a proper solution. If we want to provide some context, than this shouldn't be through magical field values, but rather as additional metadata.

cristianberneanu commented 7 years ago

Overloading null is also a hack. Keeping * is more friendly to the GUI users as they can see the difference clearly. In the end, it is about prioritizing what makes sense in one channel over the other.

sebastian commented 7 years ago

But it doesn't need to be the same across channels? If we provide metadata along with the value, then we can render it distinctly in the web interface, and provide proper context in the API where we are in full control, yet still play around with what is a sensible value for the PSQL interface?

sasa1977 commented 7 years ago

There are two challenges we need to solve here:

What do we return in place of masked data?
How do we distinguish masked data from normal one?

Currently we're solving both using the same trick: the magical * value. I believe that this leads to a hacky solution for both cases.

The problem in the first challenge is that we're returning polymorphic values. Value of the same column might be a string or e.g. an integer. It's worth keeping in mind that people might also use our REST API, so they will need to handle this case, and we need to properly document it. This is IMO needlessly confusing. Therefore, I believe that NULL is the most consistent thing we can return here, because we're not breaking SQL type semantics, and because NULL can appear in any column anyway. An advantage of this approach is that there are no variations between different channel. Regardless of the client type, we're always returning the same data for the same query over the same input.

When it comes to distinguishing masked data, as I said, * isn't a good indicator, because it can also be a valid field value. Admittedly, this is also true for NULL, or any other value we can think of. Therefore, using any magical value to indicate masked data is a hack, and I believe we need to have a different way of indicating masked data. That could be additional metadata attached to each returned row, and/or special functions (is_masked) which can be used to reliably determine whether data is masked or not. A dedicated masked metadata can then be reliably used in different channels (e.g. web UI) to visually indicate masked data. This will be more correct than relying on *.

Aircloak / aircloak

rethinking `:*` #1378