Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform
2 stars 0 forks source link

capitalization in mongodb #2577

Open sasa1977 opened 6 years ago

sasa1977 commented 6 years ago

It appears that in mongodb 3.4 the capitalization is different from other databases, and incorrect:

1) test upper(<col>) on input name in a sub-query on users (:"Elixir.Compliance.StringFunctions.upper(<col>).Test")
     test/compliance/string_functions_test.exs:40
     Inconsistent query results:
     Group 1:
       'Cloak.DataSource.MongoDB//mongodb3.4_encoded'
       'Cloak.DataSource.MySQL/mysql/mysql5.7'
       'Cloak.DataSource.PostgreSQL/postgresql/postgresql9.4'
       'Cloak.DataSource.PostgreSQL/postgresql/postgresql9.4_encoded'
     Group 2:
       'Cloak.DataSource.MongoDB//mongodb3.4'
     Query:
       SELECT
         output
       FROM (
         SELECT
           user_id,
           upper("name") as output
         FROM users
         ORDER BY 1, 2
       ) table_alias
       ORDER BY output

...
left: [... %{occurrences: 8, row: ["ANNA JURIĆ"], users_count: 8}, ...],
right: [..., %{occurrences: 8, row: ["ANNA JURIć"], users_count: 8},  ,...]
sasa1977 commented 6 years ago

It looks like the problem here is that projections such as toLower and toUpper work only on ASCII characters. See here for example.

This could be solved by emulating such functions, or alternatively we could map each document in mongodb with something like:

db.coll.find().map(function(doc){
  doc.field = doc.field.toUpperCase(); 
  return doc;
})

In the meantime, I think we shouldn't include mongodb in these tests (upper and lower).

cc @cristianberneanu

cristianberneanu commented 6 years ago

Unfortunately, mongo lacks severely in the provided functions department.

Either we ignore the issue for now, or, like you said, mark the functions as unsupported, forcing emulation. Although that will make the TeamBank queries even slower. We could also expose different Unicode and ASCII versions, but I am not sure it is worth it.

sasa1977 commented 6 years ago

Although that will make the TeamBank queries even slower.

The problem is that currently these functions are working correctly only for ascii characters.

We could also expose different Unicode and ASCII versions, but I am not sure it is worth it.

I was thinking about this too, and I think it's an idea worth considering. Beyond mongodb, I've seen some other cases, where results of unicode functions such as upper or lower differ between different databases (see #2578). I think it's quite hard to ensure consistent behaviour here. The problem becomes worse if the same function is invoked as emulated in one part of the query, and not emulated in another.

So I wonder if we should expose consistent versions of these functions, such as nupper and nlower. These functions would always be emulated, and therefore we'd get consistent results. The documentation should explain the trade-off of performance vs correctness and consistency.

WDYT?

Let's also ping @obrok and @sebastian for additional thoughts here.

cristianberneanu commented 6 years ago

The problem is that currently these functions are working correctly only for ascii characters.

Is there still a bug if the all the customer's data is ASCII?

obrok commented 6 years ago

So I wonder if we should expose consistent versions of these functions, such as nupper and nlower.

I like this option

sasa1977 commented 6 years ago

Is there still a bug if the all the customer's data is ASCII?

No, AFAIK, everything works fine on ascii characters.

sebastian commented 6 years ago

Specialized functions that end up with emulation in those databases that don't handle it correctly out of the box sounds a bit like a cop-out, but also like a pragmatic and decent solution. In other words, I am for it.

sebastian commented 6 years ago

This seems like a larger change. Moving it away from the current milestone.

obrok commented 5 years ago

@sebastian This is not in 19.3, but it seems like it would solve #3109 that is in there. Perhaps we should do this instead?

sebastian commented 5 years ago

Yes, agreed. Adding to milestone.

fjab commented 5 years ago

@obrok & @cristianberneanu is this currently a problem at TeamBank? If not, should be moved down the road (which is painful since it's so old already).

cristianberneanu commented 5 years ago

I don't think they care. It is mostly needed so that are tests return consistent results between different backends. Moving to 19.4

fjab commented 5 years ago

Ok, thanks!

cristianberneanu commented 4 years ago

This should be closed or frozen, since we plan to drop MongoDB.