Can we arrange it so that at the point where the anonymization is being done, the uids will be sharded (i.e. all rows for a given uid will be on the same machine)?
I don't think we can make this assumption. We will have no control over how the data is sharded, unless a system is setup specifically for Aircloak use, which is unlikely to be the case.
If it is map-reduce, then I would think we can arrange to sort on uid at some point in the workflow, yes?
Are we talking mongo db here?
We are talking generic map reduce. Of course we can accumulate absolutely all values and reduce them at the very very end when we have a global view, but that is just as inefficient as loading all the data out of the environment and into the cloak, which is exactly what we have to try to avoid.
We need to manage to do some intermediate reductions, to constrain the required memory.
There is no way to do the UID mapping earlier, i.e. in the map phase? That seems weird....
If we can do our sampling early as well, then we can reduce load early....
We certainly have the UID when receiving the row, i.e. in the mapping phase. But we don't necessarily receive all the rows for a particular user at the same place, and at the same time.
We certainly have the UID when receiving the row, i.e. in the mapping phase.
Yes, so we do mapping of UIDs, and whatever else needs to be mapped (certain data fields that need to be grouped together), such that at the reduce phase, we have the UIDs properly sharded, and we can do part of the anonymizing at the reduce phase, and the rest in the cloak. Something like that?
In most cases intermediate reductions are needed as well. It isn't as clean and clear-cut as "map map map map reduce". It's more of a "map map merge map map merge map map merge reduce", where intermediate states are partially aggregated, and then merged, before a final reduction takes place.
It is unrealistic to assume that we will ever have all the data available at one time.
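To make that concrete, here is a minimal sketch (in Elixir; the module name and shapes are made up for illustration, not our actual cloak code) of what a mergeable intermediate state could look like for count/sum/avg, following the map/merge/reduce pattern described above:

```elixir
defmodule PartialSum do
  # Intermediate ("partial") state: {row_count, running_sum}
  def new, do: {0, 0}

  # map step: fold one row's value into a partial state
  def add({count, sum}, value), do: {count + 1, sum + value}

  # merge step: combine partial states produced on different nodes / at different times
  def merge({c1, s1}, {c2, s2}), do: {c1 + c2, s1 + s2}

  # final reduce: turn the merged state into the result
  def finalize({count, sum}), do: %{count: count, sum: sum, avg: sum / max(count, 1)}
end

# Each chunk of data is reduced to a partial state where it lives,
# and only the small partial states travel and get merged:
partials =
  Enum.map([[1, 2, 3], [4, 5]], fn chunk ->
    Enum.reduce(chunk, PartialSum.new(), &PartialSum.add(&2, &1))
  end)

partials
|> Enum.reduce(PartialSum.new(), &PartialSum.merge/2)
|> PartialSum.finalize()
# => %{count: 5, sum: 15, avg: 3.0}
```

The important property is that merge can be applied to partial states in any order and at any point in the workflow, so intermediate reductions can happen wherever the data happens to be.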
It is unrealistic to assume that we will ever have all the data available at one time.
I didn't suggest that we would. I'm only saying that we ought to be able to include UID mapping in the map process somewhere, so that when we reduce, on any given reduce machine we have all of any given user's data there. If we can manage this, then anonymization should be much easier.
So I think before we exclude that possibility, we should test it out in practice. Why not load the taxi database onto a map-reduce system, and play around to see what is possible?
@ColdSphinX is doing that as we speak.
We won't be able to control sharding. That is something I believe we can consider a fact. Nor will we be able to control query scheduling, or data placement in general. That is kind of the premise of these systems. Hence, I don't think we would ever know whether what we are seeing is the complete set of data returned to us for a user. This makes for all sorts of interesting challenges, obviously!
This all sounds very complex.
Except median and distinct_count, for the rest of the aggregators (count, sum, avg, min, max) we can do a per-user intermediate aggregation. But, in the final step, we still need to gather all the users and intermediate values for a bucket in a single location.
Assuming we select something like 128-256 bytes per user, a machine with 128 GB of RAM should be able to (optimistically) handle 500 million users (if not in Erlang/Elixir, then in C). Even a 100 million user limit should handle plenty of cases.
Personally, I would rather we re-write the aggregation routines in C and/or require 128-256 GB of RAM for the cloak rather than writing a distributed aggregation algorithm for each individual backend.
So there are 2 locations in which we aggregate data: per-user (for some functions) and per-bucket. The anonymous bucket needs to wait for all other buckets to be processed.
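As an illustration of those two aggregation locations, a small Elixir sketch for sum (the row and bucket shapes are assumptions, purely for illustration):

```elixir
# rows: {uid, bucket, value} -- shapes assumed for illustration
rows = [
  {"u1", "berlin", 10}, {"u1", "berlin", 5},
  {"u2", "berlin", 7},  {"u2", "hamburg", 3}
]

# Location 1: per-user intermediate aggregation (sums per user can be built
# on partial data and merged later, since per-user sums simply add up).
per_user =
  rows
  |> Enum.group_by(fn {uid, bucket, _v} -> {bucket, uid} end, fn {_uid, _b, v} -> v end)
  |> Map.new(fn {key, values} -> {key, Enum.sum(values)} end)
# => %{{"berlin", "u1"} => 15, {"berlin", "u2"} => 7, {"hamburg", "u2"} => 3}

# Location 2: per-bucket aggregation over the per-user values. This is the
# step that needs all users of a bucket in one place, so the anonymization
# can look at the full set of per-user contributions.
per_bucket =
  Enum.group_by(per_user, fn {{bucket, _uid}, _sum} -> bucket end, fn {_key, sum} -> sum end)
# => %{"berlin" => [15, 7], "hamburg" => [3]}
```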
But please consider also using brute-force to solve the problem (at least for first iteration of the product).
Personally, I would rather we re-write the aggregation routines in C and/or require 128-256 GB of RAM for the cloak rather than writing a distributed aggregation algorithm for each individual backend.
If we don't at the very least do pre-aggregation in the database, then we are going to end up reading all the required data again and again and again. It is going to make us incredibly slow! We already see this in the NYC database case. A query that takes 5s in pure database time, takes two orders of magnitude more time in the cloak (of course other factors play in here too).
But, in the final step, we still need to gather all the users and intermediate values for a bucket in a single location.
For min and max I am not sure if this is true?
We can do some bookkeeping of the intermediate highest and lowest (and some bounded tail amount of users), that can then be refined through intermediate aggregation (hand waving hand waving)...
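To make the hand waving a bit more concrete, a sketch (Elixir, illustrative only; the tail size and shapes are assumptions) of mergeable min/max bookkeeping that keeps a bounded tail of per-user extremes:

```elixir
defmodule PartialExtremes do
  # Keep only the N most extreme per-user values on each side; the exact tail
  # size needed by the anonymization is an assumption here.
  @tail_size 10

  def new, do: %{lowest: [], highest: []}

  # Fold in one per-user aggregate, e.g. {user_min_or_max, uid}
  def add(state, {value, uid}) do
    %{
      lowest: trim(:asc, [{value, uid} | state.lowest]),
      highest: trim(:desc, [{value, uid} | state.highest])
    }
  end

  # Merge partial states coming from different nodes
  def merge(a, b) do
    %{
      lowest: trim(:asc, a.lowest ++ b.lowest),
      highest: trim(:desc, a.highest ++ b.highest)
    }
  end

  defp trim(order, list), do: list |> Enum.sort(order) |> Enum.take(@tail_size)
end
```

Since each partial state stays bounded, these can be merged at any intermediate step without needing all the data in one place.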
But please consider also using brute-force to solve the problem (at least for first iteration of the product).
Yes, brute-force is OK, to a certain extent. In this case you consider brute-forcing to be using lots and lots of memory?
A query that takes 5s in pure database time, takes two orders of magnitude more time in the cloak
That is, mostly, a side effect of the platform used (Erlang VM) and the way our code is written (with no regard for performance). If the database can do it in 5s, we can probably also do it in under 10-15 seconds in C and/or with very carefully written code (we need to do more processing than the database does). Although, I am not looking forward to implementing that :).
For min and max I am not sure if this is true?
Possibly, but that is the easiest case. What do you propose we do for median or count_distinct?
That is, mostly, a side effect of the platform used (Erlang VM) and the way our code is written (with no regard for performance). If the database can do it in 5s, we can probably also do it in under 10-15 seconds in C and/or with very carefully written code (we need to do more processing than the database does). Although, I am not looking forward to implementing that :).
I am not sure that is the case, given that we have to move orders of magnitude more data between the machines... especially when we start getting to sharded multi-host db setups, where our single cloak becomes an IO bottleneck. We can solve that by having multiple cloaks, but then we are in the same distributed aggregation setup land again...
Possibly, but that is the easiest case. What do you propose we do for median or count_distinct?
Hehehehe, I avoided answering those for a reason :)
I basically agree with @cristianberneanu
The biggest benefit of a DB agnostic implementation is that we solve the problem once for all kinds of databases. That's quite a big win, not to mention that in Erlang/C we're Turing complete (so not bound by restriction of a DB query language), so we can handle practically any requirement.
I understand the traffic/memory concerns, but these might be handled by hardware requirements (lot of RAM + large dedicated bandwidth between DB and the cloak). Also, we didn't even focus on optimizing the current code, so it's too early to draw general conclusions from how the current system performs.
So I'd suggest focusing on Elixir first, and if that doesn't work out, consider something like Rust or Go for the job. DB specific optimization should be the last resort, because it's something we need to do per DB, meaning we'll need to implement the algorithm many times. That is going to be very demanding, so we should seriously explore other options first.
DB specific optimization should be the last resort, because it's something we need to do per DB, meaning we'll need to implement the algorithm many times. That is going to be very demanding, so we should seriously explore other options first.
If we use something like Drill, then it's one implementation in order to conquer most BigData solutions out there.
Man oh man, we can't keep on reading terabytes of data each time we want to run a query. I don't think more ram is going to solve this nicely, whatever way we look at it...
But sure, let's try to leverage our current implementation first.
Gang, I want to propose that there is never a reason for moving a lot of data into the cloak (or almost never).
If the data is huge because of lots of users, then we should be filtering out users before moving the data to the cloak.
If the data is huge because each user has lots of data, then the analyst should be applying the WHERE clause a little smarter, and maybe turning one huge query into a series of smaller queries, and doing post-cloak processing to tally up his numbers.
With a little pre-query intelligence, we can solve 99% of the problem right off the bat.
If we take this approach, then we can do the rest brute-force as @cristianberneanu is suggesting. Regarding that, I'd be amazed if there weren't some nice libraries/packages out there that are already fine-tuned for the sorts of things we need to do at anonymization. Namely, basic hash and array functions on what are often predictable data types...
maybe turning one huge query into a series of smaller queries, and doing post-cloak processing to tally up his numbers.
That's just bad user experience - if we can do this automatically we should and not force the user to fiddle with that.
If the data is huge because of lots of users, then we should be filtering out users before moving the data to the cloak.
What about a query like this?
SELECT region, count(*) FROM purchases
There is no sensible way for the user to filter here... yet we basically need to load all the data.
Or alternatively, what about
SELECT count(distinct regions) FROM purchases
Again we will need to load all the data... In this latter case, sampling might be OK, but I am still not fully sold on the idea of sampling data.
And user experience is in fact quite a big deal. We really should not force the user to jump through arbitrary hoops to get to sensible results. That's a sure way to lose customers (sure, maybe they don't have an alternative to using us at the moment, but soon they most likely will).
Sure, of course. But it might be a good trade-off to get to market quickly with something that is less user-friendly, rather than wait X months to build a super-distributed solution.
What about a query like this?
SELECT region, count(*) FROM purchases
There is no sensible way for the user to filter here... yet we basically need to load all the data.
But there is:
SELECT region, count(*) FROM purchases WHERE month(date) = 1 AND year(date) = 2015
SELECT region, count(*) FROM purchases WHERE month(date) = 2 AND year(date) = 2015
etc.
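For completeness, the post-cloak tallying this implies could be as simple as the following Elixir sketch (the result shapes are assumed for illustration):

```elixir
# Anonymized results of the monthly sub-queries (shapes assumed for illustration):
monthly_results = [
  [%{region: "north", count: 120}, %{region: "south", count: 80}],
  [%{region: "north", count: 95}, %{region: "south", count: 102}]
]

monthly_results
|> List.flatten()
|> Enum.group_by(& &1.region, & &1.count)
|> Map.new(fn {region, counts} -> {region, Enum.sum(counts)} end)
# => %{"north" => 215, "south" => 182}
```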
Sure, of course. But it might be a good trade-off to get to market quickly with something that is less user-friendly, rather than wait X months to build a super-distributed solution.
It doesn't have to be a super-distributed solution. It should be a solution that doesn't crash under load, and that ideally performs reasonably well. There might be simple tricks we can do to speed up performance dramatically. We owe ourselves to at least think it through before giving up.
There might be simple tricks we can do to speed up performance dramatically. We owe ourselves to at least think it through before giving up.
Absolutely. We should leave every approach on the table, including distributed and centralized.
If we use something like Drill, then it's one implementation in order to conquer most BigData solutions out there.
If Drill will establish a baseline database language we can use to get what we need, possibly without moving a lot of data from the DB, that's certainly a compelling option.
However, this needs to be verified. It's not just about the base language, but also about the internal implementation. If Drill itself fetches all rows and processes them internally, then it would be no better than what we'd do in the Cloak.
Gang, I want to propose that there is never a reason for moving a lot of data into the cloak (or almost never).
As I said, the main reason for moving data into the cloak is to solve the problem once. That is a huge benefit, so I wouldn't discard it lightly, especially without exploring performance implications.
If something like Drill can allow us to get the same benefits then it's certainly a viable option. Either way, each option needs to be explored on a large dataset before being accepted/discarded.
If Drill will establish a baseline database language we can use to get what we need, possibly without moving a lot of data from the DB, that's certainly a compelling option.
Drill allows normal SQL, in addition to user defined functions and aggregates (aggregates don't seem to be a stable feature yet). It could act as a unifying layer for a vast number of backends. Definitely something I want to explore.
If Drill itself fetches all rows and processes them internally, then it would be no better than what we'd do in the Cloak.
I am looking at that as we speak. It has a combination of ability to push work down into the native execution engine, but when it comes to user defined functions and aggregates it certainly does everything inside each "drillbit", meaning it loads all the data from the source. The difference here being that there is a "drillbit" co-located with each storage machine. Basically it is very much the same setup we had in our old cloaks. Each can accept and manage a distributed calculation.
These large corporate Hadoop clusters are in the hundreds (or thousands) of nodes. Having work spread across them is definitely going to give us a boost compared to loading the data into a single cloak.
As I said, the main reason for moving data into the cloak is to solve the problem once. That is a huge benefit, so I wouldn't discard it lightly, especially without exploring performance implications.
Agreed, it should definitely not be discarded lightly. I wonder though if some sort of middle way is possible, where we collapse duplicate rows into one of:
col1, col2, ..., [user-id1, user-id2, ...]
or
col1, col2, ..., occurrence_count, user_ids_hash
Worth investigating.
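A rough Elixir sketch of that collapsing step (MapSet stands in here for whatever mergeable uid structure or uid hash we would actually use):

```elixir
# Rows are {uid, other_columns}; collapse duplicates of other_columns into
# one entry carrying an occurrence count plus the set of uids.
rows = [
  {"user-1", {"berlin", "2015-01"}},
  {"user-2", {"berlin", "2015-01"}},
  {"user-1", {"kaiserslautern", "2015-02"}}
]

collapsed =
  Enum.reduce(rows, %{}, fn {uid, cols}, acc ->
    Map.update(acc, cols, {1, MapSet.new([uid])}, fn {count, uids} ->
      {count + 1, MapSet.put(uids, uid)}
    end)
  end)
# => %{
#      {"berlin", "2015-01"} => {2, MapSet of ["user-1", "user-2"]},
#      {"kaiserslautern", "2015-02"} => {1, MapSet of ["user-1"]}
#    }
```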
Just brainstorming, here is one possible general approach to divide-and-conquer for anonymization:
At the cloak, we understand basic performance limitations (we can handle so many rows, so many distinct users, so much data, etc.) Essentially, we have a good idea of what can be loaded into memory.
When the cloak gets a query, it starts by doing a preliminary query of the database to get some sense of what the size of the data is going to be. If it looks like it will fit into the cloak, then make the query. If not, then consider (automatically) breaking up the query into batches of distinct users, for instance using the equivalent of where mask(hash(uid), 0xf) = 0, where mask(hash(uid), 0xf) = 1, ..., where mask(hash(uid), 0xf) = f. This can work as long as there are enough users in each batch that the noise remains relatively small.
If the data can't be broken up this way, then as a last resort appeal to the analyst...
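A sketch of what generating those batches could look like (Elixir; the mask/hash in the WHERE fragment is pseudo-SQL from the comment above, and :erlang.phash2 merely stands in for whatever hash the backend actually supports):

```elixir
batches = 16

# The WHERE fragments sent to the database, one per batch of users:
batch_filters =
  for batch <- 0..(batches - 1) do
    hex = batch |> Integer.to_string(16) |> String.downcase()
    "WHERE mask(hash(uid), 0xf) = #{hex}"
  end

# Cloak-side equivalent, if we already hold the rows and just want to
# process users batch by batch:
batch_of = fn uid -> :erlang.phash2(uid, batches) end
batch_of.("some-user-id")
# => an integer in 0..15
```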
The difference here being that there is a "drillbit" co-located with each storage machine.
We could do this as well with Cloak.
I wonder though if some sort of middle way is possible, where we collapse duplicate rows into one of: col1, col2, ..., [user-id1, user-id2, ...] or col1, col2, ..., occurrence_count, user_ids_hash
IIRC, I actually did the latter in PostgreSQL for streaming queries. That certainly reduces the amount of data sent over the wire.
If we use something like Drill, then it's one implementation in order to conquer most BigData solutions out there. Man oh man, we can't keep on reading terabytes of data each time we want to run a query. I don't think more ram is going to solve this nicely, whatever way we look at it...
How is Drill going to help us here? Does it cache intermediate results? Or do you mean we read data on each node instead of a single machine? But wouldn't we need to get it in a cloak anyway?
We could do this as well with Cloak.
We certainly can, but then we dramatically increase the complexity of our software, and become very dependent on the specifics of how different databases are deployed and managed... that's all stuff Drill knows how to do out of the box, and we don't. Putting cloaks for example on each HDFS node would be like re-inventing the wheel, when sports tires already exist.
Secondly, we would then have to implement some sort of distributed version of our anonymization algorithm, which is what this issue exists to discuss the viability of :)
How is Drill going to help us here? Does it cache intermediate results? Or do you mean we read data on each node instead of a single machine? But wouldn't we need to get it in a cloak anyway?
Drill:
a) reduces data in the native database engine if possible
b) if custom aggregates are used, performs reductions on the node where the data is stored
We would then hopefully get a vastly reduced dataset sent to the cloak, and additionally have that vastly reduced dataset be constructed in parallel.
I don't think Drill does any caching.
We certainly can, but then we dramatically increase the complexity of our software, and become very dependent on the specifics of how different databases are deployed and managed... that's all stuff Drill knows how to do out of the box, and we don't. Putting cloaks for example on each HDFS node would be like re-inventing the wheel, when sports tires already exist
Don't all these issues apply to Drill as well? Doesn't it also depend on the specifics of how different dbs are deployed and managed? I.e. other than exposing a unified language, isn't Drill as complex to setup as our system would be?
Secondly, we would then have to implement some sort of distributed version of our anonymization algorithm, which is what this issue exists to discuss the viability of :)
That again seems similar to Drill. If Drill aggregates on the node where the data is stored, why is it more complicated to do it in the Cloak? If Drill reduces the amount of data sent to the central coordinator, why can't we? Other than abstracting the query language (a notable benefit by all means), I don't see how Drill saves us from these issues.
Don't get me wrong, I'm not against Drill per se, but I am a bit cautious about generic db abstraction engines, especially since we plan on dealing with large datasets. While Drill claims it's optimized for speed, I'll take that with a grain of salt. It might be optimized for general cases, but that doesn't mean it's optimized for our case. Having our own local aggregator instead of the generic one gives us much more freedom to optimize the workflow as it suits us and to make different trade-offs. That requires more work, but it also gives us more options.
Don't all these issues apply to Drill as well? Doesn't it also depend on the specifics of how different dbs are deployed and managed? I.e. other than exposing a unified language, isn't Drill as complex to setup as our system would be?
The answer might very well be yes to all of the above. The difference being that this is complexity supported by a large community, and things we can largely outsource. There exist production clusters consisting of many thousands of nodes, both with Drill and Spark. If we can leverage the efforts these communities have put into making their systems work at scale, and then have the ability to write a much simplified cloak as a result, then that seems worthwhile.
If Drill aggregates on the node where the data is stored, why is it more complicated to do it in the Cloak
I don't think there is any difference in the complexity of aggregation, whether it is in the cloak or in Drill (or an equivalent tool). The aggregation logic has to be written by us in any case. The difference is in that we don't need to manage the coordination, scheduling, and transport.
but I am a bit cautious about generic db abstraction engines
That's probably a healthy attitude.
Having our own local aggregator instead of the generic one gives us much more freedom to optimize the workflow as it suits us and to make different trade-offs
We would have to use our own aggregator within Drill, Spark or equivalent anyway. I.e. this is code supplied by us.
What I have a problem with is having our first product implementation be a big-data solution. I would feel a lot better processing normal/average amounts of data for version 1.0.
Yes! I wholeheartedly agree...
We are kind of in this catch-22 position:
As an aside: I also spoke to quite a lot of small companies that were very interested in our solution over the weekend. They frequently operate in the 100GB realm too. Obviously there is a big gap between 100TB and 100GB, that has to be said.
As a second aside: there is a crazy divide between how much small and large companies are willing to pay. We have spoken to small-medium businesses who found 500/mth hard to stomach, whereas when we mention prices well above 10k/mth + support to the large enterprises, it is a complete non-issue.
Did we ever consider mandating the database server used for the analytics database? You already mention in #369 that some clients have a finely tuned SQL backend but don't allow ad-hoc queries to be run. This makes sense, as we don't want to overload the production db.
If another DB for analytics would be needed, would it be feasible if we required a particular database technology, so that our system could be as efficient as possible?
This is interesting, even if it turns out not to be ultimately viable. If a customer already has a data warehouse or whatever it's called for analytics, they presumably have some kind of mechanism to copy data over into it. That mechanism could be leveraged to push data into "our" database as well.
Precisely, and if they want us to handle terabytes of data with good query performance, then we certainly have some arguments for requiring particular technology for the job.
I am not a fan of this requirement. They probably prefer to throw money at a problem rather than complicate their setup (more servers, more network traffic between them). Having a setup consisting of a single machine that does the processing (and doesn't handle the storage) is very appealing to me.
Having a setup consisting of a single machine that does the processing (and doesn't handle the storage) is very appealing to me.
For that case I think we should seriously consider something like Haskell. Strong static typing might be more important than the distribution features offered by Erlang.
Unfortunately very true :( Major change number 1983.
This makes sense, as we don't want to overload the production db.
This was specific to Orange. They have in fact already set up a massive Hadoop environment to support ad-hoc queries. Whether they would set up an additional environment is not clear. Whether we get them as customers isn't clear either, though.
They probably prefer to throw money at a problem rather than complicate their setup (more servers, more network traffic between them).
With the discussed Drill approach that's already not going to be the case. I also think there will be additional work required on the client's behalf, because the database has to be tuned to our specific needs. Otherwise, perf will suck in prod.
Having a setup consisting of a single machine that does the processing (and doesn't handle the storage) is very appealing to me.
All other things being equal, I agree that a simpler single-machine setup is better. However, if we want to deal with a large amount of data and need to keep sensible response times, then the gains of a simplified code that deals with only one type of DB might be worth the cost of a more complicated setup.
This makes sense, as we don't want to overload the production db.
This was specific to Orange.
If we plan on offloading as much work as possible to DB, then it will be an issue for many if not all. Overloading a production db is likely not something we want to do, because we might end up DoS-ing our own client.
Strong static typing might be more important than the distribution features offered by Erlang.
Even on the single machine, our system is supposed to do many different things, such as run multiple queries, and we're also considering moving the Air part to the Cloak as well. Ultimately, it's still going to be a thing that has to run continuously and do different things, so fault-tolerance benefits of Erlang are IMO still desirable.
Using Haskell or something similar for the domain logic (for example compilation and execution of the single query) might be interesting though.
Just to be clear: no major change or pivot or anything is planned. We just need to understand what our options are when we move towards customers with large datasets. Whether that means a pure/optimised cloak, an approach where we do some level of pre-aggregation in the DB, or one where we manage to offload nearly everything into the DB is something we need to learn from experience and experimentation.
Either way, I think we should go with the simplest possible solution. For the time being that presumably means performing anonymization and final aggregation in the cloak. That way we can quickly get started experimenting in more realistic deployment scenarios, which is a prerequisite for being able to later on make informed decisions about what next steps are right for our product.
I think the discussion we have had so far has already brought some quite interesting insights to the table:
And secondly: let's keep our tech stack as simple as we can for the time being. This means: let's leave everything in Elixir as it is today. Once we know where the real bottlenecks actually are, and they are problems we cannot solve in other ways, we can always shift heavy computations into Haskell or C or some such. But let's not make those changes now.
Agree with all this. With the caveat, at the risk of being pedantic, that it isn't the size of the customer's dataset per se that matters, it is the amount of data that the analyst tries to transfer from the dataset to the cloak in a given query.
Very true. If my own querying is representative though, one usually starts with pretty open ended queries, which are then later on refined as it gets clearer what the data contains. These initial open ended queries (fewer where-clause restrictions) tend to require more data. But maybe they are exactly the ones where sampling would in fact be beneficial.
Haha, sorry, but this thread begs for it: "it's not the size that matters, but how you use it" (quote, which I in this case will attribute to @yoid2000's last comment)
groan.
This issue is meant to help us hash out and discuss whether our aggregate algorithms can work in a distributed manner.
We would benefit dramatically from pushing these computations into the databases themselves. Especially as we move towards larger NoSQL backends, moving all the data through the cloaks will not be feasible.
If the answer is no (we cannot do these algorithms in a distributed way), then we need to reconsider:
Most of the big data tools allow us to provide our own aggregate functions. These have to deterministically produce a final result, based on only seeing subsets of the data at a time.
One thing that would have to change in our implementations, would be that instead of generating a user-id-hash like we are doing today by sorting and hashing all the user id's, we presumably need to create some mergeable hash-map of the user ids which can then be used at the very end.
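A sketch of what such a mergeable structure could look like (Elixir; MapSet and sha256 are stand-ins, not necessarily what we use for the user-id-hash today):

```elixir
defmodule MergeableUidHash do
  # Each partition builds a set of the uids it has seen; sets can be merged
  # whenever partial results meet, and only the final merged set is sorted
  # and hashed into the user-id-hash.
  def new(uids \\ []), do: MapSet.new(uids)

  def merge(a, b), do: MapSet.union(a, b)

  def finalize(uids) do
    joined = uids |> Enum.sort() |> Enum.join(",")
    :crypto.hash(:sha256, joined) |> Base.encode16(case: :lower)
  end
end

# Partial sets built on different nodes, merged at the very end:
[MergeableUidHash.new(["u1", "u2"]), MergeableUidHash.new(["u2", "u3"])]
|> Enum.reduce(MergeableUidHash.new(), &MergeableUidHash.merge/2)
|> MergeableUidHash.finalize()
```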
So, let's discuss the following aggregates:
Which can, and which cannot, be done in a distributed way?
CC: @Aircloak/developers
https://trello.com/c/rfmJ94fK/7042-can-our-aggregates-be-done-in-the-db