Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform.

Isolator cache warmup performance #2837

Closed obrok closed 6 years ago

obrok commented 6 years ago

I deployed the isolator code to @cristianberneanu's big data cloak to check out the performance. Findings:

  1. It works OK for the 34M-row accounts table - computing the isolator property for a single column takes on the order of 10 minutes.
  2. It crashes a lot for the 5G-row cstransactions table. It was able to compute the property for some columns in about 5 hours each; for other columns it times out after 12 hours in the database.
    • As a result, this table (and transactions, which it hadn't even started processing after 3 days) cannot be queried in practice: queries block until isolators are computed for the columns they use, which never happens because of the timeouts.
    • @cristianberneanu notes that the transactions and cstransactions tables are projected, which most likely makes it slower than a regular table of comparable size.

So it seems like currently the limit with this solution is about 1e8 rows, perhaps 1e9 rows without a projection. We don't necessarily need to do anything about that right now, except maybe documenting. In the near future (this or next milestone) we could try to detect such timeouts and react, perhaps by excluding the given column from querying. I don't think it makes sense to extend the timeout, because it would mean waiting more than 12 hours per column. This assumes we don't figure out a way to make the isolator query faster, which would of course be nice.
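The "detect the timeout and exclude the column" idea could look roughly like this (a Python sketch with made-up names, not the cloak's actual bookkeeping):

```python
# Hypothetical sketch: if computing the isolator property for a column
# times out, mark the column as excluded so queries touching it can fail
# fast instead of blocking indefinitely.

ISOLATOR_TIMEOUT = 12 * 60 * 60  # seconds; matches the 12h database timeout

class IsolatorCache:
    def __init__(self):
        # (table, column) -> "isolating" | "not_isolating" | "excluded"
        self._status = {}

    def compute(self, table, column, run_query):
        try:
            isolating = run_query(table, column, timeout=ISOLATOR_TIMEOUT)
            self._status[(table, column)] = (
                "isolating" if isolating else "not_isolating"
            )
        except TimeoutError:
            # React to the timeout instead of retrying forever.
            self._status[(table, column)] = "excluded"
        return self._status[(table, column)]

    def queryable(self, table, column):
        return self._status.get((table, column)) != "excluded"
```

Queries against an excluded column could then be rejected immediately with a descriptive error, rather than hanging behind a computation that will never finish.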

For the particular case of something like this transactions table, we could offer a virtual table that only contains the last N months of data. Because there is an index on created_at, querying that table will be much faster.
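Illustrated with SQLite (hypothetical schema and cutoff date), the virtual table would essentially be a view restricted by the indexed created_at column:

```python
# Illustrative only: a view exposing just the recent rows, so the isolator
# query scans a fraction of the data and can use the created_at index.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (uid INTEGER, amount REAL, created_at TEXT)"
)
conn.execute("CREATE INDEX idx_created_at ON transactions (created_at)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, 10.0, "2018-01-15"), (2, 20.0, "2018-05-01"), (3, 30.0, "2018-06-20")],
)

# Hypothetical cutoff standing in for "the last N months".
conn.execute(
    "CREATE VIEW recent_transactions AS "
    "SELECT * FROM transactions WHERE created_at >= '2018-04-01'"
)
rows = conn.execute("SELECT COUNT(*) FROM recent_transactions").fetchone()[0]
print(rows)  # only the rows inside the recent window are visible
```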

@sasa1977, @sebastian (if you're receiving emails) - WDYT?

cristianberneanu commented 6 years ago

I ran a few more tests on the nyctaxi dataset running on acatlas2, and although most of the results aren't meaningful in any way, there are a couple of things that are worth mentioning:

sasa1977 commented 6 years ago

I've been playing further with streaming. I've tried a different approach where I ran concurrent queries (one for each column: select uid, col1 and select uid, col2, ...), and aggregated the results in the cloak.

The nice thing about this approach is that it doesn't put significant pressure on the database server. While streaming 13 concurrent queries, a single core on the database server was at around 100%, while the memory overhead was insignificant. In contrast, a single offloaded isolator query can peg the CPU to 100% and cause a noticeable memory overhead of a few GB.

Therefore, this approach, while slower for a single column, has the benefit of enabling concurrency, which can improve throughput. Concurrent computation can also reduce the chance of a large column with a lot of variance blocking all other columns, or of a large table blocking all other tables.

I only streamed the first 10M records of the trips table, to keep the test duration reasonable. Some columns were computed in about 90 seconds, while others took about 5 minutes. The memory usage in the cloak was about 1 GB.

Projecting the numbers to the full size of the trips table (173M records), the expected time to compute the isolator property would be from 25 to 90 minutes, which leaves a lot to be desired. More concerning is the fact that we'd need about 17GB of memory in the cloak. The memory requirements might be higher, depending on the nature of the data.
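The projected figures follow from linear scaling of the 10M-row sample:

```python
# Linear scaling from the 10M-row sample to the full 173M-row trips table.
scale = 173_000_000 / 10_000_000   # 17.3x more rows

fast_minutes = 90 * scale / 60     # fastest columns: ~26 minutes
slow_minutes = 300 * scale / 60    # slowest columns: ~87 minutes
memory_gb = 1.0 * scale            # ~17 GB of cloak memory

print(fast_minutes, slow_minutes, memory_gb)
```

This assumes both time and memory grow linearly with row count, which is optimistic if later rows have more distinct values.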

In general, my impression is that streaming has some interesting aspects, mostly because it reduces the pressure on the database server, and also because it's a one-size-fits-all solution. However, the approach seems too slow and requires too much memory for larger datasets. My estimate is that its limit is around 20M rows (2-10 minutes and 2 GB of memory overhead).

sebastian commented 6 years ago

The nice thing about this approach is that it doesn't put significant pressure on the database server. While streaming 13 concurrent queries, a single core on the database server was at around 100%, while the memory overhead was insignificant

I wonder how this affects the available disk IOPS on the DB server. For large tables I would think it unlikely that the data is cached; hence, with 13 concurrent queries, you are doing full table scans 13 times in parallel? Do you have any sense of whether the per-query time increases with concurrency? I also wonder how this would affect other (production) queries running on the system in parallel.

sasa1977 commented 6 years ago

After playing some more with this, my conclusion is that the streaming approach has too many issues and downsides. The last problem I've seen is high CPU usage for virtual/projected tables. Issuing concurrent queries in this case can stress the db server, presumably because the server is doing more complex things (joins, filters, aggregations, etc.) concurrently. This means that the main benefit of the streaming approach (reduced CPU/memory stress on the db server) doesn't even hold for all cases.

sebastian commented 6 years ago

The performance still isn't stellar (Cristian reports two hours for our nightly server), but I think at this point it is good enough for us to make a release.

sebastian commented 6 years ago

In combination with caching of results, I think using this version is OK.
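A minimal illustration of the result caching mentioned here, so the expensive isolator computation runs at most once per column (illustrative Python, not the cloak's implementation):

```python
# Hypothetical stand-in for the expensive isolator computation; the real
# version issues database queries that can take hours per column.
import functools

calls = []

@functools.lru_cache(maxsize=None)
def isolator_property(table, column):
    calls.append((table, column))          # record that we paid the full cost
    return column == "email"               # pretend only `email` isolates users

# The first call pays the full cost; repeats are served from the cache.
isolator_property("accounts", "email")
isolator_property("accounts", "email")
print(len(calls))  # the expensive computation ran only once
```

In the real system the cache would need to be persisted and keyed per data source, so a cloak restart doesn't trigger another multi-hour warmup.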