Closed — obrok closed this issue 6 years ago
I ran a few more tests on the `nyctaxi` dataset running on acatlas2, and although most of the results aren't meaningful in any way, there are a couple of things that are worth mentioning:

- the min/max method for computing isolators can be 10 times faster than the current version (242 seconds vs 2160 seconds)
- doing a `count(*), count(distinct uid), count(distinct col1), count(distinct col2), count(distinct col3)` was more than 10 times slower than computing the isolator status with the optimized method (8219 seconds vs 643 seconds)
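As a rough illustration of the difference between the two methods: assuming that "isolator" here means a column in which every distinct value belongs to a single uid (my assumption, not spelled out in this thread), the min/max method needs only one `GROUP BY` pass with `min(uid)`/`max(uid)`, instead of one `count(distinct ...)` scan per column. A toy sqlite sketch:

```python
import sqlite3

# Toy data: col1 isolates users (each value maps to one uid),
# col2 does not (value "x" is shared by uids 1 and 2).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (uid INTEGER, col1 TEXT, col2 TEXT)")
con.executemany("INSERT INTO trips VALUES (?, ?, ?)", [
    (1, "a", "x"), (1, "b", "x"), (2, "c", "x"), (3, "d", "y"),
])

def isolates(con, col):
    """min/max method: a value belongs to a single uid iff
    min(uid) = max(uid) within its group; the column isolates
    when that holds for every group."""
    row = con.execute(f"""
        SELECT count(*) = sum(min_uid = max_uid)
        FROM (SELECT min(uid) AS min_uid, max(uid) AS max_uid
              FROM trips GROUP BY {col})
    """).fetchone()
    return bool(row[0])

print(isolates(con, "col1"), isolates(con, "col2"))  # True False
```

The point of the sketch is only the query shape: one grouped scan over (value, uid) instead of a separate distinct-count per column.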
I've been playing further with streaming. I've tried a different approach where I ran concurrent queries (one for each column: `select uid, col1`, `select uid, col2`, ...), and aggregated the results in the cloak.
The nice thing about this approach is that it doesn't put significant pressure on the database server. While streaming 13 concurrent queries, a single core on the database server sat at around 100%, while the memory overhead was insignificant. In contrast, a single offloaded isolator query can peg the CPU to 100% and cause a noticeable memory overhead of a few GB.
Therefore, this approach, while slower for a single column, has the benefit of enabling concurrency, which can improve throughput. Concurrent computation also reduces the chance of a large, high-variance column blocking all other columns, or a large table blocking all other tables.
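The shape of the streaming approach can be sketched as follows. The database is simulated with in-process generators, and the per-column aggregate is a plain value count, since the thread doesn't spell out the exact aggregation the cloak performs, so treat both as placeholders:

```python
import concurrent.futures as cf

def stream_column(col):
    # Stands in for streaming `SELECT uid, <col> FROM trips` row by row.
    for uid in range(1000):
        yield uid, f"{col}-{uid % 7}"

def aggregate(col):
    # Cloak-side aggregation over the streamed rows; here just value counts.
    counts = {}
    for uid, value in stream_column(col):
        counts[value] = counts.get(value, 0) + 1
    return col, counts

# One concurrent stream per column, mirroring the 13 concurrent queries
# mentioned above.
columns = [f"col{i}" for i in range(13)]
with cf.ThreadPoolExecutor(max_workers=13) as pool:
    results = dict(pool.map(aggregate, columns))

print(len(results), sum(results["col0"].values()))  # 13 1000
```

The design point is that each column's stream is independent, so a slow column only occupies one worker instead of blocking the whole computation.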
I only streamed the first 10M records of the `trips` table, to keep the test duration reasonable. Some columns were computed in about 90 seconds, while others took about 5 minutes. The memory usage in the cloak was about 1 GB.
Projecting the numbers to the full size of the `trips` table (173M records), the expected time to compute the isolator status would be from 25 to 90 minutes, which leaves a lot to be desired. More concerning is the fact that we'd need about 17 GB of memory in the cloak. The memory requirements might be even higher, depending on the nature of the data.
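The extrapolation above is just linear scaling in the row count; spelled out:

```python
# Measured on the first 10M rows: 90 s to 5 min per column, ~1 GB of memory.
# Scale linearly to the full 173M-row trips table.
scale = 173_000_000 / 10_000_000   # ~17.3x more rows

low_s = 90 * scale                 # fastest columns, ~26 minutes
high_s = 5 * 60 * scale            # slowest columns, ~86 minutes
mem_gb = 1.0 * scale               # ~17 GB in the cloak

print(low_s / 60, high_s / 60, mem_gb)
```

This matches the 25-90 minute and ~17 GB figures quoted above, under the (optimistic) assumption that both time and memory scale linearly with row count.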
In general, my impression is that streaming has some interesting aspects, mostly because it reduces the pressure on the database server, and also because it's a one-size-fits-all solution. However, the approach seems too slow and requires too much memory for larger datasets. My estimate is that its limit is around 20M rows (2-10 minutes and 2 GB of memory overhead).
> The nice thing about this approach is that it doesn't cause a significant pressure on the database server. While streaming 13 concurrent queries, a single core on the database server was around 100%, while memory overhead was insignificant
I wonder how this affects the available disk IOPS on the DB server. For large tables I would think it unlikely that the data is cached, hence, with 13 concurrent queries, you are doing 13 full table scans in parallel? Do you have any notion of whether the per-query time increases with concurrency? I also wonder how this would affect other (productive) queries running on the system in parallel.
After playing some more with this, my conclusion is that the streaming approach has too many issues and downsides. The last problem I've seen is high CPU usage for virtual/projected tables. Issuing concurrent queries in this case can stress the db server, presumably because the server is doing more complex things (e.g. joins, filters, aggregations, etc.) concurrently. This means that the main benefit of the streaming approach (reduced CPU/memory stress on the db server) doesn't even hold in all cases.
The performance still isn't stellar (Cristian reports two hours for our nightly server), but I think at this point it is good enough for us to make a release.
In combination with caching of results, I think using this version is OK.
I deployed the isolator code to @cristianberneanu's big data cloak to check out the performance. Findings:

- `accounts` table - computing the isolator property for a single column takes on the order of 10 minutes.
- `cstransactions` - it seems it was able to compute the property for some columns in about 5 hours each. For other columns it times out after 12 hours in the database.
- Tables where the computation times out (notably `transactions`, which it didn't even start processing after 3 days) cannot be queried in practice, because the queries will block until isolators are computed for the columns used in the query, which never happens, because of the timeouts.
- The `transactions` and `cstransactions` tables are projected, which most likely makes them slower than regular tables of comparable size.

So it seems like currently the limit with this solution is about 1e8 rows, perhaps 1e9 rows without a projection. We don't necessarily need to do anything about that right now, except maybe documenting it. In the near future (this or next milestone) we could try to detect such timeouts and react, perhaps by excluding the given column from querying. I don't think it makes sense to extend the timeout, because it would mean waiting more than 12 hours per column. This assumes we don't figure out a way to make the isolator query faster, which would of course be nice.
For the particular case of something like this `transactions` table, we could offer a virtual table that only contains the last N months of data. Because there is an index on `created_at`, querying that table will be much faster.

@sasa1977, @sebastian (if you're receiving emails) - WDYT?
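A minimal sketch of what such a virtual table could look like, assuming it's implemented as a SQL view over an indexed `created_at` column (all table/column names here are illustrative, not taken from the actual schema):

```python
import sqlite3

# Hypothetical setup: a large transactions table with an index on
# created_at, plus a view that exposes only the last N (here 3) months.
# Queries against the view can use the index to skip old rows.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE transactions (uid INTEGER, amount REAL, created_at TEXT);
    CREATE INDEX idx_transactions_created_at ON transactions (created_at);
    CREATE VIEW recent_transactions AS
        SELECT * FROM transactions
        WHERE created_at >= date('now', '-3 months');
""")

today = con.execute("SELECT date('now')").fetchone()[0]
con.executemany("INSERT INTO transactions VALUES (?, ?, ?)", [
    (1, 10.0, "2000-01-01"),  # old row, filtered out by the view
    (2, 20.0, today),         # recent row, visible through the view
])

print(con.execute(
    "SELECT count(*) FROM recent_transactions").fetchone()[0])  # 1
```

Isolator computation (and user queries) would then target `recent_transactions` instead of the full table, bounding the scanned row count regardless of the table's total size.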