airbnb / reair

ReAir is a collection of easy-to-use tools for replicating tables and partitions between Hive data warehouses.
Apache License 2.0
280 stars 97 forks

Create replication jobs in batches #74

Closed saguziel closed 6 years ago

saguziel commented 6 years ago

This publishes create statements in the main loop as batch statements. Tests and checkstyle WIP but all the existing tests pass locally.

It does indeed benchmark well.
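To illustrate the batching idea described above, here is a minimal sketch, not the code from this PR. It assumes pending creates are grouped into fixed-size batches and that each batch becomes one multi-row INSERT. The class name `BatchSketch`, the `operation` column, and the batch size are hypothetical; only the `replication_jobs` table name comes from this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: groups pending job rows so each group is written in one statement. */
class BatchSketch {

  /** Splits items into consecutive batches of at most batchSize elements. */
  public static <T> List<List<T>> toBatches(List<T> items, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    List<List<T>> batches = new ArrayList<>();
    for (int start = 0; start < items.size(); start += batchSize) {
      int end = Math.min(start + batchSize, items.size());
      batches.add(new ArrayList<>(items.subList(start, end)));
    }
    return batches;
  }

  /**
   * Builds the text of one parameterized multi-row INSERT for a batch.
   * The column name is an assumption made for illustration only.
   */
  public static String buildInsert(List<String> operations) {
    StringBuilder sql =
        new StringBuilder("INSERT INTO replication_jobs (operation) VALUES ");
    for (int i = 0; i < operations.size(); i++) {
      sql.append(i == 0 ? "(?)" : ", (?)");
    }
    return sql.toString();
  }
}
```

With a batch size of 100 (an arbitrary choice here), 2400 pending creates would need 24 statements instead of 2400, which is the kind of reduction in query count batching is meant to give.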

saguziel commented 6 years ago

@plypaul @aoen

plypaul commented 6 years ago

Can you post some benchmark before and after?

plypaul commented 6 years ago

Also, the build seems to be failing, possibly because some classes / methods are missing comments.

saguziel commented 6 years ago

The build errors are checkstyle violations, which are WIP. I'll do some higher-quality benchmarking, but so far the change does not seem to affect the time per query; it reduces the number of queries x-fold.

saguziel commented 6 years ago

If the approach looks generally okay, I will start adding docs and fixing the checkstyle errors.

plypaul commented 6 years ago

Comments will expedite and aid the review process as we can figure out what classes / methods are supposed to do and also better check assumptions.

plypaul commented 6 years ago

Overall approach looks good, but I may have missed the clear benefit of using futures here.

saguziel commented 6 years ago

Cool, was mainly looking for a review on overall approach.

I used CompletableFutures because I think the abstraction is cleaner: the deferredCreates each return a future of their result, rather than going through a builder where the results depend on the ordering.
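The future-based design can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: `DeferredCreatesSketch`, `defer`, and `flush` are hypothetical names, and the ids are generated locally as a stand-in for whatever the database would return after a batched write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch: each deferred create hands back a future of its own result. */
class DeferredCreatesSketch {
  private final List<String> pendingOps = new ArrayList<>();
  private final List<CompletableFuture<Long>> pendingResults = new ArrayList<>();
  private long nextId = 1; // stand-in for ids a real database would assign

  /** Queues a create and returns a future that completes when the batch is flushed. */
  public CompletableFuture<Long> defer(String operation) {
    CompletableFuture<Long> result = new CompletableFuture<>();
    pendingOps.add(operation);
    pendingResults.add(result);
    return result;
  }

  /** Simulates one batched write, then completes every queued future with its id. */
  public int flush() {
    int flushed = pendingOps.size();
    for (int i = 0; i < flushed; i++) {
      pendingResults.get(i).complete(nextId++);
    }
    pendingOps.clear();
    pendingResults.clear();
    return flushed;
  }
}
```

A caller keeps the future it got from `defer` and can chain on it (for example with `thenApply`), so it never has to know at which position its row ended up in the batch. With a builder, the caller would have to match results back to requests by index.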

saguziel commented 6 years ago

The benchmarked numbers refer to non-filtered, non-noop entries (i.e., entries that create a replication job). The case for noop or filtered entries probably hasn't changed much.

Before: 30-40 jobs per second

After: 600-1200 jobs per second

Benchmark setup: Create 2400 identical audit log entries (type THRIFT_ALTER_TABLE, each of which creates a COPY_UNPARTITIONED_TABLE operation that ends up NOT_COMPLETABLE after execution), with corresponding INPUT and OUTPUT objects. Run ReAir until it logs "Sleeping for 10000ms because no more entries". Clear the replication_jobs table and the audit log counter, then repeat.
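As a rough sanity check on these figures, the sketch below works out how long one run over the 2400 entries would take at the reported rates. It takes the before and after numbers at face value and assumes the rate is constant over a run; the class and method names are illustrative only.

```java
/** Hypothetical helper for back-of-the-envelope arithmetic on the reported rates. */
class ThroughputSketch {

  /** Seconds needed to create the given number of jobs at a constant rate. */
  public static double secondsFor(int entries, double jobsPerSecond) {
    return entries / jobsPerSecond;
  }

  public static void main(String[] args) {
    int entries = 2400; // size of one benchmark run, from the setup described above
    System.out.printf("before: %.0f-%.0f s%n",
        secondsFor(entries, 40), secondsFor(entries, 30));
    System.out.printf("after: %.0f-%.0f s%n",
        secondsFor(entries, 1200), secondsFor(entries, 600));
    // Most conservative and most optimistic ratios of the two reported ranges.
    System.out.printf("speedup: %.0fx-%.0fx%n", 600.0 / 40, 1200.0 / 30);
  }
}
```

Under those assumptions a run drops from roughly 60-80 seconds to 2-4 seconds, a speedup somewhere between 15x and 40x.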

For context, we could process all of last month's events (all converted to non-noop operations) in a few hours

saguziel commented 6 years ago

@plypaul ptal

plypaul commented 6 years ago

LGTM aside from last comment.