ExpediaGroup / waggle-dance

Hive federation service. Enables disparate tables to be concurrently accessed across multiple Hive deployments.
Apache License 2.0
273 stars 76 forks source link

How do Filters, Sorting, Joins, and Merges work with Waggle Dance #121

Closed rajeshsaluja closed 6 years ago

rajeshsaluja commented 6 years ago

I've question on Filter, Sorting, Joins, Merge through waggle-dance

E.g. select * from remotedb.emp where created_date > sysdate - 365 through waggle-dance (emp table exists in remote hive metastore)

Does above statement does filter at remote metastore and then waggle dance get the data or entire table data is pulled before applying filter condition.

Similar question for sorting or joins of tables across metastore. Where does the actual operation is carried out.

Thanks

patduin commented 6 years ago

Waggle Dance only interacts with Hive on the metadata level. The metastores in the federation are simply feeding raw data locations to your Hive execution engine. All actual data manipulation is done on the cluster from which you fire your query. No additional movement of data takes place prior to query execution.

What this mean in practice is that the cluster on which you are running the query requires access to the file stores referenced in the table metadata of your source tables, wherever they may be. For large tables this also implies that you'll require adequate bandwidth to pull the underlying table data across to your execution engine.

Our architectures are normally doing this in a single cloud provider, between different clusters and accounts in the same region where such access is plentiful. We have also experimented with federations across regions.

rajeshsaluja commented 6 years ago

Great . Thanks on the clarification !

rajeshsaluja commented 6 years ago

We are doing experimentation across regions with one of the region in our own data center and other in AWS.