StrumentiResistenti / hivedump

Simple tool to dump hive tables metadata
GNU General Public License v2.0

Proposal: run Hive in memory with zero external dependencies #6

Open wangzhihao opened 7 years ago

wangzhihao commented 7 years ago

Hi Tx0, the current solution needs an external environment that provides the hive command.

It would be better to make the tool zero-dependency, just like HiveRunner, which runs a HiveServer2 inside its own JVM and therefore doesn't need any external Hive or Hadoop installation.

What do you think about it?

StrumentiResistenti commented 7 years ago

hivedump was born as a quick hack to solve a problem: hive and the Perl interpreter were available, so that's what I used. But I'm not happy with calling a command line tool (hive, beeline or whatever) multiple times to extract information from the metastore, because it's clumsy, slow and exposed to parsing errors, so I'm inclined to change this.

However, I'm not sure that running a separate HiveServer2 is a good choice. Won't it require a proper set of configuration files (hive-site.xml and possibly hdfs-site.xml and more) to access the metastore? Moreover, hivedump is a tool to dump database metadata, so a HiveServer2 instance is supposed to exist somewhere already.

Don't forget that HiveRunner is a unit test framework: it just needs Hive logic to formally validate statements, and being independent and embeddable is a plus there.

I would instead investigate other ways like:

- querying the metastore directly, given its connection details
- talking to an existing HiveServer2 through the Hive JDBC driver

Both could be implemented in Java or in Scala.

I would like to read your opinion.

StrumentiResistenti commented 7 years ago

By using JDBC, hivedump could also restore the dump transparently on a second cluster without producing an intermediate HQL script, just by executing each statement in the second environment.
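
To make the idea concrete, here is a minimal sketch of that JDBC round trip, assuming a plain Hive JDBC connection to both clusters; the host names, ports and credentials are placeholders and this is not hivedump's actual code:

```scala
import java.sql.DriverManager
import scala.collection.mutable.ListBuffer

object JdbcDumpRestore {
  def main(args: Array[String]): Unit = {
    // Hive JDBC driver; hosts, ports and credentials below are placeholders
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val source = DriverManager.getConnection("jdbc:hive2://source-host:10000/default", "hive", "")
    val target = DriverManager.getConnection("jdbc:hive2://target-host:10000/default", "hive", "")
    val srcStmt = source.createStatement()
    val dstStmt = target.createStatement()

    // Collect the table names of the source database
    val names = ListBuffer[String]()
    val tables = srcStmt.executeQuery("SHOW TABLES")
    while (tables.next()) names += tables.getString(1)

    // Read each table's DDL and replay it on the target cluster,
    // without writing an intermediate HQL script
    // (LOCATION clauses and table properties may still need adjusting)
    for (name <- names) {
      val rs = srcStmt.executeQuery(s"SHOW CREATE TABLE $name")
      val ddl = new StringBuilder
      while (rs.next()) ddl.append(rs.getString(1)).append("\n")
      dstStmt.execute(ddl.toString)
    }

    source.close()
    target.close()
  }
}
```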

wangzhihao commented 7 years ago

We can run HiveServer2 as a normal Java dependency, without hive-site.xml; here is some example code. Even though HiveRunner is a test framework, in principle it just starts a HiveServer2 in memory and uses it for test purposes, so we can use the same trick. We also don't need HDFS and its related settings, since we only manipulate DDL and no DML. You can see this from the code example above.
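
wangzhihao's original snippet is not reproduced in this thread; purely as an illustration, a minimal sketch of such an embedded setup might look like the following, assuming the Hive 2.x HiveConf/HiveServer2 API, an in-memory Derby metastore, and illustrative port and path values (additional options may be needed in practice):

```scala
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hive.service.server.HiveServer2

object EmbeddedHiveServer2 {
  def main(args: Array[String]): Unit = {
    val conf = new HiveConf()
    // In-memory Derby metastore: no hive-site.xml, no external cluster
    conf.setVar(HiveConf.ConfVars.METASTORECONNECTURLKEY,
      "jdbc:derby:memory:metastore_db;create=true")
    // Local warehouse directory instead of HDFS (DDL only, no DML)
    conf.setVar(HiveConf.ConfVars.METASTOREWAREHOUSE, "/tmp/hive-warehouse")
    conf.setIntVar(HiveConf.ConfVars.HIVE_SERVER2_THRIFT_PORT, 11000)
    // Real setups may also need schema auto-creation and scratch dir settings

    val server = new HiveServer2()
    server.init(conf)
    server.start()
    // Clients can now connect on jdbc:hive2://localhost:11000/default
  }
}
```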

My concern is the API change. With the suggested approach, we specify the metastore information, such as the JDBC URL, the driver name, username & password, etc. With the existing approach, we only need a hive shell. The two approaches both have their own strengths, and I don't see either one replacing the other perfectly. The metastore way removes the need for an external Hive (cluster) dependency, which might be the most fragile point in the whole system. The hive-shell way is simpler, and sometimes we don't want to specify the metastore information explicitly.
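
As an illustration of the "metastore way" described above, a minimal sketch could connect straight to the metastore database with the supplied JDBC details; DBS and TBLS are tables of the standard Hive metastore schema, while the MySQL driver, URL and credentials are placeholder assumptions:

```scala
import java.sql.DriverManager

object MetastoreDump {
  def main(args: Array[String]): Unit = {
    // The metastore connection details must be supplied explicitly
    // (JDBC URL, driver name, username & password), as discussed above
    Class.forName("com.mysql.jdbc.Driver")
    val conn = DriverManager.getConnection(
      "jdbc:mysql://metastore-host:3306/hive_metastore", "hive", "secret")

    // DBS and TBLS belong to the standard Hive metastore schema
    val rs = conn.createStatement().executeQuery(
      """SELECT d.NAME, t.TBL_NAME, t.TBL_TYPE
        |  FROM TBLS t JOIN DBS d ON t.DB_ID = d.DB_ID
        | ORDER BY d.NAME, t.TBL_NAME""".stripMargin)

    while (rs.next())
      println(s"${rs.getString(1)}.${rs.getString(2)} (${rs.getString(3)})")

    conn.close()
  }
}
```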

StrumentiResistenti commented 7 years ago

I'm sorry @wangzhihao for the long delay. I've been busy earning a certification over the last few days. I'll evaluate your proposal asap.

StrumentiResistenti commented 7 years ago

Hi @wangzhihao, I've done some exploration of your idea and in the end I've opted for the JDBC way. Since I'm learning Scala, I've preferred it over Java. You can find the result here: https://github.com/StrumentiResistenti/bear. Let me know if you find it useful or if you need help building it. I installed the Ubuntu packages for scala and sbt to build it; all the dependencies are downloaded by sbt during the compile phase.
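
For reference, a hypothetical build.sbt along these lines would only need a Scala version and a Hive JDBC dependency, which sbt resolves during compilation (this is a sketch, not bear's actual build file; see the repository for the real one):

```scala
// Hypothetical build.sbt sketch; version numbers are illustrative
name := "bear"
version := "0.1"
scalaVersion := "2.12.8"

libraryDependencies ++= Seq(
  // Hive JDBC driver pulled in by sbt; no external Hive installation needed to build
  "org.apache.hive" % "hive-jdbc" % "2.3.4"
)
```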