wangzhihao opened this issue 7 years ago
hivedump was born as a quick hack to solve a problem: hive and the Perl interpreter were available, and that's what I used. But I'm not happy with calling a command-line tool (hive, beeline, or whatever) multiple times to extract information from the metastore: it's clumsy, slow, and exposed to parsing errors, so I'm inclined to change this.
However, I'm not sure that running a separate HiveServer2 is a good choice. Won't it require a proper set of configuration files (hive-site.xml and possibly hdfs-site.xml and more) to access the metastore? Moreover, hivedump is a tool to dump database metadata, so a HiveServer2 instance is supposed to exist somewhere already.
Don't forget that HiveRunner is a unit test framework: it just needs Hive logic to formally validate statements, and being independent and embeddable is a plus.
I would instead investigate other ways like:
Both could be implemented in Java or in Scala.
I would like to read your opinion.
By using JDBC, hivedump could also restore the dump transparently on a second cluster without producing an intermediate HQL script, just by executing each statement in the second environment.
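A dump-and-replay flow over JDBC could look roughly like the sketch below. Everything here is hypothetical: the class and method names are mine, error handling is omitted, and it assumes the Hive JDBC driver (`org.apache.hive.jdbc.HiveDriver`) is on the classpath and that both clusters expose a HiveServer2 JDBC endpoint.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class HiveDumpReplay {

    // Pure helper: the statement used to extract one table's DDL.
    static String showCreate(String db, String table) {
        return "SHOW CREATE TABLE `" + db + "`.`" + table + "`";
    }

    // Concatenate a SHOW CREATE TABLE result set into a single DDL string.
    static String collectDdl(ResultSet rs) throws SQLException {
        StringBuilder ddl = new StringBuilder();
        while (rs.next()) {
            ddl.append(rs.getString(1)).append('\n');
        }
        return ddl.toString();
    }

    // Dump every table of `db` from the source cluster and replay the
    // DDL on the target cluster, with no intermediate HQL script.
    static void replay(String sourceUrl, String targetUrl, String db) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection src = DriverManager.getConnection(sourceUrl);
             Connection dst = DriverManager.getConnection(targetUrl);
             Statement read = src.createStatement();
             Statement write = dst.createStatement()) {
            List<String> tables = new ArrayList<>();
            try (ResultSet rs = read.executeQuery("SHOW TABLES IN `" + db + "`")) {
                while (rs.next()) tables.add(rs.getString(1));
            }
            write.execute("CREATE DATABASE IF NOT EXISTS `" + db + "`");
            for (String t : tables) {
                try (ResultSet rs = read.executeQuery(showCreate(db, t))) {
                    write.execute(collectDdl(rs));
                }
            }
        }
    }
}
```

Each statement returned by the source is executed directly against the target connection, which is exactly the transparent-restore behavior described above.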
We can run HiveServer2 as a normal Java dependency, without hive-site.xml; here is some example code. Even though HiveRunner is a test framework, in principle it just starts a HiveServer2 in memory and uses it for testing, so we can apply the same trick. As the code example shows, we also don't need HDFS or its related settings, since we only manipulate DDL, no DML.
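For reference, one minimal way to get an in-process Hive without any hive-site.xml is Hive's embedded JDBC mode: a URL with an empty authority (`jdbc:hive2://`) runs Hive inside the current JVM, backed by an embedded Derby metastore. This is a sketch under my assumptions — it needs `hive-jdbc` and its transitive Hive/Derby dependencies on the classpath, and it is not the exact code originally posted in this thread:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class EmbeddedHive {

    // Pure helper: an empty authority means embedded, in-process Hive.
    static String embeddedUrl() {
        return "jdbc:hive2://";
    }

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection c = DriverManager.getConnection(embeddedUrl());
             Statement s = c.createStatement()) {
            // Pure DDL, so no HDFS configuration is required.
            s.execute("CREATE TABLE IF NOT EXISTS t (id INT)");
            try (ResultSet rs = s.executeQuery("SHOW TABLES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}
```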
My concern is that the API changes. With the suggested approach, we specify the metastore information: the JDBC URL, the driver name, username and password, etc. With the existing approach, we only need a hive shell. The two approaches each have their own strengths, and I don't see one replacing the other perfectly. The metastore way removes the dependency on an external Hive installation (or cluster), which might be the fragile point in the whole system. The hive-shell way is simpler, and sometimes we don't want to specify the metastore information explicitly.
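One way to keep both options open is a small abstraction with one implementation per approach. This is purely illustrative — none of these names exist in hivedump, and each implementation is reduced to showing which command the approach would issue:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical: something that can produce the command used to
// extract a table's DDL, independent of the transport behind it.
interface MetadataSource {
    List<String> describeTableCommand(String db, String table);
}

// JDBC approach: needs explicit connection details (URL, user, password).
class JdbcSource implements MetadataSource {
    final String url, user, password;
    JdbcSource(String url, String user, String password) {
        this.url = url; this.user = user; this.password = password;
    }
    public List<String> describeTableCommand(String db, String table) {
        // In the real tool this would run over java.sql against `url`.
        return Arrays.asList("SHOW CREATE TABLE `" + db + "`.`" + table + "`");
    }
}

// Shell approach: zero explicit configuration, relies on a local hive binary.
class ShellSource implements MetadataSource {
    public List<String> describeTableCommand(String db, String table) {
        return Arrays.asList("hive", "-e",
                "SHOW CREATE TABLE `" + db + "`.`" + table + "`");
    }
}
```

The caller picks whichever source fits its environment, which keeps the simple hive-shell path available while still allowing the explicit JDBC configuration.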
I'm sorry @wangzhihao for the long delay; I've been busy earning a certification over the last few days. I'll evaluate your proposal as soon as possible.
Hi @wangzhihao, I've done some exploration of your idea and in the end I've opted for the JDBC way. Since I'm learning Scala, I've preferred it over Java. You can find the result here: https://github.com/StrumentiResistenti/bear. Let me know if you find it useful or if you need help building it. I've installed the Ubuntu packages for scala and sbt to build it; all the dependencies are downloaded by sbt during the compile phase.
Hi Tx0, the current solution needs an external environment that provides the hive command. It would be better to make the tool zero-dependency, just like HiveRunner, which runs a HiveServer2 in its own JVM and thus doesn't need any external Hive or Hadoop installation.
What do you think?