clemlabprojects / ambari

Fork of Apache Ambari maintained by Clemlab Company
https://www.clemlab.com
Apache License 2.0

[feat] Integrated MR3-HIVE #41

Open BsoBird opened 6 months ago

BsoBird commented 6 months ago

MR3-HIVE is a novel technology that can greatly improve the efficiency of users' Hive SQL execution without changing the way Apache Hive is used, with performance that can even exceed Trino's. We have used it in a large number of production environments with good results. Since many users are still struggling to find an efficient Hive engine, I think we should promote this technology to more users so that they can benefit from it. https://www.datamonad.com/

lucasbak commented 6 months ago

Hi @BsoBird,

Thanks for the suggestion. We are going to take a deeper look at this technology.

Best regards,

Clemlab Team

farshad-allahdadi commented 6 months ago

It seems that MR3 is not open source, so it may not be a good idea to use it.

BsoBird commented 6 months ago

@farshad-allahdadi Yes, it is a semi-open-source product: its core package, mr3-core, is not open source. But I don't think this matters, for three reasons:

  1. MR3 lets users use up to 5 TB of memory for free. This is sufficient for a large number of small and medium-sized users.
  2. A software project requires financial support if it is to be well maintained, and the author's approach of charging large customers a fee for services is reasonable, in my opinion.
  3. By using good tools, users can gain insights that serve as prototypes for creating better products. The tools themselves, in turn, need users to help improve them.

Ambari is a great platform for engaging users and promoting the latest technologies and ideas related to Big Data. I hope the Clemlab team can give these small teams more opportunities to serve customers, which would also reflect the Clemlab team's technological vision and professionalism in keeping up with the times. Thanks.

farshad-allahdadi commented 6 months ago

@BsoBird That was my personal view and concern: when I decide to use a piece of software, I need to be sure that I either have access to the source code to fix possible problems or have access to a minimum level of support from the owner of the software. In the case of MR3, how did you handle that without buying a license?

BsoBird commented 6 months ago

@farshad-allahdadi The MR3 code is currently divided into two parts. The first part is the Hive/Tez-related code that has been adapted to MR3, which is open source. When I ran into a problem there, I was able to fix it by porting a patch from the Hive/Tez community. The second part is the core code of MR3. This part is not open source at the moment, but the community responds with fixes for all issues that are not license-related. We've been using it for 3 years and have had good results. For now, I think both large users with paid versions and small to medium users with free versions can get their problems solved.

farshad-allahdadi commented 6 months ago

@BsoBird Thank you. Is it a better alternative to LLAP/Trino/Impala in terms of response time? I've read their articles about its performance, but I'm curious about your experience and use case. Did you use it for both ad-hoc/long-running (minutes to hours) queries and interactive (sub-second) queries? If not, what other components did you have to use alongside it (any of Spark/LLAP/Trino/Impala)? Also, did you set up your cluster using Ambari (ODP or Bigtop) or a tarball installation? I mean, is it straightforward to add MR3 to Hive in any type of installation?

BsoBird commented 6 months ago

-- Is it a better alternative to LLAP/Trino/Impala in case of response time?
Yes. Compared to LLAP, it offers the same performance but is easier to install and configure, and it provides better concurrency. Compared to Trino, Hive offers more fault tolerance, and it performs better on multi-table joins. Compared to Impala, Hive has a wider ecosystem.

-- Did you used it for both ad-hoc/long running (minutes to hours) queries and interactive queries (sub-seconds)
Yes. A single hive-llap/mr3 deployment handles both ad-hoc and batch ETL workloads, because it provides a resource isolation solution. This lets users significantly reduce the number of additional technology stacks they need to introduce.

-- Also did you already setup your cluster using Ambari (odp or bigtop) or used tarball installation, I mean is it straightforward to add MR3 to hive in any type of installation?
It can use the existing HMS, so we just need to deploy a HiveServer2 service.
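
As a rough illustration of reusing an existing HMS, the sketch below generates a hive-site.xml fragment that points a new HiveServer2 at a metastore that is already running. `hive.metastore.uris` is the standard Hive property for a remote metastore; the helper name and the host `metastore.example.com` are placeholders made up for this example, and an actual MR3 deployment will likely need further settings that are not shown here.

```python
import xml.etree.ElementTree as ET


def metastore_site_fragment(metastore_uri: str) -> str:
    """Build a hive-site.xml fragment that points HiveServer2 at an existing HMS.

    Hypothetical helper for illustration only; it is not part of Hive or MR3.
    """
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    # Standard Hive property naming the Thrift endpoint of a remote metastore.
    ET.SubElement(prop, "name").text = "hive.metastore.uris"
    ET.SubElement(prop, "value").text = metastore_uri
    return ET.tostring(configuration, encoding="unicode")


# Placeholder host; 9083 is the usual default metastore Thrift port.
fragment = metastore_site_fragment("thrift://metastore.example.com:9083")
print(fragment)
```

The generated property would be merged into the hive-site.xml used by the newly deployed HiveServer2, leaving the existing metastore untouched.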