apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0

Expose `org.apache.drill.test` artifact so that end-users can use `ClusterFixtureBuilder` to create embedded Drill applications #2446

Open GavinRay97 opened 2 years ago

GavinRay97 commented 2 years ago

Hello, I would like to embed Drill in a JVM application, running it as a single in-memory node. I will feed it the Calcite RelNode relational expressions that my application generates, for Drill to execute.

Browsing the code to try to find out how best to go about this, I found in ClusterFixtureBuilder.java:

(If this isn't the best/easiest way to embed a single Drill node please let me know and I will delete this issue 😅)

https://github.com/apache/drill/blob/2decae18b85eeda51816e92d5a9e9e6e2f9ce8d5/exec/java-exec/src/test/java/org/apache/drill/test/ClusterFixtureBuilder.java#L29-L43

https://github.com/apache/drill/blob/2decae18b85eeda51816e92d5a9e9e6e2f9ce8d5/exec/java-exec/src/test/java/org/apache/drill/test/ClusterFixtureBuilder.java#L279-L301

But it looks like there is no Maven artifact or .jar to download to include this functionality as an end user =/

I tried to copy-paste the primary classes, but there is a spiderweb of dependencies throughout the org.apache.drill.test and org.apache.drill.exec.testing packages.

jnturton commented 2 years ago

There was some discussion on the mailing list in November that might help you. Maybe you can collaborate with @rymarm (who sent the emails) on this...

EDIT: Oh dear, it looks like Pony Mail isn't very good at URLs. But try searching for "Start embedded Drill on JDBC connection" with a date range of "the last year" here:

https://lists.apache.org/list?dev@drill.apache.org

paul-rogers commented 2 years ago

Great idea. I wrote a lot of that code originally, so let me know if you have questions. The dependencies might be related to things like the test-only row set classes, integrations with JUnit for temporary directories and so on. It may be possible to split the class so that one class has only those bits and pieces needed for client apps, and a subclass adds the additional parts used by tests.
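The split suggested above could be sketched roughly as follows. This is only an illustration of the base-class/subclass shape, not Drill code: `EmbeddedBuilderSketch`, `TestBuilderSketch`, their methods, and the `example.option` key are all hypothetical names, and `describe()` stands in for the step where a real builder would start Drill.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical client-facing builder: only what an embedding application needs. */
class EmbeddedBuilderSketch {
    protected final Map<String, Object> configProps = new LinkedHashMap<>();
    protected int nodeCount = 1;

    /** Records a configuration property to apply at startup. */
    public EmbeddedBuilderSketch configProperty(String key, Object value) {
        configProps.put(key, value);
        return this;
    }

    /** Sets how many in-process nodes to start (at least one). */
    public EmbeddedBuilderSketch nodeCount(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("need at least one node");
        }
        nodeCount = n;
        return this;
    }

    /** Describes what would be started; a real builder would launch Drill here. */
    public String describe() {
        return "nodes=" + nodeCount + ", config=" + configProps;
    }
}

/** Hypothetical test-only subclass: layers test extras on top of the client builder. */
class TestBuilderSketch extends EmbeddedBuilderSketch {
    private boolean useTempDirs;

    /** Test-only concern, e.g. JUnit-managed temporary directories. */
    public TestBuilderSketch withTempDirs() {
        useTempDirs = true;
        return this;
    }

    @Override
    public String describe() {
        return super.describe() + ", tempDirs=" + useTempDirs;
    }
}
```

With this shape, a client artifact would only need to ship the base class, while the JUnit and row-set dependencies would stay with the subclass in the test module.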

rymarm commented 2 years ago

@GavinRay97 a few months ago I found that Drill can be run in "embedded mode" with a pretty simple configuration. To achieve this, you need to add the following dependencies to your project's pom.xml:

    <dependencies>
        <dependency>
            <groupId>org.apache.drill.exec</groupId>
            <artifactId>drill-java-exec</artifactId>
            <version>1.19.0</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>log4j-over-slf4j</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.drill.exec</groupId>
            <artifactId>drill-jdbc</artifactId>
            <version>1.19.0</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>log4j-over-slf4j</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>21.0</version>
        </dependency>
    </dependencies>

And after that, you will be able to run embedded Drill with the following example code:

    // Opening this connection runs embedded Drill and connects to it.
    // "jdbc:drill:zk=local" is the connection string that selects embedded mode.
    // try-with-resources closes the result set, statement and connection,
    // even if the query throws.
    try (Connection connection = DriverManager.getConnection("jdbc:drill:zk=local");
         Statement st = connection.createStatement();
         // `/home/maksym/Desktop/sample.csv` is the path to a CSV file created for this example
         ResultSet rs = st.executeQuery("select * from dfs.`/home/maksym/Desktop/sample.csv`")) {
      // Example of reading the results of a simple query: print the first column of each row
      while (rs.next()) {
        System.out.println(rs.getString(1));
      }
    }
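As a self-contained variant of the snippet above, the following sketch wraps it in a class and prints every column rather than only the first. It uses only standard `java.sql` calls and assumes the dependencies listed earlier are on the classpath; without them, `DriverManager.getConnection` throws an `SQLException` because no driver is registered for the URL. The class name `EmbeddedDrillSketch`, the helpers `selectAllFrom` and `runQuery`, and the default path `/tmp/sample.csv` are illustrative choices, not part of Drill.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

class EmbeddedDrillSketch {
    // Connection string from the example above: runs Drill in embedded mode.
    static final String EMBEDDED_URL = "jdbc:drill:zk=local";

    /** Hypothetical helper: builds a query over a local file through the dfs plugin. */
    static String selectAllFrom(String absolutePath) {
        // The path is wrapped in backticks, so a backtick inside it would break the query.
        if (absolutePath.contains("`")) {
            throw new IllegalArgumentException("backtick in path: " + absolutePath);
        }
        return "select * from dfs.`" + absolutePath + "`";
    }

    /** Runs a query and returns each row as a list of column values read as strings. */
    static List<List<String>> runQuery(String jdbcUrl, String sql) throws SQLException {
        List<List<String>> rows = new ArrayList<>();
        // try-with-resources closes the result set, statement and connection in that order.
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            int columns = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                List<String> row = new ArrayList<>();
                for (int i = 1; i <= columns; i++) { // JDBC columns are 1-based
                    row.add(rs.getString(i));
                }
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws SQLException {
        // Placeholder path; pass a real file as the first argument.
        String path = args.length > 0 ? args[0] : "/tmp/sample.csv";
        for (List<String> row : runQuery(EMBEDDED_URL, selectAllFrom(path))) {
            System.out.println(String.join(", ", row));
        }
    }
}
```

Passing the JDBC URL as a parameter keeps `runQuery` usable against a remote Drill cluster as well, since only the connection string differs.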

I didn't find exhaustive information on how exactly an application should be configured: what dependencies are required, what properties are available, and so on. But you can dive into the code and look at how embedded mode was implemented. Here is a starting point: https://github.com/apache/drill/blob/15b2f52260e4f0026f2dfafa23c5d32e0fb66502/exec/jdbc/src/main/java/org/apache/drill/jdbc/impl/DrillConnectionImpl.java#L104

Besides this, you can also find many Jira tickets about issues with embedded Drill. Here are several of them: DRILL-2126, DRILL-1654, DRILL-1409

According to my investigation of the code and my manual tests, embedded Drill seems to work pretty well, and the only issue is dependency conflicts. That is why, in my example above, I added guava and excluded the logging jars.

I would like to gather as much information as possible about embedded Drill and add it to the Drill documentation, or make some code improvements to let users freely use this mode in their applications. Of course, Drill was created as a distributed system, but it is such a powerful tool that it is also very useful in single-node, embedded mode.

paul-rogers commented 2 years ago

Thanks @rymarm for the info! This is one of those cases where a bug becomes a feature. The reason embedded Drill works via JDBC is that most of Drill ends up getting sucked into the JDBC driver for no good reason other than that the RPC code depends on everything else. That's lucky for you, but not so great for folks who just want a simple JDBC driver.

As it turns out, the reason that SqlLine can run an embedded Drill is that the JDBC driver contains all the code. But do we want a JDBC driver to include a Splunk connector, a PDF reader, support for Hadoop and all the rest? That creates a rather fat client, and all those libraries conflict with what the surrounding app wants to do. This is why the JDBC driver build chucks a bunch of dependencies overboard.

At some point (maybe Drill 2.0?) we need to create a simpler JDBC driver. At that point, the mechanism that @GavinRay97 originally requested will be needed to start the server that the JDBC driver then connects to. We're not there now (far from it), but that's kind of where we should head. (There is a whole vector discussion that includes this topic.)