Spark JDBC connection to Azure Delta Lake issue

CyloNox commented 2 years ago

Expected behavior

Connecting with the Spark JDBC driver and utilizing services to Azure delta lakes tables (by sql end-point or cluster)

Actual behavior

I'm able to connect to the delta lakes in Azure but when the queries are ran in WebAPI I get syntax errors.

This is what it looks on Atlas showing that we can connect:

In the logs however there are syntax errors with the Spark Driver when doing basic queries and unable to grab vocabulary or really do anything:

Steps to reproduce behavior

Use the Spark JDBC driver either by profile or connecting to a new data source.

Proposed Fix that I did:

Modified the pom.xml to include the databricks newest version:

    <!-- https://mvnrepository.com/artifact/com.databricks/databricks-jdbc -->
    <dependency>
      <groupId>com.databricks</groupId>
      <artifactId>databricks-jdbc</artifactId>
      <version>2.6.27</version>
    </dependency>

Then modify the webapi-spark profile to include the databricks version in pom.xml:

    <profile>
      <id>webapi-spark</id>
      <properties>
        <spark.enabled>true</spark.enabled>
        <!-- Spark JDBC driver path -->
        <spark.classpath>${basedir}/src/main/extras/spark</spark.classpath>
        <datasource.dialect>spark</datasource.dialect>
      </properties>
      <dependencies>
        <dependency>
          <groupId>com.databricks</groupId>
          <artifactId>databricks-jdbc</artifactId>
          <version>2.6.27</version>
        </dependency>
      </dependencies>
      <build>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-install-plugin</artifactId>
            <version>2.5.2</version>
            <executions>
              <execution>
                <id>spark-jdbc</id>
                <phase>initialize</phase>
                <goals>
                  <goal>install-file</goal>
                </goals>
                <configuration>
                  <groupId>com.databricks</groupId>
                  <artifactId>databricks-jdbc</artifactId>
                  <version>2.6.27</version>
                  <packaging>jar</packaging>
                  <file>${spark.classpath}/spark-2.6.27.jar</file>
                </configuration>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
    </profile>

and lastly include the driver information in the DataAccessConfig.java file as a possible driver on line 85: "com.databricks.client.jdbc.Driver"

Once that is done I get solid results when connecting to the data source (you may also modify the connecting to be "jdbc:databricks" instead of "jdbc:spark" when forming the connection string because it's the newer version.

One minor note to point out that while I was able to get this to work it only works through a cluster and not a sql end point. It still throws an error but I may have to dig a bit more as to why that is because they should act the same. Thank you

Quick Fix that can also be used without a code change

I just wanted to also include some knowledge that has been recently passed to me from another site that you can include the parameter "UseNativeQuery=1" in the database connection string that would prevent the syntax error.

alondhe commented 2 years ago

Hi @CyloNox -- I would suggest using the UseNativeQuery=1 flag in the JDBC string. That's what we use in our site, and it's upon this that we submitted the Spark changes to SqlRender/WebAPI.

alex-odysseus commented 1 year ago

Adam, please confirm if a suggestion from Ajit helped @CyloNox

CyloNox commented 1 year ago

I still couldn't get it to work right so I just downloaded the databricks JDBC and added into the code.

Zachary-Higgins commented 7 months ago

I know this issue is approaching it's 1 year anniversary. However, I ran into this tonight and adding the UseNativeQuery=1 works as when testing using our interactive & DBSQL Serverless clusters. Cleared the achilles_cache tables and tried multiple times with success against both clusters.

DBSQL channel: v 2023.50 DB Interactive Cluster runtime ver:13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12)

CyloNox commented 6 months ago

For those still wondering about this one, I suspect this recent change for 2.15 release will resolve this issue: https://github.com/OHDSI/WebAPI/commit/d90d653d1449affd8ef272141cf0af59e635b115

I haven't tested it but when it's official, I will.

OHDSI / WebAPI