OHDSI / WebAPI

OHDSI WebAPI contains all OHDSI services that can be called from OHDSI applications
Apache License 2.0
126 stars 156 forks source link

Spark JDBC connection to Azure Delta Lake issue #2055

Open CyloNox opened 1 year ago

CyloNox commented 1 year ago

Expected behavior

Connecting with the Spark JDBC driver and utilizing services to Azure delta lakes tables (by sql end-point or cluster)

Actual behavior

I'm able to connect to the delta lakes in Azure but when the queries are ran in WebAPI I get syntax errors.

This is what it looks on Atlas showing that we can connect: image

In the logs however there are syntax errors with the Spark Driver when doing basic queries and unable to grab vocabulary or really do anything: image

Steps to reproduce behavior

Use the Spark JDBC driver either by profile or connecting to a new data source.

Proposed Fix that I did:

Modified the pom.xml to include the databricks newest version:

    <!-- https://mvnrepository.com/artifact/com.databricks/databricks-jdbc -->
    <dependency>
      <groupId>com.databricks</groupId>
      <artifactId>databricks-jdbc</artifactId>
      <version>2.6.27</version>
    </dependency>

Then modify the webapi-spark profile to include the databricks version in pom.xml:

    <profile>
      <id>webapi-spark</id>
      <properties>
        <spark.enabled>true</spark.enabled>
        <!-- Spark JDBC driver path -->
        <spark.classpath>${basedir}/src/main/extras/spark</spark.classpath>
        <datasource.dialect>spark</datasource.dialect>
      </properties>
      <dependencies>
        <dependency>
          <groupId>com.databricks</groupId>
          <artifactId>databricks-jdbc</artifactId>
          <version>2.6.27</version>
        </dependency>
      </dependencies>
      <build>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-install-plugin</artifactId>
            <version>2.5.2</version>
            <executions>
              <execution>
                <id>spark-jdbc</id>
                <phase>initialize</phase>
                <goals>
                  <goal>install-file</goal>
                </goals>
                <configuration>
                  <groupId>com.databricks</groupId>
                  <artifactId>databricks-jdbc</artifactId>
                  <version>2.6.27</version>
                  <packaging>jar</packaging>
                  <file>${spark.classpath}/spark-2.6.27.jar</file>
                </configuration>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
    </profile>

and lastly include the driver information in the DataAccessConfig.java file as a possible driver on line 85: "com.databricks.client.jdbc.Driver"

Once that is done I get solid results when connecting to the data source (you may also modify the connecting to be "jdbc:databricks" instead of "jdbc:spark" when forming the connection string because it's the newer version. image

One minor note to point out that while I was able to get this to work it only works through a cluster and not a sql end point. It still throws an error but I may have to dig a bit more as to why that is because they should act the same. Thank you

Quick Fix that can also be used without a code change

I just wanted to also include some knowledge that has been recently passed to me from another site that you can include the parameter "UseNativeQuery=1" in the database connection string that would prevent the syntax error.

alondhe commented 1 year ago

Hi @CyloNox -- I would suggest using the UseNativeQuery=1 flag in the JDBC string. That's what we use in our site, and it's upon this that we submitted the Spark changes to SqlRender/WebAPI.

alex-odysseus commented 1 year ago

Adam, please confirm if a suggestion from Ajit helped @CyloNox

CyloNox commented 1 year ago

I still couldn't get it to work right so I just downloaded the databricks JDBC and added into the code.

Zachary-Higgins commented 3 months ago

I know this issue is approaching it's 1 year anniversary. However, I ran into this tonight and adding the UseNativeQuery=1 works as when testing using our interactive & DBSQL Serverless clusters. Cleared the achilles_cache tables and tried multiple times with success against both clusters.

DBSQL channel: v 2023.50 DB Interactive Cluster runtime ver:13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12)

CyloNox commented 1 month ago

For those still wondering about this one, I suspect this recent change for 2.15 release will resolve this issue: https://github.com/OHDSI/WebAPI/commit/d90d653d1449affd8ef272141cf0af59e635b115

I haven't tested it but when it's official, I will.