OHDSI / WebAPI

OHDSI WebAPI contains all OHDSI services that can be called from OHDSI applications
Apache License 2.0

Unclear spark/databricks support #2262

Closed guybartal closed 4 months ago

guybartal commented 1 year ago

Expected behavior

Spark drivers included in WebAPI

Actual behavior

We get an error on WebAPI application startup:

2023-04-30T08:00:58.824720185Z 2023-04-30 08:00:58.824 INFO main org.ohdsi.webapi.DataAccessConfig - [] - error loading com.simba.spark.jdbc.Driver

Steps to reproduce behavior

Start WebAPI application with ohdsi/webapi:2.12.1 docker image

Description

The steps to support a Spark source are unclear and undocumented. What should we do to use a Spark source without changing the Dockerfile?

The steps we took to load the Spark drivers were (see the shell sketch after the list):

  1. Download the Spark JDBC driver (version 2.6.22) from the Databricks website: https://www.databricks.com/spark/jdbc-drivers-archive
  2. Extract and rename the driver jar file to /src/main/extras/spark/spark-2.6.22.1040.jar
  3. Edit the Dockerfile:
# Download dependencies
COPY pom.xml /code/
# RUN mkdir .git \
#     && mvn package \
#      -P${MAVEN_PROFILE}
# Compile code and repackage it
COPY src /code/src
RUN mvn initialize \
    -Dgit.branch=${GIT_BRANCH} \
    -Dgit.commit.id.abbrev=${GIT_COMMIT_ID_ABBREV} \
    -P${MAVEN_PROFILE} \
    && mvn package \
    -Dgit.branch=${GIT_BRANCH} \
    -Dgit.commit.id.abbrev=${GIT_COMMIT_ID_ABBREV} \
    -P${MAVEN_PROFILE} \
    && mkdir war \
    && mv target/WebAPI.war war \
    && cd war \
    && jar -xf WebAPI.war \
    && rm WebAPI.war
  4. Build the Dockerfile with the following build args and Maven profiles:
docker build -t webapi --build-arg MAVEN_PROFILE=webapi-docker,webapi-spark --build-arg GIT_BRANCH=v2.12.1 .
  5. Run the custom image and check the startup logs: 2023-04-30T08:00:58.824720185Z 2023-04-30 08:00:58.824 INFO main org.ohdsi.webapi.DataAccessConfig - [] - driver loaded: com.simba.spark.jdbc.Driver
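
For reference, steps 1 and 2 above boil down to roughly the following shell commands (the archive and extracted jar names are illustrative; only the destination path and file name matter to the build):

# after downloading the driver archive (version 2.6.22) from
# https://www.databricks.com/spark/jdbc-drivers-archive
unzip SimbaSparkJDBC-2.6.22.zip -d spark-driver
mkdir -p src/main/extras/spark
# rename the extracted driver jar to the file name the webapi-spark profile expects
cp spark-driver/*.jar src/main/extras/spark/spark-2.6.22.1040.jar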
guybartal commented 1 year ago

@anton-abushkevich

konstjar commented 1 year ago

You may need to specify the exact path to the driver in pom.xml.
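
For example, the property can also be overridden on the mvn command line instead of being edited in pom.xml (the path here is illustrative):

mvn package -Dspark.classpath=/opt/jdbc/spark -Pwebapi-docker,webapi-spark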

guybartal commented 1 year ago

Thanks @konstjar but it's already set by default to:

<spark.classpath>${basedir}/src/main/extras/spark</spark.classpath>

And as I wrote in item 2 of the list above, we've extracted and copied the driver to that path.

But we are looking for a way to use the WebAPI image, or at least build the image, without having to fork the repository and change the source code. For example, we saw that the public image already supports Redshift, PostgreSQL and MS SQL Server.

konstjar commented 1 year ago

@guybartal I'm not sure this path is correct. We specify the exact path in our build.

Also, I see there are some copy instructions after the build; maybe those cause issues. In our case we build the .war file and run the application under a Tomcat server.

@leeevans Do you have an idea?

guybartal commented 1 year ago

@leeevans did you have time to look at this? We'd appreciate more clarity on this matter. Thanks!

anatbal commented 1 year ago

Hello @konstjar, @leeevans. It seems that some other folks have faced this issue in the past. As you can see here and here, the OHDSI WebAPI repo was forked, changes were made to the Dockerfile and the pom file, and the jar file was added. What we are looking for, especially since others have tried to tackle this before, is a way to build OHDSI WebAPI with Spark support "almost" out of the box, meaning passing a build argument to the Dockerfile. What are your thoughts? Are there any plans to make OHDSI WebAPI with Spark simpler to build, or is this something you are open for us to contribute?

konstjar commented 1 year ago

@anatbal As far as I know there are no plans to make "Spark simpler to build". Feel free to submit a PR.

anthonysena commented 5 months ago

Bumping this as I'm trying to build WebAPI with support for Databricks with @fdefalco. I ran into the same issues as @anatbal and @guybartal. I was hoping it would be as simple as using Maven to compile the project with the webapi-spark profile included in the command. Unfortunately, I've run into issues since that driver (2.6.22.1040) no longer appears to be on Maven Central. I've made some attempts to update the pom.xml to use the latest JDBC drivers, but it is still attempting to find the driver in

<spark.classpath>${basedir}/src/main/extras/spark</spark.classpath>

which is not currently a directory under ${basedir}.

Tagging @TomWhite-MedStar and @konstjar hoping they can share any insights on compiling/deploying WebAPI with Spark support.

konstjar commented 5 months ago

@anthonysena I do not think we do anything different.

We keep the Spark drivers in an external folder, which is why we specify the path using the spark.classpath parameter during the build process:

mvn --batch-mode initialize -Dspark.classpath=${SPARK_PATH} -Pwebapi-spark,webapi-snowflake
mvn --batch-mode package -e -X -Dspark.classpath=${SPARK_PATH} -Dmaven.test.skip -Pwebapi-spark,webapi-snowflake

konstjar commented 5 months ago

And you can grab the drivers manually here

chrisknoll commented 5 months ago

@konstjar , just a clarification on the Maven lifecycle (maybe this worked once but not anymore?)

From pom.xml:

    <profile>
      <id>webapi-spark</id>
      <properties>
        <spark.enabled>true</spark.enabled>
        <!-- Spark JDBC driver path -->
        <spark.classpath>${basedir}/src/main/extras/spark</spark.classpath>
      </properties>
      <dependencies>
        <dependency>
          <groupId>com.simba</groupId>
          <artifactId>spark</artifactId>
          <version>2.6.22.1040</version>
        </dependency>
      </dependencies>
      <build>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-install-plugin</artifactId>
            <version>3.1.0</version>
            <executions>
              <execution>
                <id>spark-jdbc</id>
                <phase>initialize</phase>
                <goals>
                  <goal>install-file</goal>
                </goals>
                <configuration>
                  <groupId>com.simba</groupId>
                  <artifactId>spark</artifactId>
                  <version>2.6.22.1040</version>
                  <packaging>jar</packaging>
                  <file>${spark.classpath}/spark-2.6.22.1040.jar</file>
                </configuration>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
    </profile>

This seems to say that there should be an action during the initialize phase to perform the install-file goal, so we'd expect the artifact to be installed any time the Maven toolchain is invoked in a way that includes the initialize phase (which is most of them; maybe not clean, but definitely package).

So do you know if there's something that used to work in the past (but no longer does), where you can't run the install-file goal in the same mvn command that consumes the installed dependency (i.e. where the drivers get installed into the m2 cache)? My theory is that we're getting the failure because all dependencies are resolved at mvn launch, and any dependencies installed during the build are not seen by that same build command. That would explain why it works for you: you do the install (via initialize) in your first command, and then run the package command in the next, after the dependencies have been installed into the Maven repo via install-file.
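
If that theory is right, the difference would look roughly like this (a sketch with an illustrative driver path, based on the commands earlier in the thread):

# single invocation: dependencies are resolved at launch, so the driver
# installed by the initialize phase is not visible to the package phase
mvn package -Dspark.classpath=/opt/jdbc/spark -Pwebapi-spark

# two invocations: the first installs the driver into the local m2 repo,
# the second resolves it from there
mvn initialize -Dspark.classpath=/opt/jdbc/spark -Pwebapi-spark
mvn package -Dspark.classpath=/opt/jdbc/spark -Pwebapi-spark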

konstjar commented 5 months ago

@chrisknoll Thank you for the comments.

We do CI/CD builds internally using pipelines, and every time they create the build environment from scratch. That's why I can confirm it's not a one-time success case.

Since we build the WebAPI with all available profiles, we know that the initialize command should be run in advance, so that most of the JDBC drivers are cached in the m2 repo before the package command. This approach has been in use for a long time already.

You can see in my original message that two commands are used, initialize and package, and I do not think this is a problem. The only difference is the path to the drivers and how it is passed to Maven.

chrisknoll commented 5 months ago

Thanks @konstjar. My question was whether the two separate commands are redundant, since the second command should already cover the initialize lifecycle phase (the package target should include initialize).

It's fine if it must be a two-phase build in Maven when you're using external JDBC drivers; I just don't think our documentation reflects that.

konstjar commented 5 months ago

Yes, I can confirm that 2 commands are required when additional JDBC profiles are in use.

Though my understanding of the lifecycle phases was the same; I thought it would be enough to run just package.

chrisknoll commented 5 months ago

I'm no expert on the Maven pipeline, but my guess is that when the command starts up it has a cache of all the installed dependencies, so anything that gets added to the m2 repo during the pipeline isn't known about until the process restarts, which is why we need two different invocations. Lame, but it would explain it.

anthonysena commented 5 months ago

I put together a PR that both updates the driver and removes the Maven install goal discussed here. I was able to deploy a version of WebAPI with a recent Databricks driver; here is the Maven command I used:

mvn clean package -s D:/apache-maven/apache-maven-3.9.6/conf/settings.xml -DskipTests=true -DskipUnitTests=true -DskipITtests=true -Pwebapi-postgresql-security-enabled,webapi-solr,webapi-spark

You will notice that all I did was add webapi-spark to the list of profiles, and that was enough to include the Databricks driver in WebAPI.war.

anthonysena commented 5 months ago

Update here: based on review of #2339, we've decided to add a new profile to WebAPI's pom.xml called 'webapi-databricks' to allow using the latest Databricks driver. This preserves backwards compatibility for those who are currently using the previous Spark driver and may have connection strings that start with jdbc:spark.... Changing that profile would then require changing the respective connection strings, etc.
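
For illustration, the difference shows up in the JDBC URL prefix (hosts and remaining parameters are placeholders here):

# previous Simba Spark driver (webapi-spark profile)
jdbc:spark://<workspace-host>:443;<other-parameters>
# latest Databricks JDBC driver (webapi-databricks profile)
jdbc:databricks://<workspace-host>:443;<other-parameters>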

Once #2339 is merged, the command to build WebAPI with the Databricks driver would look like this:

mvn clean package -s D:/apache-maven/apache-maven-3.9.6/conf/settings.xml -DskipTests=true -DskipUnitTests=true -DskipITtests=true -Pwebapi-postgresql-security-enabled,webapi-solr,webapi-databricks