Closed: guybartal closed this issue 4 months ago.
Thanks @konstjar but it's already set by default to:
<spark.classpath>${basedir}/src/main/extras/spark</spark.classpath>
And as I wrote above in item 2 of the list, we extracted and copied the driver to that path.
But we are looking for a way to use the WebAPI image, or at least build the image, without needing to fork the repository and change the source code. For example, we saw that the public image supports Redshift, PostgreSQL, and MS SQL Server.
@guybartal I'm not sure that path is correct; we specify an exact path in our build.
I also see there are some copy instructions after the build, which may cause issues. In our case we build a .war file and run the application under a Tomcat server.
@leeevans Do you have an idea?
@leeevans did you have time to look at this? We'd appreciate more clarity on this matter. Thanks!
Hello @konstjar , @leeevans. It seems like some other folks have faced this issue in the past. As you can see here and here, the OHDSI WebAPI repo was forked, changes were made to the Dockerfile and pom.xml, and the jar file was added. What we are looking for, especially since others have tried to tackle this in the past, is a way to build OHDSI WebAPI with Spark "almost" out of the box, i.e. by passing a build argument to the Dockerfile. What are your thoughts? Are there any plans to make building OHDSI WebAPI with Spark simpler, or is this something you are open to us contributing?
@anatbal As far as I know there are no plans to make the Spark build simpler. Feel free to submit a PR.
Bumping this as I'm trying to build WebAPI with support for Databricks with @fdefalco. I ran into the same issues as @anatbal and @guybartal. I was hoping it would be as simple as using Maven to compile the project with the webapi-spark profile included in the command. Unfortunately, I've run into issues since that driver (2.6.22.1040) no longer appears to be on Maven Central. I've made some attempts to update the pom.xml to use the latest JDBC drivers, but it is still attempting to find that file in
<spark.classpath>${basedir}/src/main/extras/spark</spark.classpath>
which is not currently a directory under ${basedir}.
Tagging @TomWhite-MedStar @konstjar hoping they can share any insights for compiling/deploying WebAPI w/ spark support.
@anthonysena I do not think we do anything different.
We keep the Spark drivers in an external folder, which is why we specify the path with the spark.classpath parameter during the build:
mvn --batch-mode initialize -Dspark.classpath=${SPARK_PATH} -Pwebapi-spark,webapi-snowflake
mvn --batch-mode package -e -X -Dspark.classpath=${SPARK_PATH} -Dmaven.test.skip -Pwebapi-spark,webapi-snowflake
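For anyone debugging this locally, a quick sanity check between the two commands is to confirm the driver actually landed in the local Maven repository. This is just a sketch: the coordinates below are taken from the webapi-spark profile, and ~/.m2/repository is assumed to be the default local repository path.

```shell
# Sketch: after the `initialize` step, check that the Simba Spark driver
# was installed into the local Maven repository before running `package`.
# Coordinates match the webapi-spark profile; ~/.m2 is the assumed repo path.
M2_REPO="${M2_REPO:-$HOME/.m2/repository}"
DRIVER="$M2_REPO/com/simba/spark/2.6.22.1040/spark-2.6.22.1040.jar"
if [ -f "$DRIVER" ]; then
  echo "driver installed: $DRIVER"
else
  echo "driver missing: re-run the initialize command"
fi
```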
@konstjar , just a clarification on the Maven lifecycle (maybe this worked once but not anymore?)
From pom.xml:
<profile>
<id>webapi-spark</id>
<properties>
<spark.enabled>true</spark.enabled>
<!-- Spark JDBC driver path -->
<spark.classpath>${basedir}/src/main/extras/spark</spark.classpath>
</properties>
<dependencies>
<dependency>
<groupId>com.simba</groupId>
<artifactId>spark</artifactId>
<version>2.6.22.1040</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-install-plugin</artifactId>
<version>3.1.0</version>
<executions>
<execution>
<id>spark-jdbc</id>
<phase>initialize</phase>
<goals>
<goal>install-file</goal>
</goals>
<configuration>
<groupId>com.simba</groupId>
<artifactId>spark</artifactId>
<version>2.6.22.1040</version>
<packaging>jar</packaging>
<file>${spark.classpath}/spark-2.6.22.1040.jar</file>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</profile>
This seems to indicate that there should be an action during the initialize phase to perform the install-file goal, so we'd expect the artifact to be installed any time the Maven toolchain is invoked with a lifecycle that includes the initialize phase (which is most of them; maybe not clean, but definitely package). So do you know if there's something that used to work in the past (but no longer does), where you can't run the install-file goal in the same mvn command that uses the installed dependency (i.e. the drivers get installed into the m2 cache)? My theory is that we're getting the failure because all dependencies are resolved at mvn launch, and any dependencies installed during the build will not be seen in the same build command. That would explain why it works for you: you do the install (via initialize) in your first command, and then run the package command next, after the dependencies have been installed into the Maven repository via install-file.
@chrisknoll Thank you for the comments.
We do CI/CD builds internally using pipelines, and every time they create the build environment from scratch. That's why I can confirm it's not a one-time success case.
Since we build the WebAPI with all available profiles, we know the initialize command should be run in advance, so that most of the JDBC drivers are cached in the m2 repository before the package command. This approach has been in use for a long time already.
You can see in the original message that two commands are used, initialize and package, and I do not think this is a problem.
The only difference is the path to drivers and how they are passed to Maven.
Thanks @konstjar . My question was whether the two separate commands are redundant, since the second command should already cover the initialize lifecycle phase, because the package target should include initialize.
It's fine if a two-phase Maven build is required when using external JDBC drivers; I just don't think our documentation reflects that.
Yes, I can confirm that 2 commands are required when additional JDBC profiles are in use.
Though my understanding of the lifecycle phases was the same: I thought it would be enough to run just package.
I'm no expert on the Maven pipeline, but my guess is that when the command starts up it has a cache of all the installed dependencies, so anything added to the m2 repository during the pipeline isn't known about until the process restarts, which is why we need two separate invocations. Lame, but it would explain it.
I put together a PR that both updates the driver and removes the Maven install goal discussed here. I was able to deploy a version of WebAPI with a recent Databricks driver, and here is the Maven command I used:
mvn clean package -s D:/apache-maven/apache-maven-3.9.6/conf/settings.xml -DskipTests=true -DskipUnitTests=true -DskipITtests=true -Pwebapi-postgresql-security-enabled,webapi-solr,webapi-spark
You will notice that all I did was add webapi-spark to the list of profiles, and that was enough to include the Databricks driver in WebAPI.war.
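A quick way to double-check the packaged artifact is to list the war contents and look for the driver jar. This is a sketch: target/WebAPI.war is the assumed Maven output path, and the jar-name patterns are guesses, so adjust both to your build.

```shell
# Sketch: verify a Spark/Databricks JDBC jar made it into the packaged war.
# target/WebAPI.war is the assumed default Maven output path for this project.
WAR="target/WebAPI.war"
if [ -f "$WAR" ]; then
  unzip -l "$WAR" | grep -i -E 'spark|databricks' \
    || echo "no spark/databricks jar found in $WAR"
else
  echo "war not built yet: $WAR"
fi
```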
Update here, based on review of #2339: we've decided to add a new profile to WebAPI's pom.xml called 'webapi-databricks' to allow using the latest Databricks driver. This preserves backwards compatibility for those currently using the previous Spark driver, who may have a connection string that starts with jdbc:spark.... Changing that profile would then require changing the respective connection strings, etc.
Once #2339 is merged, the command to build WebAPI w/ the DataBricks driver would look like this:
mvn clean package -s D:/apache-maven/apache-maven-3.9.6/conf/settings.xml -DskipTests=true -DskipUnitTests=true -DskipITtests=true -Pwebapi-postgresql-security-enabled,webapi-solr,webapi-databricks
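For illustration only, a webapi-databricks profile would presumably mirror the webapi-spark profile shown earlier, but with the newer driver pulled from Maven Central instead of a local install-file step. The groupId, artifactId, version, and property name below are my assumptions, not the exact contents of #2339:

```xml
<!-- Hypothetical sketch of a webapi-databricks profile; see #2339 for the
     actual change. Driver coordinates are assumed, not copied from the PR. -->
<profile>
  <id>webapi-databricks</id>
  <properties>
    <databricks.enabled>true</databricks.enabled>
  </properties>
  <dependencies>
    <dependency>
      <groupId>com.databricks</groupId>
      <artifactId>databricks-jdbc</artifactId>
      <version>2.6.36</version>
    </dependency>
  </dependencies>
</profile>
```

As noted above, switching from the old Simba driver to the Databricks driver also means the JDBC URL prefix changes (jdbc:spark versus jdbc:databricks), so existing connection strings would need to be updated accordingly.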
Expected behavior
Spark drivers included in WebAPI
Actual behavior
We get an error on WebAPI application startup:
Steps to reproduce behavior
Start WebAPI application with ohdsi/webapi:2.12.1 docker image
Description
The steps to support a Spark source are unclear and undocumented. What should we do to use a Spark source without changing the Dockerfile?
The steps we took to load the Spark drivers were:
/src/main/extras/spark/spark-2.6.22.1040.jar
2023-04-30T08:00:58.824720185Z 2023-04-30 08:00:58.824 INFO main org.ohdsi.webapi.DataAccessConfig - [] - driver loaded: com.simba.spark.jdbc.Driver