almond-sh / almond

A Scala kernel for Jupyter
https://almond.sh
BSD 3-Clause "New" or "Revised" License

Pre-fetching jars in docker environment fails to populate classpath #1265

Open jpolchlo opened 1 year ago

jpolchlo commented 1 year ago

I want to build a docker environment where I can pre-load the classpath with spark-sql and some other stuff to avoid boilerplate in my notebooks. So I built the following Dockerfile:

FROM almondsh/almond:0.14.0-RC12-scala-2.12.18

RUN coursier fetch org.apache.logging.log4j:log4j-core:2.17.0
RUN coursier fetch org.apache.logging.log4j:log4j-1.2-api:2.17.0
RUN coursier fetch org.apache.spark::spark-sql:3.1.2

However, when running a container built from this image, import org.apache.spark.sql._ yields an error:

cell1.sc:1: object apache is not a member of package org
import org.apache.spark.sql._
           ^
Compilation Failed

What step am I missing to get Almond to recognize the coursier-installed jars?

kiendang commented 1 year ago

Almond uses a separate directory for its cache. coursier fetch by default fetches artifacts to ~/.cache/coursier (on Linux). You can try to find where almond stores its cache. If I remember correctly it's ~/.cache/almond/coursier, in which case you can do coursier fetch --cache <almond-coursier-cache-dir> ....
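Applied to the Dockerfile above, the suggestion might look like the sketch below. The cache path is a guess based on the comment and should be verified inside the running image (e.g. by checking where an in-notebook import $ivy actually writes) before relying on it:

```dockerfile
FROM almondsh/almond:0.14.0-RC12-scala-2.12.18

# Hypothetical almond cache location -- confirm inside the image first
RUN coursier fetch --cache ~/.cache/almond/coursier org.apache.logging.log4j:log4j-core:2.17.0 && \
    coursier fetch --cache ~/.cache/almond/coursier org.apache.logging.log4j:log4j-1.2-api:2.17.0 && \
    coursier fetch --cache ~/.cache/almond/coursier org.apache.spark::spark-sql:3.1.2
```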

jpolchlo commented 1 year ago

That doesn't appear to be the case. Both methods (import from notebook and coursier fetch) place the jar files in the ~/.cache/coursier tree. However, there is a file ~/.cache/almond/ammonite/history that appears to track the notebook imports. The contents after executing

import $ivy.`org.apache.logging.log4j:log4j-core:2.17.0`

are

[
    "import $ivy.`org.apache.logging.log4j:log4j-core:2.17.0`"
]

I'm thinking that the way to pre-load is to provide a notebook with the desired imports and run it through jupyter during the docker build. There appears to be some amount of state created by in-notebook imports that coursier fetch is not replicating.

Edit: I've been able to preload the container with jars by running jupyter execute ... on a notebook containing import $ivy... directives. It appears that the import statements in the notebook are still required to register the imported modules in the current context, but the jar files are now present, and it's no longer necessary to wait for the Maven downloads.
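For reference, this workaround could be sketched roughly as follows. The notebook name, its location, and its contents are illustrative, and the exact jupyter execute invocation may need adjusting for the kernel installed in the image:

```dockerfile
FROM almondsh/almond:0.14.0-RC12-scala-2.12.18

# preload.ipynb (hypothetical) contains cells such as:
#   import $ivy.`org.apache.spark::spark-sql:3.1.2`
COPY preload.ipynb /tmp/preload.ipynb

# Executing the notebook once at build time downloads the jars,
# so later notebook sessions start with a warm cache
RUN jupyter execute /tmp/preload.ipynb
```

Note that the import $ivy lines still need to appear in each notebook to bring the modules into scope; this only avoids the download wait.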

coreyoconnor commented 5 months ago

hmm I did not observe this with the docker image I'm using. However, I'm using

ENV COURSIER_CACHE=/usr/share/coursier/cache

in the Dockerfile. Does that also affect the coursier cache used by the notebook session?

https://github.com/coreyoconnor/nix_configs/blob/dev/modules/ufo-k8s/almond-2/Dockerfile

coreyoconnor commented 4 months ago

After further testing: yes, setting ENV COURSIER_CACHE pre-populates the cache as expected.
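Putting this finding together with the original Dockerfile gives a sketch like the one below. The cache path is arbitrary; since COURSIER_CACHE is set via ENV, both the build-time fetches and the notebook session resolve against the same directory, which the notebook user must be able to read:

```dockerfile
FROM almondsh/almond:0.14.0-RC12-scala-2.12.18

# Shared cache for build-time fetches and notebook sessions
ENV COURSIER_CACHE=/usr/share/coursier/cache

RUN coursier fetch org.apache.logging.log4j:log4j-core:2.17.0 && \
    coursier fetch org.apache.logging.log4j:log4j-1.2-api:2.17.0 && \
    coursier fetch org.apache.spark::spark-sql:3.1.2
```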