databrickslabs / mosaic

An extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets.
https://databrickslabs.github.io/mosaic/

Circular dependency for installing GDAL using mosaic.setup_gdal() #524

Open smartkiwi opened 7 months ago

smartkiwi commented 7 months ago

The GDAL installation helper is not usable as part of the mosaic library. Currently the mosaic.setup_gdal() helper requires GDAL to already be installed, which makes it difficult for users to use.

Versions:

The "Install GDAL" documentation doesn't work because of this: https://github.com/databrickslabs/mosaic/blob/main/docs/source/usage/install-gdal.rst

To Reproduce
Steps to reproduce the behavior: running %pip install databricks-mosaic in a Databricks notebook on vanilla DBR 13.3 fails with an error that GDAL is not found.

Expected behavior
Documentation and tooling should be improved to allow users to install GDAL first without requiring them to install mosaic. Alternatively, there should be some way to install the mosaic library without its GDAL dependencies, so that users can use the mosaic.setup_gdal function.

mjohns-databricks commented 7 months ago

These are the instructions: https://databrickslabs.github.io/mosaic/usage/install-gdal.html. It is not circular.

smartkiwi commented 7 months ago

Maybe the subject line wasn't the best; I've updated it.

Let me clarify the problem I face with DBR 13.3. Currently https://databrickslabs.github.io/mosaic/usage/install-gdal.html describes the following steps to install GDAL on the worker nodes.

But the instructions do not include details on how to install mosaic on the driver node.

On DBR 13.3 (which has no GDAL library) the user cannot install mosaic, and thus cannot run mos.setup_gdal():

```python
import mosaic as mos

mos.enable_mosaic(spark, dbutils)
mos.setup_gdal()
```
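Something like the following driver-side pre-step is what the docs seem to be missing, assuming the driver only needs GDAL's system libraries so that pip can build mosaic's GDAL-dependent wheels (a sketch; the package list is illustrative):

```sh
%sh
# Illustrative pre-step: install GDAL system packages on the driver
# before running `%pip install databricks-mosaic`.
sudo apt update
sudo apt install -y libgdal-dev
```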
arr175 commented 2 weeks ago

@smartkiwi did you ever figure this out? I'm stuck at the same location.

mjohns-databricks commented 2 weeks ago

Again, not a circular dependency. The following is what the docs are conveying:

1. `%pip install databricks-mosaic`

2. (1x setup)

   ```python
   import mosaic as mos

   mos.enable_mosaic(spark, dbutils)
   mos.setup_gdal()
   ```

3. Add the generated init script path to your cluster and restart your cluster (see the path note below).

4. (after restart)

   ```python
   import mosaic as mos

   mos.enable_mosaic(spark, dbutils)
   mos.enable_gdal(spark)
   ```
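To make step 3 concrete: per the docstring below, the script is written to `{to_fuse_dir}/{script_out_name}`, so under the defaults the generated init script would land at (illustrative, derived from those defaults):

```python
# Cluster init script path under the default setup_gdal() arguments (illustrative).
init_script_path = "/Workspace/Shared/geospatial/mosaic/gdal/jammy/0.4.2/mosaic-gdal-init.sh"
```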

---

Providing the signature of `setup_gdal` from [gdal.py](https://github.com/databrickslabs/mosaic/blob/main/python/mosaic/api/gdal.py) to further demystify:

```python
def setup_gdal(
    to_fuse_dir: str = "/Workspace/Shared/geospatial/mosaic/gdal/jammy/0.4.2",
    script_out_name: str = "mosaic-gdal-init.sh",
    jni_so_copy: bool = False,
    test_mode: bool = False,
) -> bool:
    """
    Prepare GDAL init script and shared objects required for GDAL to run on spark.
    This function will generate the init script that will install GDAL on each
    worker node. After setup_gdal is run, the init script must be added to the
    cluster; also, a cluster restart is required.

    Notes:
      (a) This is close in behavior to Mosaic < 0.4 series (prior to DBR 13),
          now using jammy default (3.4.1)
      (b) `to_fuse_dir` can be one of `/Volumes/..`, `/Workspace/..`, `/dbfs/..`;
          however, you should use `setup_fuse_install()` for Volume based installs

    Parameters
    ----------
    to_fuse_dir : str
        Path to write out the init script for GDAL installation;
        default is '/Workspace/Shared/geospatial/mosaic/gdal/jammy/0.4.2'.
    script_out_name : str
        Name of the script to be written;
        default is 'mosaic-gdal-init.sh'.
    jni_so_copy : bool
        If True, copy shared object to fuse dir and config script to use;
        default is False.
    test_mode : bool
        Only for unit tests.

    Returns
    -------
    True unless resources fail to download.
    """
```
mjohns-databricks commented 2 weeks ago

If you are running on a "Single Node" spark instance (vs a cluster) and do not want to set up an init script, then just manually run the contents of the generated script, from here, in a cell in your notebook, e.g. something like the following (you are root when running in the notebook, so no sudo):

```sh
%sh
apt-add-repository -y "deb http://archive.ubuntu.com/ubuntu $(lsb_release -sc)-backports main universe multiverse restricted"
apt-add-repository -y "deb http://archive.ubuntu.com/ubuntu $(lsb_release -sc)-updates main universe multiverse restricted"
apt-add-repository -y "deb http://archive.ubuntu.com/ubuntu $(lsb_release -sc)-security main multiverse restricted universe"
apt-add-repository -y "deb http://archive.ubuntu.com/ubuntu $(lsb_release -sc) main multiverse restricted universe"
apt-get update -y

apt-get -o DPkg::Lock::Timeout=-1 install -y unixodbc libcurl3-gnutls libsnappy-dev libopenjp2-7
apt-get -o DPkg::Lock::Timeout=-1 install -y gdal-bin libgdal-dev python3-numpy python3-gdal

pip install --upgrade pip
pip install gdal==3.4.1

GITHUB_REPO_PATH=databrickslabs/mosaic/main/resources/gdal/jammy
wget -nv -P /usr/lib -nc https://raw.githubusercontent.com/$GITHUB_REPO_PATH/libgdalalljni.so
wget -nv -P /usr/lib -nc https://raw.githubusercontent.com/$GITHUB_REPO_PATH/libgdalalljni.so.30
wget -nv -P /usr/lib -nc https://raw.githubusercontent.com/$GITHUB_REPO_PATH/libgdalalljni.so.30.0.3
```

Then no cluster restart is needed before mos.enable_gdal(spark) ...
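For a quick sanity check after running the cell above (assuming the gdal==3.4.1 pin from the script):

```python
# Verify the GDAL Python bindings installed by the script are importable.
from osgeo import gdal

print(gdal.__version__)  # expected: 3.4.1, per the pip pin above
```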

arr175 commented 1 week ago

Hi Michael,

Thanks for sharing this so quickly. The issue that I'm having is on the first step: %pip install databricks-mosaic does not install mosaic due to missing GDAL. This is on a new DBR 13.3 shared compute. If there's a specific setting that I need to request from our IT team, please let me know. See below for the error I'm getting on the first step.

```
Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting databricks-mosaic
  Downloading databricks_mosaic-0.4.2-py3-none-any.whl.metadata (828 bytes)
Collecting geopandas<0.14.4,>=0.14 (from databricks-mosaic)
  Downloading geopandas-0.14.3-py3-none-any.whl.metadata (1.5 kB)
Collecting h3<4.0,>=3.7 (from databricks-mosaic)
  Downloading h3-3.7.7-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.metadata (4.9 kB)
Requirement already satisfied: ipython>=7.22.0 in /databricks/python3/lib/python3.10/site-packages (from databricks-mosaic) (8.10.0)
Collecting keplergl==0.3.2 (from databricks-mosaic)
  Downloading keplergl-0.3.2.tar.gz (9.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.7/9.7 MB 74.0 MB/s eta 0:00:00
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting pyspark<3.5,>=3.4 (from databricks-mosaic)
  Downloading pyspark-3.4.3.tar.gz (311.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 311.4/311.4 MB 49.2 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Requirement already satisfied: ipywidgets<8,>=7.0.0 in /databricks/python3/lib/python3.10/site-packages (from keplergl==0.3.2->databricks-mosaic) (7.7.2)
Collecting traittypes>=0.2.1 (from keplergl==0.3.2->databricks-mosaic)
  Downloading traittypes-0.2.1-py2.py3-none-any.whl.metadata (1.0 kB)
Requirement already satisfied: pandas>=0.23.0 in /databricks/python3/lib/python3.10/site-packages (from keplergl==0.3.2->databricks-mosaic) (1.4.4)
Collecting Shapely>=1.6.4.post2 (from keplergl==0.3.2->databricks-mosaic)
  Downloading shapely-2.0.6-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.metadata (7.0 kB)
Collecting fiona>=1.8.21 (from geopandas<0.14.4,>=0.14->databricks-mosaic)
  Downloading fiona-1.9.6.tar.gz (411 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'error'
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [3 lines of output]
      <string>:86: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
      WARNING:root:Failed to get options via gdal-config: [Errno 2] No such file or directory: 'gdal-config'
      CRITICAL:root:A GDAL API version must be specified. Provide a path to gdal-config using a GDAL_CONFIG environment variable or use a GDAL_VERSION environment variable.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
```
mjohns-databricks commented 1 week ago

Referencing the Installation Guide, please run on an Assigned cluster and see if that clears up your issue. Also, refer to the pending 0.4.3 release (PR #568) for any additional Python library version pinning that might now be required on DBR 13.3 (notably, we are going to identify a version range for numpy, as numpy 2.0 is no longer compatible with the installed scikit-learn version).
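Until 0.4.3 lands, one possible interim workaround, assuming the numpy 2.0 incompatibility described above, is to pin numpy yourself when installing (illustrative, not an official fix):

```
%pip install "numpy<2.0" databricks-mosaic
```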

arr175 commented 22 hours ago

@mjohns-databricks thanks for recommending running on an Assigned Cluster. Initially it didn't work either, but after running

```sh
%sh
sudo apt update
sudo apt install -y cmake libgdal-dev
```

we were able to run %pip install databricks-mosaic followed by mos.setup_gdal(). Now we're up and running on DBR 13.3.

Thanks again.