Code repository for the PLOS ONE article ‘Evaluating spatially enabled machine learning approaches for depth to bedrock mapping’. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0296881
The depth to bedrock prediction model for Alberta uses an R project workflow based on the tidyverse and tidymodels suite of packages. The workflow is re-executed via the following steps:
Data preparation (downloading publicly available remote sensing datasets and terrain analysis using the Rsagacmd package).
Water well litholog augmentation by classifying lithological descriptions into surficial and bedrock units based on a statistical natural language processing approach and text pattern matching.
Model evaluation and selection based on cross validation and quality of the predicted DTB maps in several physiographically varying sub-regions.
Final DTB prediction.
This R project uses a reproducible environment based on the renv
package. This
environment uses R package versions that were installed under Ubuntu
22.04 and is not guaranteed to cleanly install on other operating
systems, particularly Microsoft Windows. To install the required
packages, run the following code in the R console. The packages and
versions that will be installed are specified in the renv.lock
file.
renv::restore()
This project has an external dependency - SAGA-GIS (>= 7.3) where the
saga_cmd
binary needs to be available in the system path. The
Rsagacmd package package
is used to run the SAGA-GIS algorithms from R. SAGA-GIS can be installed
in Ubuntu 22.04 using the following commands:
sudo apt-get install saga
For Windows users, the installation has to be performed manually by downloading the SAGA-GIS binary from Sourceforge and then adding the path to the SAGA installation directory to PATH. A more complete install of QGIS including SAGA will also get you the saga_cmd.exe executable. For example:/ C:\Program Files\QGIS 3.28.12\apps\saga\saga_cmd.exe.
In addition, the project automatically downloads the required remote
sensing datasets. To download the MODIS data, a free NASA Earthdata
login is required. The login credentials need to be set in a .Renviron
file (a plain text file with no extension) in the project root directory
as ‘EARTHDATA_USER’=
Sys.setenv("EARTHDATA_USER" = "username")
Sys.setenv("EARTHDATA_KEY" = "password")
For the statistical natural language model prediction, XGBoost using GPU
is used. This requires a CUDA (Compute Unified Device Architecture)
enabled GPU and the CUDA toolkit to be installed. Alternatively, the
model can be trained on CPU by changing tree_method
= ‘hist’ in the
02-nlp.qmd
script, but this will result in a significant increase in
training time.
The folder structure is organized as follows:
/
: R scripts used to prepare data and run the model.R
: Functions used by the scripts.projdata
: Data required by the model and visualizations.outputs
: Model results (created by the scripts).models
: Trained models (created by the scripts).data/raw
: Raw data downloaded from remote sources (created by the
scripts).data/processed
: Processed data used in the model (created by the
scripts).The data includes the following:
Functions used by the scripts are stored in the R
folder. The scripts
are organized into the following files:
01-grids.qmd
: Quarto notebook to download and prepare remote sensing
data and terrain analysis.02-nlp.qmd
: Quarto notebook to prepare lithological descriptions for
machine learning.03-training-data.R
: Prepare the training dataset that is used in the
depth to bedrock machine learning model, i.e., prepare the feature
matrix.04-experiments.R
: Model evaluation and selection.05-idw.R
: IDW interpolation of bedrock elevation.06-kriging.R
: Kriging interpolation of DTB.07-provincial-model.R
: Final DTB prediction model.08-analysis-results.qmd
: Generate figures and tables that are used
in the published paper.Once these scripts are run, additional directories and model outputs will be created within the project directory. The final directory structure will look like:
fs::dir_tree(recurse = 2)
#> .
#> ├── LICENSE
#> ├── 01-grids.qmd
#> ├── 02-nlp.qmd
#> ├── 03-training-data.R
#> ├── 04-experiments.R
#> ├── 05-idw.R
#> ├── 06-kriging.R
#> ├── 07-provincial-model.R
#> ├── 08-analysis-results.qmd
#> ├── R
#> │ ├── dtb.R
#> │ ├── nlp.R
#> │ ├── plots.R
#> │ ├── predictors.R
#> │ └── resampling.R
#> ├── plos-one.csl
#> ├── plos2015.bst
#> ├── zotero.bib
#> ├── README.Rmd
#> ├── README.md
#> ├── _dependencies.R
#> ├── data
#> │ ├── processed
#> │ │ ├── picks-nlp.rds
#> │ │ ├── predictors.tif
#> │ │ └── training-data.rds
#> │ └── raw
#> │ ├── alos-dem.tif
#> │ └── modis.tif
#> ├── depth-to-bedrock-plos-one.Rproj
#> ├── models
#> │ ├── cross-validation-dtb-prov.rds
#> │ ├── cross-validation-idw-prov.rds
#> │ ├── cross-validation-kriging-prov.rds
#> │ ├── experiments-cross-validation.rds
#> │ ├── experiments-dtb.rds
#> │ ├── experiments-importances.rds
#> │ ├── experiments-models.rds
#> │ ├── model-dtb-prov.rds
#> │ ├── model-nlp.rds
#> │ └── resamples-nlp.rds
#> ├── outputs
#> │ ├── dtb-rf-prov-pred-int.tif
#> │ ├── dtb-rf-prov.tif
#> │ ├── idw-prov.tif
#> │ ├── kriging-prov.tif
#> │ ├── picked-combined.gpkg
#> │ ├── picks-nlp-cv.csv
#> │ ├── picks-nlp.csv
#> │ └── picks.csv
#> ├── projdata
#> │ ├── cross-section-line-rainbow-lake.geojson
#> │ ├── cross-section-line-wcab.geojson
#> │ ├── lithologs.rds
#> │ ├── physio-pettapiece.gpkg
#> │ └── picks.rds
#> ├── renv
#> │ ├── activate.R
#> │ ├── library
#> │ │ └── R-4.3
#> │ ├── settings.json
#> │ └── staging
#> └── renv.lock