ashokkrish / spatialEpisim

spatialEpisim: Spatial Tracking of Infectious Disease Epidemics using Mathematical Models
GNU General Public License v3.0
8 stars 5 forks source link

Refactor the SVEIRD w/ Bayesian data assimilation functions πŸ—πŸ₯”πŸ₯”πŸ₯”πŸ½οΈ #36

Closed bryce-carson closed 1 month ago

bryce-carson commented 1 month ago

Given the SVEIRD simulation, with its Bayesian data analysis, is the meat and potatoes of this application, its code needs to be readable, maintainable, and performant.

During the above process, the function must be continually validated to produce the same scientific results. A testing framework and unit tests would be ideal, given the wide range of possible inputs, but I'm unsure I have time to integrate that into the development workflow.

Delayed or moved to other issues

bryce-carson commented 1 month ago

I'm working on validating the refactoring of this file.

bryce-carson commented 1 month ago

Part of the refactoring has stalled at resolving a semantic error I introduced when I optimized the wtd_nbrs_sum function. To ease discussion, I won't refer to this function by its old name, given I renamed its identifier: transmissionLikelihoodWeightings. This function returns a matrix.

@ashokkrish, please provide your input on the following two functions and their documentation.

library(tidyverse)

##' @title Average Euclidean Distance
##' @description Measure the exposure influence upon susceptible individuals at
##'   spatial position (x, y) of the infectious individuals at a spatial
##'   position (u, v).
##' @details The mathematical modelling of human or non-human mobility patterns
##'   is non-trivial. This function is a limited implementation of the effective
##'   area for only human mobility, with distances travelled per day (Ξ»)
##'   measured in kilometers.
##'
##'   NOTE: Our weight function does not take the same arguments as shown in the
##'   slideshow: (𝑀 x y u v); The term β€œeffective area” comes from the fact that
##'   if the kernel is constant on a finite disk (and zero outside it), then the
##'   formula due to Bolker (1999) gives the area of the disk.
##'
##'   See the article titled *A clarification of transmission terms in
##'   host-microparasite models by Begon et al.* (2002).
##'
##'   The raster data of infection counts or disease incidence provided to the
##'   function which calls this one, transmissionLikelihoodWeightings, may be
##'   aggregated by a given factor. That factor must be passed to this function
##'   for parity, so the data is treated the same.
##' @param radius TODO: a fixed radius r > lambda; see details.
##' @param lambda movemenet distance (in kilometers) per day; see details.
##' @param aggregationFactor the degree of aggregation applied to the raster
##'   data mentioned in the function details.
##' @returns a matrix of the average Euclidean distances
##' @author Bryce Carson
##' @author Thomas White
avgEuclideanDistance <- function(radius, lambda, aggregationFactor) {
  ## NOTE: β€œI have a philosophical and geometric question about these
  ## identities: what do they imply about the dimension and magnitude of the
  ## raster which the weighted number sum will be used with? Can we rid the
  ## function of the radius argument and only accept the lambda argument, given
  ## these identities and the claim made about the calculation of radius in the
  ## slides?” β€” Bryce
  if (radius > lambda)
    stopifnot(r == round((lambda - aggregationFactor) / aggregationFactor) + 1)
  else
    stopifnot(r == lambda + aggregationFactor)

  len <- seq_len(1 + radius * 2)
  df <- tidyr::expand(tibble(i = len, j = len), i, j)
  avg.euc.dist <- function(i, j) {
    exp(-sqrt(sum((c(i, j) - c(radius + 1, radius + 1))^2)) / lambda)
  }
  mutate(df, averageEuclideanDistance = map2_dbl(i, j, avg.euc.dist)) %>%
    as.matrix(byrow = TRUE, ncol = nrow(df))
}

##' @title Weighted Sums
##' @description Calculate a matrix of weights respecting human mobility
##'   patterns.
##' @details The pattern of human mobility used is described in a slideshow
##'   here:
##'   https://docs.google.com/presentation/d/1_gqcEh4d8yRy22tCZkU0MbGYSsGu3Djh/edit?usp=sharing&ouid=102231457806738400087&rtpof=true&sd=true.
##' @param infections a matrix of the count of infections per aggregate area in
##'   a raster of terrestrial data.
##' @param radius a constant; see details.
##' @param lambda movemenet distance (in kilometers) per day; see details.
##' @returns a matrix of weightings for the calculation of the proportion of
##'   exposed individuals who will become infectious.
##' @author Bryce Carson
##' @author Thomas White
##' @examples
##' terra::as.matrix(Infected, wide = TRUE) %>%
##'   transmissionLikelihoodWeightings(30, 15, 10)
transmissionLikelihoodWeightings <-
  function(infections, radius, lambda, aggregationFactor) {
    inputMatrixRows <- nrow(infections)
    inputMatrixCols <- ncol(infections)
    diameter <- 2 * radius
    temp.1 <- matrix(data = 0, nrow = inputMatrixRows, ncol = radius)
    temp.2 <- matrix(data = 0, nrow = radius, ncol = diameter + inputMatrixCols)
    input.modified <- rbind(temp.2, cbind(temp.1, infections, temp.1), temp.2)

    print(dim(input.modified))

    weights <- avgEuclideanDistance(radius, lambda)
    list(i = seq_len(length.out = inputMatrixRows),
         j = seq_len(length.out = inputMatrixCols)) %>%
      expand.grid() %>%
      as_tibble() %>%
      mutate(weightedSum =
               map2_dbl(i, j, function(i, j) {
                 neighbours <- input.modified[i:(i + diameter), j:(j + diameter)]

                 ## MAYBE FIXME: are we certain we don't want matrix
                 ## multiplication? %*%
                 ## FIXME: non-conformable arrays
                 ## Caused by error in `map2_dbl()`:
                 ##                    β„Ή In index: 1.
                 ## Caused by error in `subset * weights`:
                 ##                    ! non-conformable arrays
                 ## matrix(1, 3, 3) * matrix(2, 9, 3)
                 ## Error in matrix(1, 3, 3) * matrix(2, 9, 3) : non-conformable arrays
                 products <- neighbours * weights
                 sum(products)
               })) %>%
      as.matrix(byrow = TRUE, nrow = inputMatrixRows, ncol = inputMatrixCols)
  }

This is an important function, as it represents a large portion of the clock time when a simulation runs.

bryce-carson commented 1 month ago

Before

[bryce@fedora]~/src/r/spatialEpisim/R% scc --no-cocomo ~/src/r/spatialEpisim.stable/R/rasterSimulation_DA.R            
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
R                            1       935      179       452      304         17
───────────────────────────────────────────────────────────────────────────────
Total                        1       935      179       452      304         17
───────────────────────────────────────────────────────────────────────────────
Processed 37494 bytes, 0.037 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────

After

[bryce@fedora]~/src/r/spatialEpisim/R% scc --no-cocomo sveird.R
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
R                            1       353       54        97      202         10
───────────────────────────────────────────────────────────────────────────────
Total                        1       353       54        97      202         10
───────────────────────────────────────────────────────────────────────────────
Processed 14295 bytes, 0.014 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
bryce-carson commented 1 month ago

Part of the refactoring has stalled at resolving a semantic error I introduced when I optimized the wtd_nbrs_sum function. To ease discussion, I won't refer to this function by its old name, given I renamed its identifier: transmissionLikelihoodWeightings. This function returns a matrix.

@ashokkrish, please provide your input on the following two functions and their documentation.

library(tidyverse)

##' @title Average Euclidean Distance
##' @description Measure the exposure influence upon susceptible individuals at
##'   spatial position (x, y) of the infectious individuals at a spatial
##'   position (u, v).
##' @details The mathematical modelling of human or non-human mobility patterns
##'   is non-trivial. This function is a limited implementation of the effective
##'   area for only human mobility, with distances travelled per day (Ξ»)
##'   measured in kilometers.
##'
##'   NOTE: Our weight function does not take the same arguments as shown in the
##'   slideshow: (𝑀 x y u v); The term β€œeffective area” comes from the fact that
##'   if the kernel is constant on a finite disk (and zero outside it), then the
##'   formula due to Bolker (1999) gives the area of the disk.
##'
##'   See the article titled *A clarification of transmission terms in
##'   host-microparasite models by Begon et al.* (2002).
##'
##'   The raster data of infection counts or disease incidence provided to the
##'   function which calls this one, transmissionLikelihoodWeightings, may be
##'   aggregated by a given factor. That factor must be passed to this function
##'   for parity, so the data is treated the same.
##' @param radius TODO: a fixed radius r > lambda; see details.
##' @param lambda movemenet distance (in kilometers) per day; see details.
##' @param aggregationFactor the degree of aggregation applied to the raster
##'   data mentioned in the function details.
##' @returns a matrix of the average Euclidean distances
##' @author Bryce Carson
##' @author Thomas White
avgEuclideanDistance <- function(radius, lambda, aggregationFactor) {
  ## NOTE: β€œI have a philosophical and geometric question about these
  ## identities: what do they imply about the dimension and magnitude of the
  ## raster which the weighted number sum will be used with? Can we rid the
  ## function of the radius argument and only accept the lambda argument, given
  ## these identities and the claim made about the calculation of radius in the
  ## slides?” β€” Bryce
  if (radius > lambda)
    stopifnot(r == round((lambda - aggregationFactor) / aggregationFactor) + 1)
  else
    stopifnot(r == lambda + aggregationFactor)

  len <- seq_len(1 + radius * 2)
  df <- tidyr::expand(tibble(i = len, j = len), i, j)
  avg.euc.dist <- function(i, j) {
    exp(-sqrt(sum((c(i, j) - c(radius + 1, radius + 1))^2)) / lambda)
  }
  mutate(df, averageEuclideanDistance = map2_dbl(i, j, avg.euc.dist)) %>%
    as.matrix(byrow = TRUE, ncol = nrow(df))
}

##' @title Weighted Sums
##' @description Calculate a matrix of weights respecting human mobility
##'   patterns.
##' @details The pattern of human mobility used is described in a slideshow
##'   here:
##'   https://docs.google.com/presentation/d/1_gqcEh4d8yRy22tCZkU0MbGYSsGu3Djh/edit?usp=sharing&ouid=102231457806738400087&rtpof=true&sd=true.
##' @param infections a matrix of the count of infections per aggregate area in
##'   a raster of terrestrial data.
##' @param radius a constant; see details.
##' @param lambda movemenet distance (in kilometers) per day; see details.
##' @returns a matrix of weightings for the calculation of the proportion of
##'   exposed individuals who will become infectious.
##' @author Bryce Carson
##' @author Thomas White
##' @examples
##' terra::as.matrix(Infected, wide = TRUE) %>%
##'   transmissionLikelihoodWeightings(30, 15, 10)
transmissionLikelihoodWeightings <-
  function(infections, radius, lambda, aggregationFactor) {
    inputMatrixRows <- nrow(infections)
    inputMatrixCols <- ncol(infections)
    diameter <- 2 * radius
    temp.1 <- matrix(data = 0, nrow = inputMatrixRows, ncol = radius)
    temp.2 <- matrix(data = 0, nrow = radius, ncol = diameter + inputMatrixCols)
    input.modified <- rbind(temp.2, cbind(temp.1, infections, temp.1), temp.2)

    print(dim(input.modified))

    weights <- avgEuclideanDistance(radius, lambda)
    list(i = seq_len(length.out = inputMatrixRows),
         j = seq_len(length.out = inputMatrixCols)) %>%
      expand.grid() %>%
      as_tibble() %>%
      mutate(weightedSum =
               map2_dbl(i, j, function(i, j) {
                 neighbours <- input.modified[i:(i + diameter), j:(j + diameter)]

                 ## MAYBE FIXME: are we certain we don't want matrix
                 ## multiplication? %*%
                 ## FIXME: non-conformable arrays
                 ## Caused by error in `map2_dbl()`:
                 ##                    β„Ή In index: 1.
                 ## Caused by error in `subset * weights`:
                 ##                    ! non-conformable arrays
                 ## matrix(1, 3, 3) * matrix(2, 9, 3)
                 ## Error in matrix(1, 3, 3) * matrix(2, 9, 3) : non-conformable arrays
                 products <- neighbours * weights
                 sum(products)
               })) %>%
      as.matrix(byrow = TRUE, nrow = inputMatrixRows, ncol = inputMatrixCols)
  }

This is an important function, as it represents a large portion of the clock time when a simulation runs.

@ashokkrish, I have resolved the issue with this entirely. The code is much simpler now, because I'm using terra::focal and not handling the iteration myself now.

bryce-carson commented 1 month ago

At the moment the following error occurs during the call of the linearInterpolationOperator function, but only when called by [the meat and potatoes simulation function](https://www.merriam-webster.com/dictionary/meat-and-potatoes) (SVEIRD.BayesianDataAssimilation).

Error in eval(substitute(expr), data, enclos = parent.frame()) : 
  numeric 'envir' arg not of length one
In addition: Warning messages:
1: In linearInterpolationOperator(layers, healthZoneCoordinates, compartmentsReported) :
  Duplicate cell indices in cells vector derived from health zone coordinates.
2: In linearInterpolationOperator(layers, healthZoneCoordinates, compartmentsReported) :
  Ignoring NAs generated from healthZoneCoordinates during linearInterpolationOperator generation.

Things I've done to resolve it:

bryce-carson commented 1 month ago
Error in eval(substitute(expr), data, enclos = parent.frame()) : 
  numeric 'envir' arg not of length one
In addition: Warning messages:
1: In linearInterpolationOperator(layers, healthZoneCoordinates, compartmentsReported) :
  Duplicate cell indices in cells vector derived from health zone coordinates.
2: In linearInterpolationOperator(layers, healthZoneCoordinates, compartmentsReported) :
  Ignoring NAs generated from healthZoneCoordinates during linearInterpolationOperator generation.

Resolved

bryce-carson commented 1 month ago

I've completed a refactoring of the function. The entire function could run without error a couple commits ago; I am now resolving bugs which cause the function to produce wildly incorrect results.