Write helpers to categorize data

maurolepore commented 6 years ago

From @maurolepore on August 31, 2017 16:19

Gabriel proposed to develop a friendly way to categorize (cut) numeric variables. Important ones include:

from dbh to size category.
from gx and gy to quadrat: see add_quad().

suppressPackageStartupMessages(library(fgeo))
x <- tibble::tibble(gx = c(0, 50, 999.9, 1000), gy = gx/2)
add_quad(x)
#> Gessing: plotdim = c(1000, 500)
#>   * If guess is wrong, provide the correct argument `plotdim`
#> # A tibble: 4 x 3
#>      gx    gy quad 
#>   <dbl> <dbl> <chr>
#> 1    0     0  0101 
#> 2   50    25  0302 
#> 3 1000.  500. 5025 
#> 4 1000   500  NANA
add_quad(x, start = 0)
#> Gessing: plotdim = c(1000, 500)
#>   * If guess is wrong, provide the correct argument `plotdim`
#> # A tibble: 4 x 3
#>      gx    gy quad 
#>   <dbl> <dbl> <chr>
#> 1    0     0  0000 
#> 2   50    25  0201 
#> 3 1000.  500. 4924 
#> 4 1000   500  NANA

Created on 2018-07-02 by the reprex package (v0.2.0).

Copied from original issue: forestgeo/fgeo.abundance#46

maurolepore commented 6 years ago

All that follow is by @gabrielareto (via @).

we need, in general, code to go from quantitative variables to qualitative variables, to allow flexible splitting of the data. This is code to go from dbh to size class.

from_dbh_to_dbh_class <- function(dbh, k, breaks = NULL, ...)
{
  if(is.null(breaks)) breaks <- exp(c(seq(from = log(10), to = 7, length.out = k)))
  cut(dbh, breaks = breaks, ...)
}

it assumes minimum dbh of 10 (mm) and that everything above 1 m (~ exp(7) mm) is the largest size class. It assumes size classes evenly distributed on the log scale. I think it makes sense for trees, in our plots. In some plots with small trees, maybe exp(6) is a more appropriate lower bound for the largest size class. An extra line could merge two size classes if the largest size class is "too empty".

we need something similar to calculate the quadrat for every individual, at any grid size. This code calculates a boolean matrix of individuals by quadrats. It's vectorized, but maybe that matrix is too big for a full dataset and slow because of memory issues. Think about it.

gridsize = 20
g <- expand.grid(seq(0, 1000 - gridsize, by = gridsize), seq(0, 500 - gridsize, by = gridsize))

boo1 <- outer(gx, g[,1], '>=')
boo2 <- outer(gx, g[,1] + gridsize, '<')
boo3 <- outer(gy, g[,2], '>=')
boo4 <- outer(gy, g[,2] + gridsize, '<')
boo <- boo1 & boo2 & boo3 & boo4

the final function should return quadrat codes with consistent names. I suggest x0_y0 format, using coordinates in meters of the bottomleft corner of each quadrat, using 4 digits always because some plots have >1km but all have <10km length. A utility function refill_zeros <- function(x, nchar = 4) should be useful.

it may return also the lx, ly coordinates, lx = gx %% gridsize.

these quantity-to-category functions could either return the categories, or append new columns to the full dataset, returning a larger full table. In any case, all of them should behave similarly.

best regards,

Gabriel

maurolepore commented 6 years ago

From @gabrielareto on August 31, 2017 21:5

vectorized function to add zeros to each position in character vectors

add_zeroes <- add_zeros <- function(x, min.n.char = 4) { n.0s.missing <- pmax(0, min.n.char - nchar(x)) paste0(strrep("0", n.0s.missing), x) }

note it doesn't work with a matrix, sanity checks required

This is already implemented in `stringr::str_pad()`

add_zeroes <- add_zeros <- function(x, min.n.char = 4) {
  n.0s.missing <- pmax(0, min.n.char - nchar(x))
  paste0(base::strrep("0", n.0s.missing), x)
}
add_zeroes(1)
#> [1] "0001"

stringr::str_pad(1, width = 4, pad = 0)
#> [1] "0001"

stringr::str_pad(1, width = 4, pad = "k")
#> [1] "kkk1"

Created on 2018-12-30 by the reprex package (v0.2.1)

maurolepore commented 6 years ago

From @gabrielareto on August 31, 2017 21:46

individuals_into_quadrats <- function(x, gridsize = 20, xmax, ymax)
{
  # Some grid sizes are not common multiples of plot dimensions
  bad <- which(!(xmax %% gridsize == 0 & ymax %% gridsize == 0))
  if(length(bad) > 0) {
    warning(paste0("These grid sizes won't be used:", paste(gridsize[bad], collapse = ",")))
    gridsize <- gridsize[-bad]
  }

  # Make many grids and calculate quadrat codes in left_bott format
  q <- sapply(gridsize, function(gs){
    g <- expand.grid(seq(0, xmax - gs, by = gs), seq(0, ymax - gs, by = gs))
    boo1 <- outer(x$gx, g[,1], '>=')
    boo2 <- outer(x$gx, g[,1] + gs, '<')
    boo3 <- outer(x$gy, g[,2], '>=')
    boo4 <- outer(x$gy, g[,2] + gs, '<')
    boo <- boo1 & boo2 & boo3 & boo4
    w <- apply(boo, 1, which)
    Q1 <- add_zeros(g[w, 1], min.n.char = 4)
    Q2 <- add_zeros(g[w, 2], min.n.char = 4)
    paste0(Q1, "_", Q2)
  })
  colnames(q) <- paste0("Q", gridsize)

  # Calculate local coordinates for each grid --- (yes, this could/should be a different function)
  lx <- outer(x$gx, gridsize, '%%')
  ly <- outer(x$gy, gridsize, '%%')
  colnames(lx) <- paste0(colnames(q), "_lx")
  colnames(ly) <- paste0(colnames(q), "_ly")
  local.coor <- round(cbind(lx, ly), 1)
  # (sort columns in x,y alternance)
  local.coor <- local.coor[,c(seq(1, ncol(local.coor), by = 2), seq(2, ncol(local.coor), by = 2))]

  # Join and return
  data.frame(q, local.coor)
}

qq <- individuals_into_quadrats(bci12t3mini, gridsize = c(5, 10, 20, 100, 3.3832893274), xmax = 1000, ymax = 500) Warning message: In individuals_into_quadrats(bci12t3mini, gridsize = c(5, 10, 20, : These grid sizes won't be used:3.3832893274 head(qq) Q5 Q10 Q20 Q100 Q5_lx Q20_lx Q5_ly Q20_ly Q10_lx Q100_lx Q10_ly Q100_ly 1 0895_0040 0890_0040 0880_0040 0800_0000 3.7 18.7 2.0 2.0 8.7 98.7 2.0 42.0 2 0870_0280 0870_0280 0860_0280 0800_0200 3.2 13.2 4.2 4.2 3.2 73.2 4.2 84.2 3 0865_0160 0860_0160 0860_0160 0800_0100 1.6 6.6 3.4 3.4 6.6 66.6 3.4 63.4 4 0740_0300 0740_0300 0740_0300 0700_0300 4.0 4.0 4.9 4.9 4.0 44.0 4.9 4.9 5 0720_0445 0720_0440 0720_0440 0700_0400 3.8 3.8 1.9 6.9 3.8 23.8 6.9 46.9 6 0615_0435 0610_0430 0600_0420 0600_0400 1.6 16.6 0.8 15.8 6.6 16.6 5.8 35.8

maurolepore commented 6 years ago

From @gabrielareto on August 31, 2017 22:0

we need to decide if we want functions that take any input and return any output, or functions that take the table as input and an expanded table as output. The first is more general (anyone could use for their datasets), the second includes a step that we think necessary in our workflow.

maybe arguments should be passed as

f(census.data = NULL, gx = NULL, gy = NULL, etc)

if the census.data is the input, the output could be the expanded census.data. If not, just return what is new.

Example with the dbh to size class function:

from_dbh_to_dbh_class <- function(census.data = NULL, dbh = NULL, k, breaks = NULL, ...)
{
  if(!is.null(census.data)) dbh <- census.data$dbh
  if(is.null(breaks)) breaks <- exp(c(seq(from = log(10), to = 7, length.out = k)))
  out <- cut(dbh, breaks = breaks, ...)
  if(!is.null(census.data)) out <- data.frame(census.data, out)
  out
}

That's an important trade off. Whenever possible, I build general functions. But often times the interface is simplest if the input is ForestGEO-like data. The two kind of functions are tagged with a #' @family. That way it's relatively easy to find general or specific functions.

maurolepore commented 6 years ago

From @gabrielareto on September 12, 2017 21:46

This function reads a vector of quadrat ID's in the x0_y0 format (explained above) and returns the "traditional" Q20 code in 'column row' format. This should be useful only to explore data in other formats, from some habitat maps, etc, but the forestr should not rely on this naming convention.

It allows any separation character, and declare what is the first column or row (0 or 1).

from_Q20_x0y0_format_to_colrow_format <- function(x, first = 0, sep = "")
{
    first = as.numeric(first)
    if(length(first) != 1) stop("'first' must be of length one")
    if(!first %in% c(0, 1)) warning("It is very weird that the first column or row is different than 00 or 01...")

    col.row.codes <- sapply(strsplit(as.character(x), split = "_"), function(y) {
        out = NA
        y <- as.numeric(y)
        if(all(y %% 20 == 0))
        {
            y <- y/20
            if(first == 1) y <- y + 1
            y <- add_zeros(y, min.n.char = 2)
            out <- paste(y, collapse = sep)
        }
        return(out)
    })
    if(all(is.na(col.row.codes))) stop("The bottom-left corner coordinates are not multiple of 20 m")
    return(col.row.codes)
}

Something similar is already implemented

# Similar
library(fgeo)
#> -- Attaching packages ------------------------------------------------------------------- fgeo 0.0.0.9002 --
#> v fgeo.x       0.0.0.9000     v fgeo.analyze 0.0.0.9003
#> v fgeo.plot    0.0.0.9402     v fgeo.tool    0.0.0.9005
#> -- Conflicts --------------------------------------------------------------------------- fgeo_conflicts() --
#> x fgeo.tool::filter() masks stats::filter()

census <- tibble(
  gx = c(0, 20, 40), 
  gy = c(0, 20, 40)
)

add_col_row(census, gridsize = 20, plotdim = c(100, 100))
#> # A tibble: 3 x 4
#>      gx    gy col   row  
#>   <dbl> <dbl> <chr> <chr>
#> 1     0     0 01    01   
#> 2    20    20 02    02   
#> 3    40    40 03    03

Created on 2018-12-30 by the reprex package (v0.2.1)

forestgeo / fgeo.tool