Conte-Ecology / conteStreamTemperature

Package for cleaning and analyzing stream daily stream temperature
MIT License
1 stars 1 forks source link

Add Derived Metric: Consecutive Days Exceedance #29

Closed djhocking closed 9 years ago

djhocking commented 9 years ago

Another thing we talked about in the Maine group meeting was to add another set of derived metrics that represented runs of consecutive days where the stream temperature exceeded some threshold. Maybe something like mean and max number of consecutive days per year where temp > 22 deg C, since that's closely tied to fish stress.

djhocking commented 9 years ago

I can't figure out how to do this without resorting to a for-loop and I don't want to be looping through billions of records in R, even in parallel. I could do it by created lagged columns but would be limited to 5-10 lagged columns before it became too unwieldy. Depending on what they want, this might be a job for Rcpp unless someone has a better idea.

walkerjeffd commented 9 years ago

Look into run length encoding with the rle() function. It basically takes a numeric vector and returns the sequence and length of each "run" where a run is a consecutive sequence of identical values. So you could do something like:

run_22 <- (temp > 22)
rle_22 <- rle(run_22)
lengths_22 <- rle_22$lengths[rle_22$values]

where lengths_22 will be a vector of the run lengths where rle_22==TRUE, which in turn are the lengths of consecutive days where temp > 22. So max(lengths_22) would be the length of longest consecutive run of temp > 22.

bletcher commented 9 years ago

That's excellent. Should be easy to embed into dplyr.

On Tue, Jun 9, 2015 at 3:25 PM, Jeff Walker notifications@github.com wrote:

Look into run length encoding with the rle() function. It basically takes a numeric vector and returns the sequence and length of each "run" where a run is a consecutive sequence of identical values. So you could do something like:

run_22 <- (temp > 22)rle_22 <- rle(run_22)lengths_22 <- rle_22$lengths[rle_22$values]

where lengths_22 will be a vector of the run lengths where rle_22==TRUE, which in turn are the lengths of consecutive days where temp > 22. So max(lengths_22) would be the length of longest consecutive run of temp > 22.

— Reply to this email directly or view it on GitHub https://github.com/Conte-Ecology/conteStreamTemperature/issues/29#issuecomment-110475258 .

Silvio O. Conte Anadromous Fish Research Center, U.S. Geological Survey P.O. Box 796 -- One Migratory Way Turners Falls, MA 01376 (413) 863-3803 Cell: (413) 522-9417 FAX (413) 863-9810

ben_letcher@usgs.gov bletcher@eco.umass.edu http://www.lsc.usgs.gov/?q=cafb-research

djhocking commented 9 years ago

Thanks Jeff. I had never heard of run length encoding before. Definitely what I'm looking for. I'm still trying to figure out how to do it by featureid and year. I have a grouped tbl by featureid and year but it doesn't fit nicely into a dplyr structure. I'll have to give it more thought to keep the featureid and years associated with the rle outputs.

djhocking commented 9 years ago

Tried a couple things:

consecExceed <- byfeatureidYear %>%
  dplyr::mutate(month = as.numeric(format(date, "%m"))) %>%
  dplyr::filter(month >= 6 & month <= 8) %>%
  dplyr::mutate(exceed = (tempPredicted > temp.threshold),
                consecExceed = rep(rle(exceed)$lengths, rle(exceed)$lengths))
# doesn't work: invalid subscript type 'closure'

so tried outside dplyr

consecExceed <- byfeatureidYear %>%
  dplyr::mutate(month = as.numeric(format(date, "%m"))) %>%
  dplyr::filter(month >= 6 & month <= 8) %>%
  dplyr::mutate(exceed = (tempPredicted > temp.threshold))

consecExceed$consecExceed <- rep(rle(consecExceed$exceed)$lengths, rle(consecExceed$exceed)$lengths)

It "works" but doesn't separate by featureid and year. I think a potential solution is to try adding ifelse statements based on lagged featureid and year being equal to the current featureid and year combo (equivalent to 1if(id_year[i] == id_year[i-1])`.

djhocking commented 9 years ago

After some searching based on the error I found a simple solution (http://stackoverflow.com/questions/23820491/dplyr-error-object-not-found-using-rle-in-mutate). It seems like a small bug in dplyr that is corrected in the development version but there's a current workaround. Basically convert the dataframe/tbl from tbl_df to tbl_dt and the mutate function works.

consecExceed <- tbl_dt(byfeatureidYear) %>%
  dplyr::group_by(featureid, year) %>%
  dplyr::mutate(month = as.numeric(format(date, "%m"))) %>%
  dplyr::filter(month >= 6 & month <= 8) %>%
  dplyr::mutate(exceed = (tempPredicted > temp.threshold),
                consecExceed = rep(rle(exceed)$lengths, rle(exceed)$lengths))
walkerjeffd commented 9 years ago

I was thinking something along the lines of this:

x <- data.frame(featureid=rep(c("A", "B"), each=90),
                year=rep(rep(c(2010, 2011, 2012), each=30), times=2),
                x=runif(180))

get_run <- function(y, thresh=0.5) {
  runs_ <- y$x > thresh
  rle_ <- rle(runs_)
  lens_ <- rle_$lengths[rle_$values]
  d <- data.frame(max_run=max(lens_), mean_run=mean(lens_))
  names(d) <- paste(names(d), '_', thresh, sep='')
  d
}

# get max/mean run of each featureid, year
group_by(x, featureid, year) %>%
  do(get_run(., thresh=0.6))

Note I'm using a bunch of random uniform values, so threshold is between 0 and 1, but same idea.

With your consecExceed example, wouldn't the consecExceed column contain runs of each length where the value in each run is also the length (e.g. [3, 3, 3, 4, 4, 4, 4, 2, 2, ...])? Then if you compute the mean of that you're not actually computing the mean length as each length is weighted by, well, its length.

In other words if the lengths of three runs are 3, 4, 5, then mean(c(3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5)) != 4. For max run length though, it doesn't matter.

Also, quick tip: use month = lubridate::month(date) instead of as.numeric(format(date, "%m")) and save your sanity

djhocking commented 9 years ago

Oh man, I was going to have to do some clunky like filter to event_start then do the mean. This is much nicer. I've never used the do() function before.

I just started using lubridate and it's really nice. I haven't gone back to the legacy code to update it (the start of this function was copied from another function).

djhocking commented 9 years ago

Strange, it works on your test set but I'm having trouble with the real data.

calcConsecExceed <- function(y, threshold = 22) {
  exceed <- y$tempPredicted > threshold
  consec <- rle(exceed)
  consecExceed <- consec$lengths[consec$values]
  d <- data.frame(maxConsec = max(consecExceed, na.rm = T), meanConsec = mean(consecExceed, na.rm = T))
  names(d) <- paste(names(d), '_', threshold, sep='')
  d
}

# get max/mean run of each featureid, year
test <- group_by(tempDataSyncS, featureid, year) %>%
  do(calcConsecExceed(., threshold = 22))

I get the following:

|====================================================|100% ~0 s remaining     
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In max(consecExceed, na.rm = T) :
  no non-missing arguments to max; returning -Inf
2: In max(consecExceed, na.rm = T) :
  no non-missing arguments to max; returning -Inf
3: In max(consecExceed, na.rm = T) :
  no non-missing arguments to max; returning -Inf

> dim(test)
[1] 1872    4
> dim(tempDataSyncS)
[1] 228287    125

> head(test)
Source: local data frame [6 x 4]
Groups: featureid, year

  featureid year maxConsec_22 meanConsec_22
1    730482 2009            4             2
2    730531 2004         -Inf           NaN
3    730675 2009         -Inf           NaN
4    730912 2009         -Inf           NaN
5    730916 2009         -Inf           NaN
6    731705 2004         -Inf           NaN
> tail(test)
Source: local data frame [6 x 4]
Groups: featureid, year

  featureid year maxConsec_22 meanConsec_22
1    892941 2012            5      1.620690
2    892941 2013            5      1.242424
3    894592 2011            6      1.590909
4    894592 2012            6      1.666667
5    894592 2013            3      1.444444
6    895539 1999            3      1.222222
walkerjeffd commented 9 years ago

it's probably because some of the featureid/years are all NA. you'll probably have to modify the function to check for that, something like this:

calcConsecExceed <- function(y, threshold = 22) {
  if (all(is.na(y$tempPredicted))) {
    maxConsec <- NA
    meanConsec <- NA
  } else {
    exceed <- y$tempPredicted > threshold
    consec <- rle(exceed)
    consecExceed <- consec$lengths[consec$values]
    maxConsec <- max(consecExceed, na.rm = T)
    meanConsec <- mean(consecExceed, na.rm = T)
  }
  d <- data.frame(maxConsec=maxConsec, meanConsec=meanConsec)
  names(d) <- paste(names(d), '_', threshold, sep='')
  d
}

this would also be a good opportunity to write some unit tests. make some dummy runs and make sure it returns the values you think it should, and also handles NA values correctly.

djhocking commented 9 years ago

NA were my first guess as well but I check and in this data frame I didn’t have any NA. It took a while but I realized it’s trying to take means and maxes when all or none of the values in the group are above the threshold, hence producing -Inf in some cases and NaN in others and “correct" answers to many. Without a better understanding of the inner workings of these functions, I’m still not quite sure why these were produced but this definitely fixed the error.

I spent a good amount of time this morning reading about testing and the testthat package. I'd been wanting to do it so it's good that this was the incentive I needed. It's quite hard to write tests. At first I was trying to make the much too complicated but I eventually came up with some simple ones and put them in R scripts in the /inst/tests/ folder in the conteStreamTemperature package. Here's the resulting function:

calcConsecExceed <- function(y, threshold = 22) {
  if(all(is.na(y$tempPredicted))) {
    warning("all values are NA")
    maxConsec <- NA
    meanConsec <- NA
  } else if(all(y$tempPredicted <= threshold)) {
      maxConsec <- 0
      meanConsec <- 0
    } else {
      exceed <- y$tempPredicted > threshold
      consec <- rle(exceed)
      consecExceed <- consec$lengths[consec$values]
      maxConsec = max(consecExceed, na.rm = T)
      meanConsec = mean(consecExceed, na.rm = T)
    }
        d <- data.frame(maxConsec, meanConsec)
      names(d) <- paste(names(d), '_', threshold, sep='')
      return(d)
  }

I don't love the nested ifelse but it didn't seem appropriate for switch so I went with it. Here are a few tests I came up with, although there are many more I could do (single NA, class type, sort order, etc.):

# test that returns a 0 if all values <= threshold
test_that("a zero is returned if all values below threshold") {
  df <- data.frame(tempPredicted = runif(10, 2, 20))
  consecExceed <- calcConsecExceed(df, 22)
  expect_equal(consecExceed$maxConsec_22, 0)
  expect_equal(consecExceed$meanConsec_22, 0)
}

# test if all values > threshold
test_that("a zero is returned if all values below threshold") {
  df <- data.frame(tempPredicted = runif(10, 23, 30))
  consecExceed <- calcConsecExceed(df, 22)
  expect_equal(consecExceed$maxConsec_22, length(df$tempPredicted))
  expect_equal(consecExceed$meanConsec_22, length(df$tempPredicted))
}

# test if all values are NA
test_that("NA is returned when all values are NA") {
  df <- data.frame(tempPredicted = rep(NA, times = 10))
  consecExceed <- calcConsecExceed(df, 22)
  expect_true(is.na(consecExceed$maxConsec_22))
  expect_true(is.na(consecExceed$meanConsec_22))
  expect_warning(calcConsecExceed(df, 22))
}
walkerjeffd commented 9 years ago

Cool, that makes sense, I guess you can't know how long a run is if you don't really know when it started or ended.

Unit tests look good, you might add one to make sure its computing the mean and max correctly, like:

test_that("returns correct mean and max consecutive lengths") {
  df <- data.frame(tempPredicted = c(10, 10, 25, 25, 25, 10, 10, 25, 25, 10, 25))
  consecExceed <- calcConsecExceed(df, 22)
  expect_equal(consecExceed$maxConsec_22, 3)
  expect_equal(consecExceed$meanConsec_22, 2)
}
djhocking commented 9 years ago

Good idea. Thanks.