Closed: djhocking closed this issue 9 years ago
I can't figure out how to do this without resorting to a for-loop, and I don't want to be looping through billions of records in R, even in parallel. I could do it by creating lagged columns, but I'd be limited to 5-10 lagged columns before it became too unwieldy. Depending on what they want, this might be a job for Rcpp unless someone has a better idea.
Look into run length encoding with the rle() function. It takes a vector and returns the value and length of each "run", where a run is a consecutive sequence of identical values. So you could do something like:

run_22 <- (temp > 22)
rle_22 <- rle(run_22)
lengths_22 <- rle_22$lengths[rle_22$values]

where lengths_22 will be a vector of the run lengths where rle_22$values == TRUE, which in turn are the lengths of consecutive days where temp > 22. So max(lengths_22) would be the length of the longest consecutive run of temp > 22.
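To make the mechanics concrete, here's a small worked example (the temp values are made up for illustration):

```r
# Toy temperature series (made-up values for illustration)
temp <- c(20, 23, 24, 21, 25, 26, 25, 20)

run_22 <- (temp > 22)                        # FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
rle_22 <- rle(run_22)                        # runs: F(1) T(2) F(1) T(3) F(1)
lengths_22 <- rle_22$lengths[rle_22$values]  # 2 3 -> lengths of the TRUE runs only
max(lengths_22)                              # 3  -> longest consecutive stretch above 22
```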
That's excellent. Should be easy to embed into dplyr.
Thanks Jeff. I had never heard of run length encoding before. Definitely what I'm looking for. I'm still trying to figure out how to do it by featureid and year. I have a tbl grouped by featureid and year, but it doesn't fit nicely into a dplyr structure. I'll have to give it more thought to keep the featureid and years associated with the rle outputs.
Tried a couple things:
consecExceed <- byfeatureidYear %>%
  dplyr::mutate(month = as.numeric(format(date, "%m"))) %>%
  dplyr::filter(month >= 6 & month <= 8) %>%
  dplyr::mutate(exceed = (tempPredicted > temp.threshold),
                consecExceed = rep(rle(exceed)$lengths, rle(exceed)$lengths))
# doesn't work: invalid subscript type 'closure'
So I tried it outside dplyr:
consecExceed <- byfeatureidYear %>%
  dplyr::mutate(month = as.numeric(format(date, "%m"))) %>%
  dplyr::filter(month >= 6 & month <= 8) %>%
  dplyr::mutate(exceed = (tempPredicted > temp.threshold))

consecExceed$consecExceed <- rep(rle(consecExceed$exceed)$lengths,
                                 rle(consecExceed$exceed)$lengths)
It "works" but doesn't separate by featureid and year. I think a potential solution is to try adding ifelse statements based on the lagged featureid and year being equal to the current featureid and year combo (equivalent to if(id_year[i] == id_year[i-1])).
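For reference, one way to sketch that lag-based idea (column names here are hypothetical; the grouped approach later in the thread avoids needing this entirely):

```r
library(dplyr)

# Hypothetical sketch of the lagged-id idea: flag rows where the
# featureid/year combo differs from the previous row, then use the
# cumulative sum of those breaks as a group id for the runs.
df <- data.frame(featureid = c("A", "A", "A", "B", "B"),
                 year      = c(2010, 2010, 2011, 2011, 2011))
df <- df %>%
  mutate(id_year   = paste(featureid, year, sep = "_"),
         new_combo = id_year != lag(id_year, default = id_year[1]),
         combo_id  = cumsum(new_combo))
df$combo_id  # 0 0 1 2 2
```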
After some searching based on the error I found a simple solution (http://stackoverflow.com/questions/23820491/dplyr-error-object-not-found-using-rle-in-mutate). It seems to be a small bug in dplyr that is corrected in the development version, but there's a workaround for the current release: convert the dataframe/tbl from tbl_df to tbl_dt, and the mutate function works.
consecExceed <- tbl_dt(byfeatureidYear) %>%
  dplyr::group_by(featureid, year) %>%
  dplyr::mutate(month = as.numeric(format(date, "%m"))) %>%
  dplyr::filter(month >= 6 & month <= 8) %>%
  dplyr::mutate(exceed = (tempPredicted > temp.threshold),
                consecExceed = rep(rle(exceed)$lengths, rle(exceed)$lengths))
I was thinking something along the lines of this:
x <- data.frame(featureid = rep(c("A", "B"), each = 90),
                year = rep(rep(c(2010, 2011, 2012), each = 30), times = 2),
                x = runif(180))

get_run <- function(y, thresh = 0.5) {
  runs_ <- y$x > thresh                # flag values above threshold
  rle_ <- rle(runs_)                   # run length encode the TRUE/FALSE runs
  lens_ <- rle_$lengths[rle_$values]   # keep lengths of the TRUE runs only
  d <- data.frame(max_run = max(lens_), mean_run = mean(lens_))
  names(d) <- paste(names(d), '_', thresh, sep = '')
  d
}

# get max/mean run of each featureid, year
group_by(x, featureid, year) %>%
  do(get_run(., thresh = 0.6))
Note I'm using a bunch of random uniform values, so threshold is between 0 and 1, but same idea.
With your consecExceed example, wouldn't the consecExceed column contain runs of each length where the value in each run is also the length (e.g. [3, 3, 3, 4, 4, 4, 4, 2, 2, ...])? Then if you compute the mean of that you're not actually computing the mean run length, since each length is weighted by, well, its length. In other words, if the lengths of three runs are 3, 4, 5, then mean(c(3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5)) != 4. For max run length, though, it doesn't matter.
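A quick check of that weighting effect (run lengths made up):

```r
# Three runs with lengths 3, 4, 5
lens <- c(3, 4, 5)
expanded <- rep(lens, lens)  # 3 3 3 4 4 4 4 5 5 5 5 5 -- each length repeated by itself
mean(lens)                   # 4        -> the true mean run length
mean(expanded)               # 4.166667 -> longer runs get extra weight
max(expanded)                # 5        -> max is unaffected
```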
Also, quick tip: use month = lubridate::month(date) instead of as.numeric(format(date, "%m")) and save your sanity.
Oh man, I was going to have to do something clunky like filter to event_start and then take the mean. This is much nicer. I've never used the do() function before.
I just started using lubridate and it's really nice. I haven't gone back to the legacy code to update it (the start of this function was copied from another function).
Strange, it works on your test set but I'm having trouble with the real data.
calcConsecExceed <- function(y, threshold = 22) {
  exceed <- y$tempPredicted > threshold
  consec <- rle(exceed)
  consecExceed <- consec$lengths[consec$values]
  d <- data.frame(maxConsec = max(consecExceed, na.rm = TRUE),
                  meanConsec = mean(consecExceed, na.rm = TRUE))
  names(d) <- paste(names(d), '_', threshold, sep = '')
  d
}

# get max/mean run of each featureid, year
test <- group_by(tempDataSyncS, featureid, year) %>%
  do(calcConsecExceed(., threshold = 22))
I get the following:
|====================================================|100% ~0 s remaining
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In max(consecExceed, na.rm = T) :
no non-missing arguments to max; returning -Inf
2: In max(consecExceed, na.rm = T) :
no non-missing arguments to max; returning -Inf
3: In max(consecExceed, na.rm = T) :
no non-missing arguments to max; returning -Inf
> dim(test)
[1] 1872 4
> dim(tempDataSyncS)
[1] 228287 125
> head(test)
Source: local data frame [6 x 4]
Groups: featureid, year
featureid year maxConsec_22 meanConsec_22
1 730482 2009 4 2
2 730531 2004 -Inf NaN
3 730675 2009 -Inf NaN
4 730912 2009 -Inf NaN
5 730916 2009 -Inf NaN
6 731705 2004 -Inf NaN
> tail(test)
Source: local data frame [6 x 4]
Groups: featureid, year
featureid year maxConsec_22 meanConsec_22
1 892941 2012 5 1.620690
2 892941 2013 5 1.242424
3 894592 2011 6 1.590909
4 894592 2012 6 1.666667
5 894592 2013 3 1.444444
6 895539 1999 3 1.222222
It's probably because some of the featureid/years are all NA. You'll probably have to modify the function to check for that, something like this:
calcConsecExceed <- function(y, threshold = 22) {
  if (all(is.na(y$tempPredicted))) {
    maxConsec <- NA
    meanConsec <- NA
  } else {
    exceed <- y$tempPredicted > threshold
    consec <- rle(exceed)
    consecExceed <- consec$lengths[consec$values]
    maxConsec <- max(consecExceed, na.rm = TRUE)
    meanConsec <- mean(consecExceed, na.rm = TRUE)
  }
  d <- data.frame(maxConsec = maxConsec, meanConsec = meanConsec)
  names(d) <- paste(names(d), '_', threshold, sep = '')
  d
}
This would also be a good opportunity to write some unit tests. Make some dummy runs and make sure it returns the values you think it should, and also handles NA values correctly.
NAs were my first guess as well, but I checked and this data frame didn't have any NAs. It took a while, but I realized it's trying to take means and maxes when none of the values in the group are above the threshold, hence producing -Inf in some cases and NaN in others, with "correct" answers for the rest. Without a better understanding of the inner workings of these functions I'm still not quite sure why these were produced, but this definitely fixed the error.
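For what it's worth, the mechanism can be reproduced in isolation: when no values in a group exceed the threshold, indexing the rle lengths by the values leaves an empty vector, and max/mean of an empty vector give -Inf and NaN.

```r
exceed <- c(FALSE, FALSE, FALSE)       # a group where nothing exceeds the threshold
consec <- rle(exceed)
lens <- consec$lengths[consec$values]  # integer(0): there are no TRUE runs to index
max(lens)   # -Inf, with the "no non-missing arguments to max" warning
mean(lens)  # NaN (sum 0 divided by length 0)
```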
I spent a good amount of time this morning reading about testing and the testthat package. I'd been wanting to do this, so it's good that this was the incentive I needed. It's quite hard to write tests. At first I was trying to make them much too complicated, but I eventually came up with some simple ones and put them in R scripts in the /inst/tests/ folder in the conteStreamTemperature package. Here's the resulting function:
calcConsecExceed <- function(y, threshold = 22) {
  if (all(is.na(y$tempPredicted))) {
    warning("all values are NA")
    maxConsec <- NA
    meanConsec <- NA
  } else if (all(y$tempPredicted <= threshold)) {
    maxConsec <- 0
    meanConsec <- 0
  } else {
    exceed <- y$tempPredicted > threshold
    consec <- rle(exceed)
    consecExceed <- consec$lengths[consec$values]
    maxConsec <- max(consecExceed, na.rm = TRUE)
    meanConsec <- mean(consecExceed, na.rm = TRUE)
  }
  d <- data.frame(maxConsec, meanConsec)
  names(d) <- paste(names(d), '_', threshold, sep = '')
  return(d)
}
I don't love the nested if/else, but it didn't seem appropriate for switch, so I went with it. Here are a few tests I came up with, although there are many more I could add (single NA, class type, sort order, etc.):
# test that a 0 is returned if all values <= threshold
test_that("a zero is returned if all values below threshold", {
  df <- data.frame(tempPredicted = runif(10, 2, 20))
  consecExceed <- calcConsecExceed(df, 22)
  expect_equal(consecExceed$maxConsec_22, 0)
  expect_equal(consecExceed$meanConsec_22, 0)
})
# test if all values > threshold
test_that("the full series length is returned if all values above threshold", {
  df <- data.frame(tempPredicted = runif(10, 23, 30))
  consecExceed <- calcConsecExceed(df, 22)
  expect_equal(consecExceed$maxConsec_22, length(df$tempPredicted))
  expect_equal(consecExceed$meanConsec_22, length(df$tempPredicted))
})
# test if all values are NA
test_that("NA is returned when all values are NA", {
  df <- data.frame(tempPredicted = rep(NA, times = 10))
  consecExceed <- calcConsecExceed(df, 22)
  expect_true(is.na(consecExceed$maxConsec_22))
  expect_true(is.na(consecExceed$meanConsec_22))
  expect_warning(calcConsecExceed(df, 22))
})
Cool, that makes sense; I guess you can't know how long a run is if you don't really know when it started or ended.
Unit tests look good. You might add one to make sure it's computing the mean and max correctly, like:
test_that("returns correct mean and max consecutive lengths", {
  df <- data.frame(tempPredicted = c(10, 10, 25, 25, 25, 10, 10, 25, 25, 10, 25))
  consecExceed <- calcConsecExceed(df, 22)
  expect_equal(consecExceed$maxConsec_22, 3)
  expect_equal(consecExceed$meanConsec_22, 2)
})
Good idea. Thanks.
Another thing we talked about in the Maine group meeting was to add another set of derived metrics that represented runs of consecutive days where the stream temperature exceeded some threshold. Maybe something like mean and max number of consecutive days per year where temp > 22 deg C, since that's closely tied to fish stress.