Open nip5 opened 5 years ago
@CaitlinA @nip5
Did PR https://github.com/DrylandEcology/rSFSW2/pull/360 close this issue?
I don't think that this issue description is helpful, e.g., there is no reproducible example; there is no example output of the error message, it doesn't state which database (out of >= 3), etc.
Also claiming that this is an issue for (all) Linux machines is not helpful; travis-ci which we run with Ubuntu 14.04.5 LTS
did not produce this issue (https://travis-ci.org/DrylandEcology/rSFSW2/builds/502228791 passing the previous PR for rSFW2 v3.1.3): i.e.,
[1] "rSFSW2's 'make_dbOutput': started at 2019-03-05 20:38:51"
[1] "rSFSW2's 'make_dbOutput': ended after 0.75 s"
Maybe it fails on your specific Linux machine, but then writing what setup you use would be more helpful for debugging. As it is, I see no specific reason why this was an issue with the R code of rSFSW2 and not with your specific setup.
Commit bdc60628ab20f9ab81335b3edbff37dd86130273 introduces new code to replace any NA
(which carry a specific meaning in R) in weather folder names with a text string (of no specific meaning). I find this confusing without further explanation.
We used to remove NAs in weather folder names with na.exclude()
(https://github.com/DrylandEcology/rSFSW2/blob/fc063547f1a79897378b558124a9928b800edd9c/R/OutputDatabase.R#L1843). Contrary to that old code snippet, it appears that when I changed our test project about 10 months ago, I changed how we re-write the InputMaster
csv, i.e., we started to write "NA"
instead of NA
to the csv
file (https://github.com/DrylandEcology/rSFSW2/commit/db3948e6aa746d012dbc4f25b0e9bb2a4186305f). Thus, the reference output database contains an entry for site 4 "NA". I guess that maybe my local computer and travis-ci, translate this into "NA"
while yours reads it as NA
from disk? The "NA"
has ended up in the reference output database because it was no longer removed by na.exclude
. Thus, if this is the problem, then I guess that instead of laboriously rewriting the NA
s into "NA"
s, simply removing the na.exclude
(maybe with an additional as.character(SFSW2_prj_inputs[["SWRunInformation"]]$WeatherFolder)
if needed) would produce the same result as the reference output database, but much easier and faster?
Since, originally these NA
s indicated non-included sites, we may want to consider whether we prefer eventually to revert to the old behavior, i.e., not include those sites in the weatherfolders
table?
Either way, looping over vectors tends to be slow and in our use cases site_data
could have many rows; however, the code is executed only once and thus it does not matter much for the overall performance of a simulation run. Still, you may want to read up on vectorized code (e.g., https://datascienceplus.com/strategies-to-speedup-r-code/), i.e., the code in the commit
for (i in seq(sites_data$WeatherFolder)){
if (is.na(sites_data$WeatherFolder[i])){
sites_data$WeatherFolder[i] <- "NA"
}
}
is equivalent to
sites_data$WeatherFolder[is.na(sites_data$WeatherFolder)] <- "NA"
Here, an example with 100,000 sites -- a reasonable setup for one of our simulation experiments: the for-loop is (on my machine) over 1,300 times slower...
sites_data <- data.frame(WeatherFolder = as.character(sample(1e5)),
stringsAsFactors = FALSE)
sites_data$WeatherFolder[c(10, 50000)] <- NA
f1 <- function(x) {
for (i in seq(x$WeatherFolder)){
if (is.na(x$WeatherFolder[i])){
x$WeatherFolder[i] <- "NA"
}
}
x
}
f2 <- function(x) {
x$WeatherFolder[is.na(x$WeatherFolder)] <- "NA"
x
}
identical(f1(sites_data), f2(sites_data))
microbenchmark::microbenchmark(f1(sites_data), f2(sites_data))
But again, simply removing na.exclude
from the call in line 1843 should have done the trick as well, i.e.,
temp <- unique(SFSW2_prj_inputs[["SWRunInformation"]]$WeatherFolder)
Test project tests fail on Linux machines due to "NA" being coerced to NA in Linux and thus site 4 is skipped on database generation.