DrylandEcology / rSFSW2

rSFSW2: A R package to create soil water balance simulation experiment
GNU General Public License v3.0
8 stars 7 forks source link

Test project tests fail on Linux machines #359

Open nip5 opened 5 years ago

nip5 commented 5 years ago

Test project tests fail on Linux machines due to "NA" being coerced to NA in Linux and thus site 4 is skipped on database generation.

dschlaep commented 5 years ago

@CaitlinA @nip5

Did PR https://github.com/DrylandEcology/rSFSW2/pull/360 close this issue?

I don't think that this issue description is helpful, e.g., there is no reproducible example; there is no example output of the error message, it doesn't state which database (out of >= 3), etc.

Also claiming that this is an issue for (all) Linux machines is not helpful; travis-ci which we run with Ubuntu 14.04.5 LTS did not produce this issue (https://travis-ci.org/DrylandEcology/rSFSW2/builds/502228791 passing the previous PR for rSFW2 v3.1.3): i.e.,

[1] "rSFSW2's 'make_dbOutput': started at 2019-03-05 20:38:51"
[1] "rSFSW2's 'make_dbOutput': ended after 0.75 s"

Maybe it fails on your specific Linux machine, but then writing what setup you use would be more helpful for debugging. As it is, I see no specific reason why this was an issue with the R code of rSFSW2 and not with your specific setup.

Commit bdc60628ab20f9ab81335b3edbff37dd86130273 introduces new code to replace any NA (which carry a specific meaning in R) in weather folder names with a text string (of no specific meaning). I find this confusing without further explanation.

We used to remove NAs in weather folder names with na.exclude() (https://github.com/DrylandEcology/rSFSW2/blob/fc063547f1a79897378b558124a9928b800edd9c/R/OutputDatabase.R#L1843). Contrary to that old code snippet, it appears that when I changed our test project about 10 months ago, I changed how we re-write the InputMaster csv, i.e., we started to write "NA" instead of NA to the csv file (https://github.com/DrylandEcology/rSFSW2/commit/db3948e6aa746d012dbc4f25b0e9bb2a4186305f). Thus, the reference output database contains an entry for site 4 "NA". I guess that maybe my local computer and travis-ci, translate this into "NA" while yours reads it as NA from disk? The "NA" has ended up in the reference output database because it was no longer removed by na.exclude. Thus, if this is the problem, then I guess that instead of laboriously rewriting the NAs into "NA"s, simply removing the na.exclude (maybe with an additional as.character(SFSW2_prj_inputs[["SWRunInformation"]]$WeatherFolder) if needed) would produce the same result as the reference output database, but much easier and faster?

Since, originally these NAs indicated non-included sites, we may want to consider whether we prefer eventually to revert to the old behavior, i.e., not include those sites in the weatherfolders table?

Either way, looping over vectors tends to be slow and in our use cases site_data could have many rows; however, the code is executed only once and thus it does not matter much for the overall performance of a simulation run. Still, you may want to read up on vectorized code (e.g., https://datascienceplus.com/strategies-to-speedup-r-code/), i.e., the code in the commit

  for (i in seq(sites_data$WeatherFolder)){
    if (is.na(sites_data$WeatherFolder[i])){
      sites_data$WeatherFolder[i] <- "NA"
    }
  }

is equivalent to

sites_data$WeatherFolder[is.na(sites_data$WeatherFolder)] <- "NA"

Here, an example with 100,000 sites -- a reasonable setup for one of our simulation experiments: the for-loop is (on my machine) over 1,300 times slower...

sites_data <- data.frame(WeatherFolder = as.character(sample(1e5)),
  stringsAsFactors = FALSE)
sites_data$WeatherFolder[c(10, 50000)] <- NA

f1 <- function(x) {
  for (i in seq(x$WeatherFolder)){
    if (is.na(x$WeatherFolder[i])){
      x$WeatherFolder[i] <- "NA"
    }
  }
  x
}

f2 <- function(x) {
  x$WeatherFolder[is.na(x$WeatherFolder)] <- "NA"
  x
}

identical(f1(sites_data), f2(sites_data))

microbenchmark::microbenchmark(f1(sites_data), f2(sites_data))

But again, simply removing na.exclude from the call in line 1843 should have done the trick as well, i.e.,

      temp <- unique(SFSW2_prj_inputs[["SWRunInformation"]]$WeatherFolder)