ebimodeling / biocro_regional

Enabling BioCro to run regionally, starting with midwest miscanthus
University of Illinois/NCSA Open Source License

Run biocro_regional with calibrated configuration file #16

Open haohuuiuc opened 8 years ago

haohuuiuc commented 8 years ago

Hi David, I guess I only need to replace config.xml under the run/ directory and execute job.sh to run biocro_regional on ROGER. But these files and folders are generated by RStudio from the VM; if we would like to submit the job through the RStudio interface, should we put the calibrated configuration file somewhere in the VM?

dlebauer commented 8 years ago

PEcAn doesn't currently have the capacity to do these analyses (using one set of parameters and propagating met uncertainty). Indeed, I used a very straightforward MLE method rather than the Bayesian calibration in PEcAn, because we aren't propagating the parameter uncertainty that has been a central focus of PEcAn development to date. It would be great to build this in, but for now, let's work backward from the analysis that we implemented on ROGER.

So in short, let's not worry about the VM for now, until we want to calibrate the model again.

Does that sound sensible?


haohuuiuc commented 8 years ago

Still a little confused about running biocro_regional without the VM. What we did before was configure illinois_mxg_settings.xml on the VM and then submit the job there. Then we are able to see the output on ROGER, as well as config.xml and job.sh under the run folder for each job. I know we can directly execute job.sh to run biocro_regional on ROGER. For the calibrated configuration file, I guess I should choose one of the job folders (e.g. 99000000002), replace config.xml with the calibrated one, and then run job.sh. Is that the correct way to run biocro_regional directly on ROGER?

dlebauer commented 8 years ago

The key bit is to run the script

/gpfs_scratch/haohu3/pecan/models/biocro/inst/regionalbiocro_met_uncertainty.Rscript

using the calibrated config.xml

I forget how we submitted this last time - the key bit is that we need to parallelize the for loop so that it doesn't take forever, and doesn't reserve a whole node while using only a single core.


haohuuiuc commented 8 years ago

That rings a bell. Last time, we added the ability to sample from the met data and select a random seed in the R code. I remember we discussed parallelizing the code with a parallel R loop (e.g., foreach) to improve performance, but we did not implement it yet. I think I can take it from here. Thanks!

haohuuiuc commented 8 years ago

David, have the calibrated results been adjusted by the decreasing factor for each year over time? If not, can you remind me where I should find those parameters?

dlebauer commented 8 years ago

I have not added this to the script. Eventually, I plan to add it to the run.biocro function, but I have not done so yet. At this point, it would be easier (less error-prone, and avoids having to redo the runs) to revise the output after the runs are finished.

But you can see the code used to fix the data in vignettes/regional_miscanthus_calibration.Rmd,

the key bit is:

ricker <- function(t, phi1 = 1, phi2 = 1, phi3 = 1) {
  y <- phi1 * (t^phi2) * exp(-phi3 * t)
  return(y)
}

correct.yield <- function(biocro_result){
  start <- min(biocro_result$year)

  # Lesur correction
  corrected_lesur <- ricker(0:20, 4.12, 1.12, 0.13)
  age_correction <- corrected_lesur/max(corrected_lesur)

  # from Miguez et al 2013
  # http://onlinelibrary.wiley.com/doi/10.1111/j.1757-1707.2011.01150.x/epdf; 
  # multiplied by 0.67 to take account of losses in senescence, postsenescence, 
  # and during harvest (Clifton-Brown et al., 2004; Heaton et al., 2008).
  harvest_correction <- 0.67 
  correction <- age_correction * harvest_correction

  correction <- data.table(year = start + 0:(length(correction) - 1), adjust = correction)

  a <- merge(biocro_result, correction, by = 'year') 
  a[, `:=`(corrected_yield = Yield * adjust)]
  return(a)
}

## In the vignette, I use lapply on a list of results, but this is how the function would work:
correct.yield(biocro_result)
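As a quick sanity check on the combined correction (a sketch in base R, restating the numbers from the function above): adjust should be 0 in the planting year and peak at the 0.67 harvest factor.

```r
# restates the correction from the snippet above, base R only
ricker <- function(t, phi1 = 1, phi2 = 1, phi3 = 1) phi1 * t^phi2 * exp(-phi3 * t)
corrected_lesur <- ricker(0:20, 4.12, 1.12, 0.13)
adjust <- corrected_lesur / max(corrected_lesur) * 0.67  # age * harvest correction
adjust[1]    # 0: no harvestable yield in the planting year (t = 0)
max(adjust)  # 0.67: even the peak year keeps only 67% of modeled biomass
```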

Looking through the code, I notice that if you are using the version of PEcAn in your directory on scratch, we need to add met.uncertainty = TRUE to the call to run.biocro in the file pecan/models/biocro/inst/regionalbiocro_met_uncertainty.Rscript

haohuuiuc commented 8 years ago

Thanks, David! I've already applied the yield correction code. One thing I noticed is that age_correction does not reach its peak value until year 10. Tao suggested that the peak harvest year for Miscanthus should be around year 4. Does that look right?

> age_correction
 [1] 0.0000000 0.2414986 0.4609035 0.6373440 0.7724084 0.8708191 0.9378916 0.9787567 0.9980848 1.0000000 0.9880754 0.9653652 0.9344510
[14] 0.8974945 0.8562890 0.8123083 0.7667518 0.7205845 0.6745732 0.6293172 0.5852762
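The peak age can also be read off analytically (a sketch restating the ricker function from above): phi1 * t^phi2 * exp(-phi3 * t) is maximized at t = phi2 / phi3, which for the Lesur parameterization is 1.12 / 0.13 ≈ 8.6, consistent with the printed vector peaking at its tenth element (t = 9).

```r
ricker <- function(t, phi1 = 1, phi2 = 1, phi3 = 1) phi1 * t^phi2 * exp(-phi3 * t)
1.12 / 0.13                                 # analytic peak age, ~8.6 years
t <- 0:20
t[which.max(ricker(t, 4.12, 1.12, 0.13))]   # 9, the nearest integer age on the grid
```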
dlebauer commented 8 years ago

Hmm, we are both correct. I got that function from a review by Lesur et al.; here is the figure:

image

I used that function to calibrate the model because I didn't want to use the same data for the adjustment and calibration. I need to think about this and look back at Becky Arundale's data and will get back to you.

dlebauer commented 8 years ago

In short, it wouldn't be wrong to use this correction, though we can certainly do a better job. It's mostly a matter of fitting the ricker model to the Illinois data with a random effect of site. If I give you the Arundale data, would you be able to do this?

taolin1 commented 8 years ago

If I am not mistaken, I recall the age correction rate you shared with me before was different.

Can we use our existing Illinois data to generate an age correction rate? The curve by Lesur et al. shows a very slow increase in the first 5 years and a slow decrease after year 10, while the other curve I recall had a fast increase in the first 3 years but a fast decrease after year 5.

I agree with you that both methods are correct in theory. Shall we find a time to meet to finalize this before moving on to climate data set generation and the associated group job submissions?


dlebauer commented 8 years ago

I think it is clear that we should use the Arundale data. It would be better to add the ricker model parameters to the calibration, but in lieu of that, we can fit it given the Arundale data. It is mostly a challenge to fit it appropriately, and to make sure that we don't 'double dip' with the calibration data.

There are two reasons the Lesur curve may have that trend. First, it could be influenced by a few sites with long time series. Second, it is likely that in Europe the crops and pathogens grow more slowly and do not deplete resources as quickly.

haohuuiuc commented 8 years ago

I can try to do the fitting. I guess the Arundale data will look like the points in the plot, and what I need to do is find the parameters of a fitted curve (similar to the ricker model), is that correct?

taolin1 commented 8 years ago

Another thread for model simulation: would BioCro consider the impact of inter-annual weather variations? Say we have two sets of 4-year weather data, (1999, 2000, 2001, 2002) versus (1997, 1999, 2001, 2002); would the yield in 2002 be the same in the two sets?

Hao told me that the results look the same. If that is the case, how shall we incorporate the inter-annual weather variations in the analysis?

dlebauer commented 8 years ago

@haohuuiuc

OK. Instead of using corrected_lesur, try using this parameterization for the correction (other than replacing corrected_lesur with corrected_arundale, the above code should work now):

corrected_arundale <- ricker(0:20, 5.2, 2.6, 0.46)

This is how I got the parameters (directly from the figure below)

d <- data.frame(age = 1:10, arundale = c(5, 10, 21.5, 31.2, 34.5, 33.3, 29.5, 25.1, 23, 23.2))
nls(arundale ~ a * age^b * exp(-c * age), data = d, start = list(a = 4, b = 2, c = 0.5))

Note that this parameterization reaches its maximum at about year 5 (approximately b / c = 5.6), like the figure below:

t <- seq(0,10, by = 0.1)
y <- ricker(t, 5.2, 2.6, 0.46)
t[which.max(y)]

2.6 / 0.46

Figure from Arundale 2013 (Miscanthus on right):

image

dlebauer commented 8 years ago

@taolin1 @haohuuiuc the weather history should determine the yield of a given year. It is possible that the randomization code I put into run.biocro is not working, but the best place to start would be to open a new issue with the code used to determine otherwise.

haohuuiuc commented 8 years ago

David, I was able to do the yield correction based on the corrected_arundale values provided in your code. Here is how the result looks:

yield_from_1996_2010

In this case, I am doing a 15-year BioCro model run, from 1996 to 2010, without random sampling. Year 1996 is the planting year, and the peak yield occurs in year 2002. The maximum corrected yield reaches 18.382 (the unit is Mg/ha, I guess?). Do you think this result looks correct?

dlebauer commented 8 years ago

It looks too low, by about 1/3. Can you point me to the raw model output and the code that you used to generate the figures?

haohuuiuc commented 8 years ago

The annual.result.RData is on the following path on ROGER, /gpfs_scratch/haohu3/pecan/models/biocro/inst/IL/out/annual.result.RData

The R code I have is as follows,

require(data.table)
require(sp)
require(raster)
load("annual.result.RData")

# Correct yield
ricker <- function(t, phi1 = 1, phi2 = 1, phi3 = 1) {
  y <- phi1 * (t^phi2) * exp(-phi3 * t)
  return(y)
}

correct.yield <- function(biocro_result){
  start <- min(biocro_result$year)

  # lesur correction
  #corrected_lessur <- ricker(0:20, 4.12, 1.12, 0.13)
  corrected_arundale <- ricker(0:20, 5.2, 2.6, 0.46)
  age_correction <- corrected_arundale/max(corrected_arundale)

  # from Miguez et al 2013
  # http://onlinelibrary.wiley.com/doi/10.1111/j.1757-1707.2011.01150.x/epdf; 
  # multiplied by 0.67 to take account of losses in senescence, postsenescence, 
  # and during harvest (Clifton-Brown et al., 2004; Heaton et al., 2008).
  harvest_correction <- 0.67 
  correction <- age_correction * harvest_correction

  correction <- data.table(year = start + 0:(length(correction) - 1), adjust = correction)

  a <- merge(biocro_result, correction, by = 'year') 
  a[, `:=`(corrected_yield = Stem * adjust)]
  return(a)
}
correct_yield <- correct.yield(annual.result)

# function to create a raster
createRaster <- function(df, brks, cols, name) {
  r <- raster(ncols=16, nrows=23, xmn=-91.50, xmx=-87.50, ymn=37.00, ymx=42.75)
  y <- df$lat
  x <- df$lon
  xy <- cbind(x,y)
  r0 <- rasterize(xy, r, df$corrected_yield, fun=mean)
  plot(r0, breaks=brks, col=cols, main=name)
  #plot(counties.il, add=T)
}

# Add raster layer of yield
yield.max <- max(correct_yield$corrected_yield)
yield.min <- min(correct_yield$corrected_yield)
step <- (yield.max - yield.min)/8
brks <- seq(yield.min, yield.max, by=step)
brks <- round(brks,1)
nb <- length(brks)-1 
cols <- rev(terrain.colors(nb))

par( mfrow = c(3, 5))
for (i in 1996:2010){
  createRaster(subset(correct_yield, year == i), brks, cols, i)
}
dlebauer commented 8 years ago

I am getting

Error in plot(counties.il, add = T) : error in evaluating the argument 'x' in selecting a method for function 'plot': Error: object 'counties.il' not found

I've also edited your comment to include the required libraries.

haohuuiuc commented 8 years ago

Sorry that I forgot to comment out this line,

plot(counties.il, add=T)

It is just for adding the county boundaries to the map. To make this line work, you will need to load a shapefile first,

library("rgdal")
counties.il <- readOGR("./shp", "il") # "il" is the shapefile layer in the ./shp directory
dlebauer commented 8 years ago

In general, this isn't doing what I thought it was doing, and I will have to check.

  1. My fault - I forgot that yield should be (Stem + Leaf) * adjust. I can not recall if I did this in the calibration step.
  2. But in the current code, using Stem, I would expect that in year 6 corrected_yield would equal Stem. So the following plot should have all points on the 1:1 line, but doesn't:

    correct_yield[year == 2002, plot(Stem, corrected_yield)]; abline(0, 1)

So I will have to take a look to figure out where the errors are.

haohuuiuc commented 8 years ago

Thanks for exploring this. If you think it is necessary, we can have a meeting to work around this.

dlebauer commented 8 years ago

To parallelize over the lat/lon points, make sure to register enough cores in your qsub statement, e.g.

qsub -l walltime=03:00:00,nodes=1:ppn=20

then do this in R:

library(foreach)
library(doMC)
registerDoMC(20)  # match ppn in the qsub request
foreach(point = 1:nrow(latlon)) %dopar% {
  # body of the existing for loop, i.e. the run.biocro call for this point
}
haohuuiuc commented 8 years ago

David, can you help me check the following R code on ROGER when you have time? /gpfs_scratch/haohu3/pecan/models/biocro/inst/regionalbiocro_met_uncertainty.Rscript. I don't think the run.biocro function takes the met.uncertainty parameter correctly. The corresponding run.biocro.R is at /gpfs_scratch/haohu3/pecan/models/biocro/R.

dlebauer commented 8 years ago

What error occurs? Please see http://adv-r.had.co.nz/Reproducibility.html

haohuuiuc commented 8 years ago

Error in BioGro(WetDat = WetDat, day1 = day1, dayn = dayn, soilControl = soil.parms, :
  day1 and dayn should be between 0 and 365
Calls: run.biocro -> BioGro
In addition: Warning messages:
1: In max(doy) : no non-missing arguments to max; returning -Inf
2: In min(doy) : no non-missing arguments to min; returning Inf
Called from: stop("day1 and dayn should be between 0 and 365")
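(As an aside, the two warnings point at max()/min() being called on an empty doy vector, i.e. no met rows were selected for the requested date range; in R, max and min on an empty vector warn and return -Inf / Inf.)

```r
# minimal reproduction of the warnings in the traceback above:
# max()/min() on an empty vector warn and return -Inf / Inf
max(integer(0))  # -Inf
min(integer(0))  # Inf
```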

Here is the R code for regionalbiocro_met_uncertainty.Rscript

#!/usr/bin/env Rscript

args   <- commandArgs(trailingOnly = TRUE)
rundir <- args[1]
outdir <- args[2]

#library(PEcAn.all)
require(PEcAn.data.land)
require(PEcAn.BIOCRO)
require(BioCro)
require(PEcAn.data.atmosphere)
require(PEcAn.utils)
require(lubridate)
require(ncdf4)      # nc_open, ncvar_get
require(data.table) # data.table() used below

options(error = browser)

if(interactive()) {

  library(PEcAn.settings)
  settings <- read.settings("/gpfs_scratch/haohu3/pecan/models/biocro/inst/IL/illinois_mxg_settings.xml")
  runid <- tail(readLines(file.path(settings$rundir, "runs.txt")), n = 1)
  rundir <- file.path(settings$rundir, runid)
  outdir <- file.path(settings$outdir, "out", runid)
  point <- 1
}

config <- read.biocro.config(file.path(rundir, "calibrated_config.xml"))

met.nc  <- nc_open(config$run$met.file)
soil.nc <- nc_open(config$run$soil.file)
# atmco2.nc <- nc_open(file.path(inputdir, "co2/CO2_Global_HD_v1.nc"))

lat <- ncvar_get(met.nc, "latitude")
lon <- ncvar_get(met.nc, "longitude")
latlon <- expand.grid(lat = lat, lon = lon)

annual.result <- NULL
biocro_result <- NULL

for(i in 1:1){
 for(point in 1:1) {
  set.seed(i)
  lat <- latlon$lat[point]
  lon <- latlon$lon[point]
  out <- run.biocro(lat, lon, met.nc = met.nc, soil.nc = soil.nc, config = config, met.uncertainty=TRUE)
  #out <- run.biocro(lat, lon, met.nc = met.nc, soil.nc = soil.nc, config = config)
  annual.result <- rbind(annual.result, out$annually)
  save(annual.result, file = file.path(outdir, "annual.result.RData"))

  point.outdir <- file.path(outdir, i, paste0(lat, 'x', lon))
  dir.create(point.outdir, showWarnings = FALSE, recursive = TRUE)

  hourly <- out$hourly
  daily <- out$daily
  #save(hourly, file = file.path(point.outdir, 'hourly.result.RData'))
  #save(daily, file = file.path(point.outdir, 'daily.result.RData'))

  biocro_result <- rbind(biocro_result, data.table(lat = lat, lon = lon, daily))

  model2netcdf.BIOCRO(result = hourly, genus = config$pft$type$genus, outdir = point.outdir,
                      lat = lat, lon = lon)
  rm(hourly)
  rm(daily)
 }
}
save(biocro_result, file = file.path(outdir, "biocro_output.RData"))

By setting the met.uncertainty parameter to TRUE,

out <- run.biocro(lat, lon, met.nc = met.nc, soil.nc = soil.nc, config = config, met.uncertainty=TRUE)

We call run.biocro.R with an additional step to sample years from met data, here is the extra code in run.biocro.R to handle when met.uncertainty is TRUE,

 if(met.uncertainty == TRUE){
    met <- load.cfmet(met.nc, lat = lat, lon = lon, start.date = ceiling_date(as.POSIXct("1979-01-01"), "day"), end.date = floor_date(as.POSIXct("2010-12-31"),"day"))
    #met <- load.cfmet(met.nc, lat = lat, lon = lon, start.date = start.date, end.date = end.date)

    #year.sample <- met[,list(year = sample(unique(year), size = 15))]
    #met <- met[year %in% year.sample$year]
    year.sample <- list(year = sample(unique(met$year), size = 15))
    met <- met[met$year %in% year.sample$year,]
  } else {
    met <- load.cfmet(met.nc, lat = lat, lon = lon, start.date = start.date, end.date = end.date)
  }
dlebauer commented 8 years ago

Thanks. I had to fix a bunch of things in the code to get this working. They are in commit pecanproject/pecan@c508b5bf79808bd1e82fc7c50dfa089448c56423.

It should work now after you pull changes from github (or copy regional_met_uncertainty.Rscript and run.biocro.R from /home/dlebauer/dev/pecan/biocro) and rebuild the PEcAn.BIOCRO module

e.g.

cd /gpfs_scratch/haohu3/pecan/
git reset --hard HEAD
git pull origin biocro-module2
R CMD INSTALL models/biocro

OR

rsync -r /home/dlebauer/dev/pecan/biocro /gpfs_scratch/haohu3/pecan/models/
R CMD INSTALL /gpfs_scratch/haohu3/pecan/models/biocro

THEN you can run the script

This is how I ran it for testing

/home/dlebauer/pecan/models/biocro/inst/regional_met_uncertainty.Rscript /home/dlebauer/pecan_remote/mxg_calibration/Brownstown_Research/run/28523/ /tmp/

and the results in /tmp/annual.result.RData look okay!

but you will probably need to change the name of the calibrated_config.xml file to config.xml and place it in some run directory.

haohuuiuc commented 8 years ago

Thanks, David, for the quick fix. I got the code working. One thing I am curious about is that the uncertainty run generated 13 records for each scenario instead of 15 (the number of samples set in the sample function) in my test run. The output can be found here:

/gpfs_scratch/haohu3/pecan/models/biocro/inst/IL/out/annual.result.RData
dlebauer commented 8 years ago

The easy fix is to change replace = TRUE to replace = FALSE.

haohuuiuc commented 8 years ago

I see, it's due to duplicated year values during the sampling. Thanks!
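A minimal base-R illustration of that collapse (hypothetical year range): the %in% filter keeps each calendar year's rows at most once, so duplicate draws silently shrink the effective sample.

```r
set.seed(1)
years <- 1979:2010                                       # hypothetical met record
year.sample <- sample(years, size = 15, replace = TRUE)  # may draw the same year twice
length(unique(year.sample))                              # <= 15; equals 15 only without duplicates
kept <- years[years %in% year.sample]                    # what the met filter keeps
length(kept)                                             # same as the number of unique draws
```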

dlebauer commented 8 years ago

And I've pushed (but not extensively tested) a fix that should allow sampling with replacement: PecanProject/pecan@b2ca2c120997ab503e4ae665cf2ac7ea76800c9d

Then cleaned it up a little bit for clarity here: PecanProject/pecan@13889d013d0e89016f8e00b1fdb6b2d674ba116f

haohuuiuc commented 8 years ago

David, I have generated the results for Champaign for 500 runs. To calculate the yield, should I use (Stem + Leaf) * adjust? The data is at the following path on ROGER:

/gpfs_scratch/haohu3/pecan/models/biocro/inst/IL/out

Please let me know if you find anything wrong with this result.

dlebauer commented 8 years ago

Yes, (Stem + Leaf) * adjust.


haohuuiuc commented 8 years ago

Thanks, David. I updated the result for Champaign here: /gpfs_scratch/haohu3/pecan/models/biocro/inst/IL/out_champaign on ROGER. I applied the age and harvest correction with the following R code:

# Correct yield
ricker <- function(t, phi1 = 1, phi2 = 1, phi3 = 1) {
  y <- phi1 * (t^phi2) * exp(-phi3 * t)
  return(y)
}

annual.result$yeari <- round(annual.result$yearindex / 10000)
correct.yield <- function(biocro_result){
  start <- min(biocro_result$yeari)

  # Arundale correction
  corrected_arundale <- ricker(0:20, 5.2, 2.6, 0.46)
  age_correction <- corrected_arundale/max(corrected_arundale)

  # from Miguez et al 2013
  # http://onlinelibrary.wiley.com/doi/10.1111/j.1757-1707.2011.01150.x/epdf; 
  # multiplied by 0.67 to take account of losses in senescence, postsenescence, 
  # and during harvest (Clifton-Brown et al., 2004; Heaton et al., 2008).
  harvest_correction <- 0.67 
  correction <- age_correction * harvest_correction

  correction <- data.table(yeari = start + 0:(length(correction) - 1), adjust = correction)

  a <- merge(biocro_result, correction, by = 'yeari') 
  a[, `:=`(corrected_yield = (Stem+Leaf) * adjust)]
  return(a)
}
correct_yield <- correct.yield(annual.result)

The yield now ranges from 0 to 19.08. Do you think this is still too low? If it looks okay, I will start the run for the entire state of Illinois.

dlebauer commented 8 years ago

Yes, it seems too low. I will review my calibration over the weekend.

dlebauer commented 8 years ago

@haohuuiuc OK. I have updated the calibration step to calibrate against the mean yields for each site, rather than each year independently, and have selected the parameter set that minimizes the RMSE over the training / calibration data: https://gist.github.com/dlebauer/1cdcda5bdc7ce2818dc8

Give this one a shot.

haohuuiuc commented 8 years ago

Thanks, David. After applying the new config.xml, the maximum corrected yield increases to 30.92 for 500 runs in Champaign. Do you think it now looks reasonable? The raw yield data is on ROGER at /gpfs_scratch/haohu3/pecan/models/biocro/inst/IL/out

dlebauer commented 8 years ago

Yes that is reasonable. Go for it! After IL we can do the US.
