PecanProject / pecan

The Predictive Ecosystem Analyzer (PEcAn) is an integrated ecological bioinformatics toolbox.
www.pecanproject.org
Other
202 stars 235 forks source link

Error in PEcAn.data.atmosphere::download.CRUNCEP() #1649

Closed serbinsh closed 5 years ago

serbinsh commented 7 years ago

When using the latest develop branch I am getting this error when trying to use PEcAn.data.atmosphere::download.CRUNCEP()

> remotefunc <- function() {PEcAn.data.atmosphere::download.CRUNCEP(site_id=1000000105, lat.in=9.154, lon.in=-79.848, model=NULL, scenario=NULL, ensemble_member=NULL, overwrite=FALSE, outfolder='/data/Model_Output/pecan.output/dbfiles/CRUNCEP_site_1-105/', start_date='1990-01-01', end_date='2004-12-31')}
> remoteout <- remotefunc()
2017-09-13 08:30:22 INFO   [PEcAn.logger::logger.info] : 
   Downloading 
   /data/Model_Output/pecan.output/dbfiles/CRUNCEP_site_1-105//CRUNCEP.1990.nc 
2017-09-13 08:30:22 INFO   [PEcAn.logger::logger.info] : 
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_tair_1990_v1.nc4 
2017-09-13 08:30:32 INFO   [PEcAn.logger::logger.info] : 
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_lwdown_1990_v1.nc4 
2017-09-13 08:30:40 INFO   [PEcAn.logger::logger.info] : 
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_press_1990_v1.nc4 
2017-09-13 08:30:47 INFO   [PEcAn.logger::logger.info] : 
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_swdown_1990_v1.nc4 
2017-09-13 08:30:52 INFO   [PEcAn.logger::logger.info] : 
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_uwind_1990_v1.nc4 
2017-09-13 08:30:59 INFO   [PEcAn.logger::logger.info] : 
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_vwind_1990_v1.nc4 
Error in Rsx_nc4_get_vara_double: NetCDF: Index exceeds dimension bound
Var: time  Ndims: 1   Start: 0 Count: 1460
Error in ncvar_get_inner(d$dimvarid$group_id, d$dimvarid$id, default_missval_ncdf4(),  : 
  C function R_nc4_get_vara_double returned error
Calls: remotefunc -> <Anonymous> -> <Anonymous> -> ncvar_get_inner
Execution halted

Description

Bug fix. I think the issue is the use of "verbose" in the call to ncdf4::ncvar_def

It should be verbose= TRUE/FALSE but currently in the function it is:

      var.list[[j]] <- ncdf4::ncvar_def(name = as.character(var$CF.name[j]),
                                 units = as.character(var$units[j]),
                                 dim = dim,
                                 missval = -999,
                                 verbose = verbose)

Which leads to this error:

>       var.list[[j]] <- ncdf4::ncvar_def(name = as.character(var$CF.name[j]),
+                                  units = as.character(var$units[j]),
+                                  dim = dim,
+                                  missval = -999,
+                                  verbose = verbose)
Error in ncdf4::ncvar_def(name = as.character(var$CF.name[j]), units = as.character(var$units[j]),  :
  object 'verbose' not found

But if you change to verbose = TRUE

>       var.list[[j]] <- ncdf4::ncvar_def(name = as.character(var$CF.name[j]),
+                                  units = as.character(var$units[j]),
+                                  dim = dim,
+                                  missval = -999,
+                                  verbose = TRUE)
[1] "ncvar_def: entering"
[1] "ncvar_def: prec= float"
[1] "ncvar_def: making ncvar object for var eastward_wind"

But we probably want verbose = FALSE?

Context

Bug fix

Possible Implementation

NA

serbinsh commented 7 years ago

With verbose on

Loading required package: PEcAn.benchmark
>  download.CRUNCEP(site_id=2000000003, lat.in=9.280225, lon.in=-79.975527, model=NULL, scenario=NULL, ensemble_member=NULL, overwrite=TRUE, outfolder='/data/Model_Output/p2017-09-13 09:16:10 INFO   [PEcAn.logger::logger.info] : 01/01', end_date='2002/12/31')
   Downloading
   /data/Model_Output/pecan.output/dbfiles/CRUNCEP_site_2-3//CRUNCEP.2001.nc
2017-09-13 09:16:10 INFO   [PEcAn.logger::logger.info] :
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_tair_2001_v1.nc4
[1] "ncvar_def: entering"
[1] "ncvar_def: prec= float"
[1] "ncvar_def: making ncvar object for var air_temperature"
2017-09-13 09:16:17 INFO   [PEcAn.logger::logger.info] :
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_lwdown_2001_v1.nc4
[1] "ncvar_def: entering"
[1] "ncvar_def: prec= float"
[1] "ncvar_def: making ncvar object for var surface_downwelling_longwave_flux_in_air"
serbinsh commented 7 years ago

OK, so turing on verbose helped it through one year but then failed on the next?

2017-09-13 09:16:55 INFO   [PEcAn.logger::logger.info] :
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_qair_2001_v1.nc4
[1] "ncvar_def: entering"
[1] "ncvar_def: prec= float"
[1] "ncvar_def: making ncvar object for var specific_humidity"
2017-09-13 09:17:05 INFO   [PEcAn.logger::logger.info] :
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_rain_2001_v1.nc4
[1] "ncvar_def: entering"
[1] "ncvar_def: prec= float"
[1] "ncvar_def: making ncvar object for var precipitation_flux"
2017-09-13 09:17:14 INFO   [PEcAn.logger::logger.info] :
   Downloading
   /data/Model_Output/pecan.output/dbfiles/CRUNCEP_site_2-3//CRUNCEP.2002.nc
2017-09-13 09:17:14 INFO   [PEcAn.logger::logger.info] :
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_tair_2002_v1.nc4
Error in Rsx_nc4_get_vara_double: NetCDF: file not found
Var: time  Ndims: 1   Start: 0 Count: 1460
Error in ncvar_get_inner(d$dimvarid$group_id, d$dimvarid$id, default_missval_ncdf4(),  :
  C function R_nc4_get_vara_double returned error

And if I go to the output I can see the files:

-rw-rw-r--. 1 apache test 59760 Sep 13 09:17 CRUNCEP.2001.nc
-rw-rw-r--. 1 apache test 59732 Dec  5  2016 CRUNCEP.2002.nc
-rw-rw-r--. 1 apache test 59732 Dec  5  2016 CRUNCEP.2003.nc
-rw-rw-r--. 1 apache test 59892 Dec  5  2016 CRUNCEP.2004.nc
-rw-rw-r--. 1 apache test 59732 Dec  5  2016 CRUNCEP.2005.nc
-rw-rw-r--. 1 apache test 59732 Dec  5  2016 CRUNCEP.2006.nc
serbinsh commented 7 years ago

Strange, if I try to check the next file, 2002 lwdown it does suggest it isn;t there?

> ncdf4::nc_open('https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_lwdown_2002_v1.nc4')
Error in Rsx_nc4_get_vara_double: NetCDF: file not found
Var: time  Ndims: 1   Start: 0 Count: 1460
Error in ncvar_get_inner(d$dimvarid$group_id, d$dimvarid$id, default_missval_ncdf4(),  :
  C function R_nc4_get_vara_double returned error

But on the web I can see the file?

serbinsh commented 7 years ago
screen shot 2017-09-13 at 9 23 17 am
serbinsh commented 7 years ago

Ugh! Wait now files that did work aren't workign

> ncdf4::nc_open('https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_tair_2001_v1.nc4')
Error in Rsx_nc4_get_vara_double: NetCDF: file not found
Var: time  Ndims: 1   Start: 0 Count: 1460
Error in ncvar_get_inner(d$dimvarid$group_id, d$dimvarid$id, default_missval_ncdf4(),  :
  C function R_nc4_get_vara_double returned error
serbinsh commented 7 years ago

OK, not really sure what the deal is but it seems fairly spotty

2017-09-13 09:26:18 INFO   [PEcAn.logger::logger.info] :
   Downloading
   /data/Model_Output/pecan.output/dbfiles/CRUNCEP_site_2-3//CRUNCEP.2002.nc
2017-09-13 09:26:18 INFO   [PEcAn.logger::logger.info] :
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_tair_2002_v1.nc4
[1] "ncvar_def: entering"
[1] "ncvar_def: prec= float"
[1] "ncvar_def: making ncvar object for var air_temperature"
2017-09-13 09:26:24 INFO   [PEcAn.logger::logger.info] :
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_lwdown_2002_v1.nc4
[1] "ncvar_def: entering"
[1] "ncvar_def: prec= float"
[1] "ncvar_def: making ncvar object for var surface_downwelling_longwave_flux_in_air"
2017-09-13 09:26:30 INFO   [PEcAn.logger::logger.info] :
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_press_2002_v1.nc4
Error in Rsx_nc4_get_vara_double: NetCDF: file not found
Var: press  Ndims: 3   Start: 0,161,200 Count: 1460,1,1
Error in ncvar_get_inner(ncid2use, varid2use, nc$var[[li]]$missval, addOffset,  :
  C function R_nc4_get_vara_double returned error

Do we need to add some sort of try/catch or delay? It seems like if there is a slight delay from THREDDS is crashes, perhaps?

serbinsh commented 7 years ago

Added a 3 second delay to the function:

      # This throws an error if file not found
      dap <- ncdf4::nc_open(dap_file, verbose=FALSE)
      Sys.sleep(3)

at line 101

That made it work for 2 years

-rw-rw-r--. 1 apache test 59760 Sep 13 09:36 CRUNCEP.2001.nc
-rw-rw-r--. 1 apache test 59760 Sep 13 09:37 CRUNCEP.2002.nc
serbinsh commented 7 years ago

OK, so there definitely is an issue, at least on modex, where a single year crashes but if I keep trying it will then work

[1] "ncvar_def: making ncvar object for var air_temperature"
2017-09-13 09:38:37 INFO   [PEcAn.logger::logger.info] :
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_lwdown_1999_v1.nc4
Error in Rsx_nc4_get_vara_double: NetCDF: file not found
Var: time  Ndims: 1   Start: 0 Count: 1460
Error in ncvar_get_inner(d$dimvarid$group_id, d$dimvarid$id, default_missval_ncdf4(),  :
  C function R_nc4_get_vara_double returned error
>  download.CRUNCEP(site_id=2000000003, lat.in=9.280225, lon.in=-79.975527, model=NULL, scenario=NULL, ensemble_member=NULL, overwrite=TRUE, outfolder='/data/Model_Output/p2017-09-13 09:38:46 INFO   [PEcAn.logger::logger.info] : 01/01', end_date='2002/12/31')
   Downloading
   /data/Model_Output/pecan.output/dbfiles/CRUNCEP_site_2-3//CRUNCEP.1999.nc
2017-09-13 09:38:46 INFO   [PEcAn.logger::logger.info] :
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_tair_1999_v1.nc4
Error in Rsx_nc4_get_vara_double: NetCDF: file not found
Var: time  Ndims: 1   Start: 0 Count: 1460
Error in ncvar_get_inner(d$dimvarid$group_id, d$dimvarid$id, default_missval_ncdf4(),  :
  C function R_nc4_get_vara_double returned error
>  download.CRUNCEP(site_id=2000000003, lat.in=9.280225, lon.in=-79.975527, model=NULL, scenario=NULL, ensemble_member=NULL, overwrite=TRUE, outfolder='/data/Model_Output/p2017-09-13 09:38:49 INFO   [PEcAn.logger::logger.info] : 01/01', end_date='2002/12/31')
   Downloading
   /data/Model_Output/pecan.output/dbfiles/CRUNCEP_site_2-3//CRUNCEP.1999.nc
2017-09-13 09:38:49 INFO   [PEcAn.logger::logger.info] :
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_tair_1999_v1.nc4
[1] "ncvar_def: entering"
[1] "ncvar_def: prec= float"
[1] "ncvar_def: making ncvar object for var air_temperature"
2017-09-13 09:38:59 INFO   [PEcAn.logger::logger.info] :
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_lwdown_1999_v1.nc4
serbinsh commented 7 years ago

@robkooper @mdietze @ashiklom @jsimkins2 Any thoughts on this? Run-down, if I re-run it will eventually pull the data....but in some cases it might take a few attempts....there seems to be some issue with nc_open getting a delay via thredds and then it fails saying the file isn't there, when it actually is.....perhaps that is just a modex issue due to the firewall?

But if I keep re-running it seems it will eventually get the data that it was failing on

Perhaps we had some robust error handling to try and restart a few times before ultimately failing if indeed THREDDS is down?

araiho commented 7 years ago

I was also getting this error on the VM. And when I reran from rstudio I still got the same error.

tonygardella commented 7 years ago

Just saw this. We've run into this before here: #1546 and we would just rerun it and it would download without an error, making it hard to replicate and figure out what's wrong. We thought it could be an issue specific to our machine. Good to know it's not.

I agree with creating some robust error handling.

ankurdesai commented 7 years ago

I think it has as much to do with the DOE THREDDS server either doing something like limiting max number of connections or preventing rapid number of calls to the server to prevent denial of service attacks or something. I know @jsimkins2 has had communication in the past with them about non reliable calls. I think the function isn't doing anything wrong.

One idea would be to add the sleep, detect error, and try three times before giving up?

serbinsh commented 7 years ago

@ankurdesai I agree with this assessment. I already add a Sys.sleep, which did help but I want to add a 10 retry to the function to give it enough chances to get a successful download.

araiho commented 7 years ago

@ankurdesai @serbinsh Does that mean there's a better time of day to try doing it? Like should I be downloading at night instead of in the middle of the day?

serbinsh commented 7 years ago

@araiho Not sure...I think it is an IT issue really. As Ankur mentioned, servers don't like rapid multi file requests as it can look like a DoS attack

serbinsh commented 7 years ago

What about something like:

retry <- function(expr, isError=function(x) "try-error" %in% class(x), maxErrors=5, sleep=0) {
  attempts = 0
  retval = try(eval(expr))
  while (isError(retval)) {
    attempts = attempts + 1
    if (attempts >= maxErrors) {
      msg = sprintf("retry: too many retries [[%s]]", capture.output(str(retval)))
      PEcAn.logger::logger.warn(msg)
      stop(msg)
    } else {
      msg = sprintf("retry: error in attempt %i/%i [[%s]]", attempts, maxErrors, 
                    capture.output(str(retval)))
      PEcAn.logger::logger.warn(msg)
      #warning(msg)
    }
    if (sleep > 0) Sys.sleep(sleep)
    retval = try(eval(expr))
  }
  return(retval)
}

where you call it as

dap <- retry(ncdf4::nc_open(dap_file, verbose=FALSE),maxErrors=10, sleep=2)
serbinsh commented 7 years ago

Produces something like

> retry(ncdf4::nc_open('https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_lwdown_1999_v1.Error in Rsx_nc4_get_vara_double: NetCDF: file not found
Var: time  Ndims: 1   Start: 0 Count: 1460
Error in ncvar_get_inner(d$dimvarid$group_id, d$dimvarid$id, default_missval_ncdf4(),  :
  C function R_nc4_get_vara_double returned error
2017-09-13 10:50:17 WARN   [#13: PEcAn.logger::logger.warn] :
   retry: error in attempt 1/10 [[Class 'try-error' atomic [1:1] Error in
   ncvar_get_inner(d$dimvarid$group_id, d$dimvarid$id,
   default_missval_ncdf4(), : ]] retry: error in attempt 1/10 [[ C function
   R_nc4_get_| __truncated__]] retry: error in attempt 1/10 [[ ..- attr(*,
   "condition")=List of 2]] retry: error in attempt 1/10 [[ .. ..$ message:
   chr "C function R_nc4_get_vara_double returned error"]] retry: error in
   attempt 1/10 [[ .. ..$ call : language
   ncvar_get_inner(d$dimvarid$group_id, d$dimvarid$id,
   default_missval_ncdf4(), verbose = verbose)]] retry: error in attempt
   1/10 [[ .. ..- attr(*, "class")= chr [1:3] "simpleError" "error"
   "condition"]]
File https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_lwdown_1999_v1.nc4 (NC_FORMAT_CLASSIC):

     6 variables (excluding dimension variables):
        int crs[]
            semi_major_axis: 6370997
            grid_mapping_name: latitude_longitude
            inverse_flattening: 0
        double lat_bnds[nv,lat]
            _ChunkSizes: 360
             _ChunkSizes: 2
        double lon_bnds[nv,lon]
            _ChunkSizes: 720
             _ChunkSizes: 2
        double time_bnds[nv,time]
            _ChunkSizes: 1
             _ChunkSizes: 2
        float lwdown[lon,lat,time]
            cell_methods: area:mean time:sum
            grid_mapping: crs
            long_name: Incoming_Long_Wave_Radiation
            units: W/m2
            missing_value: -999
            _FillValue: -999
            _ChunkSizes: 1
             _ChunkSizes: 360
             _ChunkSizes: 720
        int mask[lon,lat]
            _ChunkSizes: 360
             _ChunkSizes: 720

     4 dimensions:
        time  Size:1460   *** is unlimited ***
            bounds: time_bnds
            long_name: time
            units: days since 1700-01-01T00:00:00Z
            standard_name: time
            calendar: gregorian
            _ChunkSizes: 1
        lat  Size:360
            units: degrees_north
            standard_name: latitude
            bounds: lat_bnds
            _ChunkSizes: 360
        lon  Size:720
            units: degrees_east
            standard_name: longitude
            bounds: lon_bnds
            _ChunkSizes: 720
        nv  Size:2

    4 global attributes:
        model_name: STOMATE-SECHIBA-LPJ
        history: Tue Mar  1 10:57:59 2011: ncks -d time_counter,143176,144635 ncep9_lwdown_1901_2010_2D.nc cruncepV4_leap_lwdown_1999.nc
        Conventions: CF-1.4
        DODS_EXTRA.Unlimited_Dimension: time
Warning message:
In eval(expr) : restarting interrupted promise evaluation
>
serbinsh commented 7 years ago

Seems to be working

2017-09-13 11:12:28 INFO   [PEcAn.logger::logger.info] :
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_lwdown_2000_v1.nc4
Error in Rsx_nc4_get_vara_double: NetCDF: file not found
Var: time  Ndims: 1   Start: 0 Count: 1464
Error in ncvar_get_inner(d$dimvarid$group_id, d$dimvarid$id, default_missval_ncdf4(),  :
  C function R_nc4_get_vara_double returned error
In addition: Warning message:
In eval(expr) : restarting interrupted promise evaluation
2017-09-13 11:12:29 WARN   [PEcAn.logger::logger.warn] :
   retry: error in attempt 1/10 [[Class 'try-error' atomic [1:1] Error in
   ncvar_get_inner(d$dimvarid$group_id, d$dimvarid$id,
   default_missval_ncdf4(), : ]] retry: error in attempt 1/10 [[ C function
   R_nc4_get_| __truncated__]] retry: error in attempt 1/10 [[ ..- attr(*,
   "condition")=List of 2]] retry: error in attempt 1/10 [[ .. ..$ message:
   chr "C function R_nc4_get_vara_double returned error"]] retry: error in
   attempt 1/10 [[ .. ..$ call : language
   ncvar_get_inner(d$dimvarid$group_id, d$dimvarid$id,
   default_missval_ncdf4(), verbose = verbose)]] retry: error in attempt
   1/10 [[ .. ..- attr(*, "class")= chr [1:3] "simpleError" "error"
   "condition"]]

Its handling hiccups in the function and continuing to run, so far

araiho commented 7 years ago

@serbinsh where did you put that retry function?

serbinsh commented 7 years ago

Here is an example where it retry several times

In eval(expr) : restarting interrupted promise evaluation
2017-09-13 11:12:55 WARN   [PEcAn.logger::logger.warn] :
   retry: error in attempt 5/10 [[Class 'try-error' atomic [1:1] Error in
   ncvar_get_inner(d$dimvarid$group_id, d$dimvarid$id,
   default_missval_ncdf4(), : ]] retry: error in attempt 5/10 [[ C function
   R_nc4_get_| __truncated__]] retry: error in attempt 5/10 [[ ..- attr(*,
   "condition")=List of 2]] retry: error in attempt 5/10 [[ .. ..$ message:
   chr "C function R_nc4_get_vara_double returned error"]] retry: error in
   attempt 5/10 [[ .. ..$ call : language
   ncvar_get_inner(d$dimvarid$group_id, d$dimvarid$id,
   default_missval_ncdf4(), verbose = verbose)]] retry: error in attempt
   5/10 [[ .. ..- attr(*, "class")= chr [1:3] "simpleError" "error"
   "condition"]]
2017-09-13 11:13:10 INFO   [PEcAn.logger::logger.info] :
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_swdown_2000_v1.nc4
2017-09-13 11:13:19 INFO   [PEcAn.logger::logger.info] :
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_uwind_2000_v1.nc4
2017-09-13 11:13:32 INFO   [PEcAn.logger::logger.info] :
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_vwind_2000_v1.nc4
2017-09-13 11:13:43 INFO   [PEcAn.logger::logger.info] :
   https://thredds.daac.ornl.gov/thredds/dodsC/ornldaac/1220/mstmip_driver_global_hd_climate_qair_2000_v1.nc4

@araiho I put it in base/utils.R on my develop branch. I will send a PR soon for comments

ashiklom commented 5 years ago

Closing this, because I think it has been resolved by a combination @serbinsh 's retry code #1651 and by implementing an alternative, more reliable remote data access method (NetCDF subset, or <method>ncss</method> in the PEcAn XML) by me in #2153.