Weiming-Hu / AnalogsEnsemble

The C++ and R packages for parallel ensemble forecasts using Analog Ensemble
https://weiming-hu.github.io/AnalogsEnsemble/
MIT License

Interface Help Following 3.2.1 release (Operational Search added) #30

Closed: lec170 closed this issue 5 years ago

lec170 commented 5 years ago

Following use of the Parallel Ensemble help page (https://weiming-hu.github.io/AnalogsEnsemble/2019/02/12/operational-search.html) and the binder documentation (https://hub.mybinder.org/user/weiming-hu-analogsensemble-bmqvhvn1/notebooks/demo-3_operational-search.ipynb):

1) Requesting help with the commands added or revised with the inclusion of the Operational Search option.
2) Suggesting a revision of the user interface to reduce duplication and streamline use.

lec170 commented 5 years ago

Update: I have been able to modify the configuration code properly, and validateConfiguration(config) now gives a positive result. However, when I try to generate the analogs, my R session begins computing and then aborts while computing the similarity matrices. A picture of the R session at the point it aborts is attached (rsession_fail_20190214).

Here is the code:

# Load libraries, source functions file
library(ncdf4); library(RAnEn)
source('~/geolab/projects/AnalogsSpatial/code/SSE_functions.R', echo=TRUE)

# Load Data
# Forecasts
nc.fcst.file <- '~/geolab_storage_V3/data/Analogs/ECMWF_Italy/ItalyFcst20180604.nc'
nc.fcst      <- nc_open(nc.fcst.file)
# Capture a small subset of the data 
fcst         <- ncvar_get(nc.fcst, 'Data',start = c(1,1,1,1), count = c(5,2499,1095,16))
# "10U","10V","2T","2DPT","MSLP"
lon <- ncvar_get(nc.fcst, "Lon")[1:dim(fcst)[2]]
lat <- ncvar_get(nc.fcst, "Lat")[1:dim(fcst)[2]]

# Convert u and v components into wind speed and wind direction 
fcst.calc   <- fcst
u.10.fcst   <- fcst[1,,,]
v.10.fcst   <- fcst[2,,,]
dir.10.fcst <- UVtoDir(u.10.fcst, v.10.fcst)
spd.10.fcst <- UVtoSpd(u.10.fcst, v.10.fcst)
# Replace
fcst.calc[1,,,] <- dir.10.fcst
fcst.calc[2,,,] <- spd.10.fcst

# "10WD","10WS","2T","2DPT","MSLP"
fcst.times <- ncvar_get(nc.fcst, "Times")  # range(as.POSIXct(times,origin="1970-01-01",tz="UTC"))
flts  <- ncvar_get(nc.fcst, "FLT", start = 1, count = 16)  # Forecast lead times

# Keep every other FLT (step of 2)
fcst.flt.to.keep <- seq(from = 1, to = dim(fcst)[4], by = 2)  # Subsetting the FLTs is optional (the program can handle missing FLTs), but it saves time and memory.
fcst.aligned     <-  fcst.calc[,,,fcst.flt.to.keep]
flts.subset      <- flts[fcst.flt.to.keep]

# Observations (Analysis Fields)
nc.analy.file <- '~/geolab_storage_V3/data/Analogs/ECMWF_Italy/ItalyAnalysis_new.nc'
nc.analysis   <- nc_open(nc.analy.file)
obsv          <- ncvar_get(nc.analysis, 'Data')
obsv.times     <- ncvar_get(nc.analysis, "Times")

# Calculate WS and WD from u and v components
obsv.calc   <- obsv
u.10.obsv   <- obsv[1,,]
v.10.obsv   <- obsv[2,,]
dir.10.obsv <- UVtoDir(u.10.obsv, v.10.obsv)
spd.10.obsv <- UVtoSpd(u.10.obsv, v.10.obsv)

obsv.calc[1,,] <- dir.10.obsv
obsv.calc[2,,] <- spd.10.obsv

# ANEN PARAMETERS
# Define variables here: 
members.size       <- 21
# Choose variable to be predicted (predictand) 
predictandParam   <- 3  # parameter 1 is WD; param 2 is WS; param 3 is Temperature
stations.ID       <- 1:2499 
weights           <- rep(1, dim(fcst.aligned)[1])     # <- c(1,1,0,0,0) #  "wdir","ws","2T","2DPT","MSLP"
verbosity         <- 5
preserve_mapping  <- TRUE 
extObs            <- FALSE 
# operational       <- TRUE  # By default this is FALSE. 
test.start <- 731
test.end   <- 1095
search.start <- 1
search.end <- 730

rm(nc.analysis,nc.fcst,u.10.obsv,v.10.obsv,dir.10.fcst,dir.10.obsv,nc.analy.file,nc.fcst.file,spd.10.fcst,spd.10.obsv)

xs <- as.numeric(lon)
ys <- as.numeric(lat)

nx <- 51; ny <- 49   # Preset for Italy dataset 
config                     <- generateConfiguration('extendedSearch')
config$observation_id      <- predictandParam
config$test_forecasts      <- fcst.aligned               
config$test_times          <- fcst.times # NEW   
config$search_forecasts    <- fcst.aligned
config$search_times        <- fcst.times
config$flts                <- flts.subset
config$search_observations <- obsv.calc  # use the analysis with WD/WS computed from u/v
config$observation_times   <- obsv.times
config$num_members         <- members.size
config$weights             <- weights
config$test_stations_x     <- xs[stations.ID]
config$test_stations_y     <- ys[stations.ID]
config$search_stations_x   <- xs[stations.ID]
config$search_stations_y   <- ys[stations.ID]
config$preserve_mapping    <- preserve_mapping
config$verbose             <- verbosity
config$extend_observations <- extObs 
# Set up test times to be compared
config$test_times_compare <- config$search_times[test.start:test.end]
config$search_times_compare <- config$search_times[search.start:search.end]  # Ignored when operational is TRUE; operational is FALSE by default.
# Could also use fcst.times here, but then validation fails with:
#   "ERROR: Test times for comparison should be a numeric vector!"
#   "Please use is.vector() and is.numeric() to check!"
# Specific to SSE 
config$preserve_search_stations  <- TRUE
config$preserve_similarity       <- TRUE
config$num_nearest               <- 8
config$max_num_search_stations   <- 10

# Validate first before using 
validateConfiguration(config)

# w/ search extension
AnEn <- generateAnalogs(config)
Weiming-Hu commented 5 years ago

Thank you Laura. I'm working on this now.

Weiming-Hu commented 5 years ago

Hi Laura. I'm debugging your code. I should be done shortly. Do you have some time to chat this afternoon?

Weiming-Hu commented 5 years ago

Hi Laura, on my Linux machine I have the following error message.

ERROR: Insufficient memory to resize similarity matrix to store 159806052000 double values!
Error in .generateAnalogs(configuration$test_forecasts, dim(configuration$test_forecasts),  : 
  std::bad_alloc

So I suspect you are running too large a domain.
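For reference, the number of values in the error message is consistent with the dimensions in the posted script (a back-of-the-envelope sketch; the dimension values are inferred from the configuration above, not from RAnEn internals):

```r
# Inferred from the script: 2499 stations, 365 test days (731:1095),
# 8 FLTs (16 subset by 2), 730 search days, 10 max search stations (SSE),
# and 3 values per similarity entry.
n.values <- 2499 * 365 * 8 * (730 * 10) * 3
n.values             # 159806052000, matching the error message
n.values * 8 / 2^30  # roughly 1190 GiB of doubles, far beyond typical RAM
```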

Weiming-Hu commented 5 years ago

Hey Laura,

Please look at this from your code:

test.start <- 731
test.end   <- 1095
...
# Your test days are too many....
config$test_times_compare <- config$search_times[test.start:test.end]
Weiming-Hu commented 5 years ago

To make this particular script run, I made two changes:

# Generate the advanced configuration if you are using 3.2.2
config                     <- generateConfiguration('extendedSearch', TRUE)

# Change the number of test days
test.end <- test.start

By the way, the resulting AnEn object will be around 3.3 GB because you save the similarity matrix, which alone is around 3 GB.
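That figure checks out against the same dimensions once the test period is reduced to a single day (again a rough sketch using values inferred from the script):

```r
# One test day instead of 365: 2499 stations x 1 test day x 8 FLTs
# x (730 search days x 10 search stations) x 3 values per entry.
n.values <- 2499 * 1 * 8 * (730 * 10) * 3
n.values * 8 / 2^30  # about 3.3 GiB of doubles
```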

lec170 commented 5 years ago

> Hi Laura. I'm debugging your code. I should be done shortly. Do you have some time to chat this afternoon?

Just seeing this. Can chat anytime tonight and am flexible this weekend.

Weiming-Hu commented 5 years ago

I'm not sure why R does not catch the memory exhaustion for you.

lec170 commented 5 years ago

Great, this helps. So I was confused about test_times_compare. I will now integrate this into my looping, as I only generate one day at a time due to memory.

How extensive is the information in the similarity matrix? Maybe I will not save it due to space. That said, I will eventually want the information for some of the analysis, and it takes time to generate a year's worth of data. Suggestions?

lec170 commented 5 years ago

Have it running. There are some vector memory issues, and I'm trying to resolve those. It seems that I simply have to decrease the verbosity.

Weiming-Hu commented 5 years ago

> Great, this helps. So I was confusing test_times_compare. I will now integrate this into my looping as I only generate one day at a time due to memory.
>
> How extensive is the information in the similarity matrix?

The similarity matrix gives you every similarity metric that has been computed, so it is rather large. Its dimensions are [# of stations, # of test days, # of FLTs, # of search days x # of search stations, 3]. As you can see, this can be even larger when you are using SSE.

> Maybe I will not save it due to space. That said, I will want the information for some of the analysis eventually and it takes time to generate a years worth of data. Suggestions?

Actually, now that you are testing one day at a time, you can also break down your spatial domain. For example, if you have 2499 stations, you can generate for 1000 stations at a time, and each generation only requires a fraction of the memory. Finally, you can aggregate the results together. If you are using SSE, you will need to consider the ghost-zone problem when subsetting your spatial domain.
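A minimal sketch of that chunking idea (the chunk size and the per-chunk config updates described in the comments are illustrative; only the splitting logic is shown):

```r
# Split the 2499 station IDs into chunks of at most 1000 stations each.
stations.ID <- 1:2499
chunk.size  <- 1000
chunks <- split(stations.ID, ceiling(seq_along(stations.ID) / chunk.size))
sapply(chunks, length)  # chunk sizes: 1000 1000 499
# For each chunk i, set config$test_stations_x/y (and the search stations)
# from xs[ids] and ys[ids] with ids <- chunks[[i]], run generateAnalogs(config),
# and aggregate the per-chunk AnEn results at the end. With SSE, remember to
# pad each chunk at its boundary to account for the ghost zone.
```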

lec170 commented 5 years ago

a) OK, so the similarity matrix information is the same as in previous AnEn builds. Previously, I could generate the AnEn IS or SSE for one day at a time (all 2499 grids) and output the similarity matrix. I want to be able to do this; otherwise, I will need to regenerate it every time I need more data, which becomes rather time consuming since it takes a little over a day to compute the analogs for a year. I will need to investigate why this gives the vector-memory-exhausted error when analogs are generated for one day at a time.

b) Yes, if need be, I can do this. Maybe you can explain what changed in the memory allocation between v3.2.1 and v3.2.2.

Weiming-Hu commented 5 years ago

Issue solved. The memory exhaustion was caused by the configuration.