Weiming-Hu / AnalogsEnsemble

The C++ and R packages for parallel ensemble forecasts using Analog Ensemble
https://weiming-hu.github.io/AnalogsEnsemble/
MIT License
18 stars 5 forks source link

Question: Independent Search file size and IS vs SSE computing time #35

Closed lec170 closed 5 years ago

lec170 commented 5 years ago

These are just questions/observations and I am curious about your thoughts. No hurry and no worries if you don't have time to respond.

  1. Independent Search File size question: Under the new RAnEn v.3.2.5, the general file size of the Independent Search file size is around 1.9MB for analogs generated for one day whereas under earlier versions it was around 4MB for analogs generated for one day. (The AnEn IS I am using for comparison was computed under the most current version on 20190129. I think pre 3.2.x but it may be 3.1.x. ). I have compared the files and both seem like they have reasonable analog results (nothing missing, etc). Beyond the "Changelog" page or the "Issues" page, can you let me know if you changed something so that the filesize is smaller? I am just curious.

  2. Computing time: The IS took a little over an hour to generate 365 days using the IS search while it took about 14.75 hours to compute the SSE for 365 days. Does this seem appropriate to you? It seems like the code may be running faster now but I do not have numerical evidence. Just curious if you posted on this or had any thoughts to share.

Thanks!

Just in case you want to see the code used to configure/execute the AnEnIS and AnEnSSE.

# Objective:  Code to generate the analogs for various scenarios/cases. 
#             Generate AnEnIS output, AnEnSSE output, and configuration files for each.  
# Author:         Laura Clemente-Harding (laura@psu.edu) 
# Collaborators:  Guido Cervone, Weiming Hu
# Note! Weiming Hu is the author of the original RAnEn package. 

# Load libraries, source functions file
library(ncdf4); library(RAnEn)
source('~/geolab/projects/AnalogsSpatial/code/SSE_functions.R', echo=TRUE)

# Load ECMWF forecasts, analysis field, coordinates, calcuate WS and WD from U and V components
source('~/geolab/projects/AnalogsSpatial/code/SSE_loadBasicData.R', echo=TRUE)

# Generation Options 

generateAnEnSSE   <- FALSE # If TRUE, generates AnEnSSE
generateAnEnIS    <- TRUE  # If TRUE, generates AnEnIS
currentDate       <- "20190301"
# Save path and directory 
# savePath          <- "~/geolab_storage_V3/data/Analogs/AnEn-SSE/"   # 
savePath          <- "/Volumes/blackeye/geolab_storage_V3/data/Analogs/AnEn-SSE/"
saveDir           <- "operational_TEMP_20190301/"
operationalCase   <- T   # If TRUE, then it activates the operational case in the config generation files below
# operational       <- TRUE  # By default this is FALSE. 

# ANEN PARAMETERS
# Define variables here: 
members.size       <- 21
# Choose variable to be predicted (predictand) 
predictandParam   <- 3  # parameter 1 is WD; param 2 is WS; param 3 is Temperature
stations.ID       <- 1:2499 
weights           <- rep(1, dim(fcst.aligned)[1])     # <- c(1,1,0,0,0) #  "wdir","ws","2T","2DPT","MSLP"
verbosity         <- 3
preserve_mapping  <- TRUE 
extObs            <- FALSE 
test.start <- 730 # once looping, don't need to define this here
# test.end   <- 730 # once looping, don't need to define this here
search.start <- 1
search.end <- 730
AnEnGen.startDate <- test.start
AnEnGen.endDate   <- 1095 # end of the testing period, all the days to generate information for 

xs <- as.numeric(lon)
ys <- as.numeric(lat)
nx <- 51; ny <- 49   # Preset for Italy dataset 

# icounter1 <- 733 
# Generate AnEn output, one day at a time. 
for ( icounter1 in AnEnGen.startDate:AnEnGen.endDate ) {  # Now, 730; Eventually do this through to 1095 for the 3rd year. So this is training on the first year and testing on the second two years 
  print(paste("Generating AnEn for ", icounter1))
   test.start <- icounter1
  # test.end <- test.start# Generate the analogs for one day at a time. 

  # A sampling of 500 stations 
  # stations.ID <- stations_ID <- sort(sample(1:2499,500))
   if ( generateAnEnSSE ){
    config                     <- generateConfiguration('extendedSearch')
    config$observation_id      <- predictandParam
    config$forecasts           <- fcst.aligned   # changed    dim(fcst.aligned)=>  5 2499 1095    8
    config$forecast_times      <- fcst.times # NEW   
    config$flts                <- flts.subset# Want this to match with config$search_forecasts  # NEW-ISH bc search_flts is now flts 
                                  # 14Feb - So this no longer needs to be subset, 
                                  #    but we still subset it simply so it doesn't take the program as long (give it less to search through) 
    config$search_observations <- obsv  # search_observations[parameter, stations, time]
                                   # can constrain this to svae time 
    config$observation_times   <- obsv.times
    config$num_members         <- members.size
    config$weights             <- weights
    config$forecast_stations_x <- xs[stations.ID]
    config$forecast_stations_y <- ys[stations.ID]
    config$verbose             <- verbosity
    config$extend_observations <- extObs 
    # Set up test times to be compared
    config$test_times_compare   <- config$forecast_times[test.start] # One single point in time that the comparing starts from? 
    config$search_times_compare <- config$forecast_times[search.start:search.end]  # This means nothing if operational is changed to TRUE. However, operational is FALSE by default. 
    # Specific to SSE 
    config$preserve_search_stations  <- T  # Tells you waht the search stations are that you're looking into 
    config$preserve_similarity       <- T  # Tells you which statiosn were most similar 
    config$num_nearest               <- 8  # This is an option you can vary
    config$max_num_search_stations   <- 10  # Can decrease this if you want. # Can set this to be the same as number of nearest if/when using # nearest.
    # config$max_num_search_stations <- config$num_nearest 
    # config$distance <- 1

    # if ( operationalCase == TRUE ){
    #   config$operational  <- operationalCase
    #   # Additional parameters? 
    # }
    # 
    # Validate first before using 
    if ( validateConfiguration(config) == FALSE ){
      stop("Stop Program: Validation Failed ")
    }

    # w/ search extension
    AnEn <- generateAnalogs(config)

    # Save the AnEn 
    fname.SSE <- print(paste("AnEn_SSE_nn-",config$num_nearest, "_NumEns_", config$num_members,"_train_",search.start,"-",
                             search.end,"_testday_",test.start,"_op",operationalCase,sep=""))
    save(AnEn, file = print(paste(savePath,saveDir,fname.SSE,".Rdata", sep = "")))

    if ( generateAnEnSSE && test.start == 731 ){
      save(config, AnEn, file= print(paste(savePath,"config_AnEnSSE_",currentDate,"_day_",test.start,".Rdata", sep="")) )
    }
    rm(AnEn,fname.SSE)
  } # Ends if statement for AnEnSSE 

  # AnEn IS ( without search space extension )
  if ( generateAnEnIS ){
    config2                     <- generateConfiguration('independentSearch')
    config2$observation_id      <- predictandParam
    config2$forecasts           <- fcst.aligned   # changed    dim(fcst.aligned)=>  5 2499 1095    8
    config2$forecast_times      <- fcst.times # NEW   
    config2$flts                <- flts.subset# Want this to match with config$search_forecasts  # NEW-ISH bc search_flts is now flts 
    # 14Feb - So this no longer needs to be subset, 
    #    but we still subset it simply so it doesn't take the program as long (give it less to search through) 
    config2$search_observations <- obsv  # search_observations[parameter, stations, time]
    # can constrain this to svae time 
    config2$observation_times   <- obsv.times
    config2$num_members         <- members.size
    config2$weights             <- weights
    config2$forecast_stations_x <- xs[stations.ID]
    config2$forecast_stations_y <- ys[stations.ID]
    config2$verbose             <- verbosity
    config2$extend_observations <- extObs 
    # Set up test times to be compared
    config2$test_times_compare   <- config2$forecast_times[test.start]
    config2$search_times_compare <- config2$forecast_times[search.start:search.end]  # This means nothing if operational is changed to TRUE. However, operational is FALSE by default. 

    config2$preserve_mapping    <- preserve_mapping
    config2$verbose             <- verbosity

    # Validate first before using 
    validateConfiguration(config2)
    # Generate analogs 
    AnEn.ind <- generateAnalogs(config2)

    fname.nonSSE <- print(paste("AnEn_IS_NumEns_", config2$num_members,"_train_",search.start,"-",
                                search.end,"_testday_",test.start,"_op",operationalCase, sep=""))
    save(AnEn.ind, file = print(paste(savePath,saveDir,fname.nonSSE,".Rdata", sep = "")))

    # Save configuration file for whichever AnEn (IS or SSE) was generated 

    if ( generateAnEnIS && test.start == 731 ){
      save(config2, AnEn.ind, file = print(paste(savePath,"config2_AnEnIS_",currentDate,"_day_",test.start,".Rdata", sep="")) )
    }

    rm(AnEn.ind,fname.nonSSE)
  } # Ends T/F for AnEn IS generation 

} # End of anen generation calculator 
Weiming-Hu commented 5 years ago

Hi Laura,

There is no intended optimization for the computing time and the file sizes in the latest release. Although I'm not sure about the reasons for the computing time differences, you should certainly do something to check the file size difference. Please try to save results from both versions, and see whether the results are the same, and if not, whether the differences make sense to you.

Weiming-Hu commented 5 years ago

Issue closed due to inactivity.