James-Thorson-NOAA / VAST

Spatio-temporal analysis of univariate or multivariate data, e.g., standardizing data for multiple species or stages
http://www.FishStats.org
GNU General Public License v3.0

Using fit_model in a loop without rebuilding the extrapolation grid each time? #248

Closed bchasco closed 3 years ago

bchasco commented 4 years ago

I'm using R 3.6.3, VAST 3.4.0 and FishStatsUtils 2.6.0. I recognize these are out of date, but it took me a while to get these versions working on my machine so I'm reluctant to update right now.


Each model uses exactly the same spatial data, so I'd like to avoid making the extrapolation grid each time because it is very time consuming. Is it possible to just use the fit_model wrapper with the same extrapolation grid each time while changing the RhoConfig and FieldConfig settings? I see a Save_Results argument for make_spatial_info, but I don't see how to easily map that tagged list to the fit_model wrapper. I'm happy to just look someone's code in a supplemental material if you can point me in the right direction.

Thanks, Brandon

James-Thorson-NOAA commented 4 years ago

Why is making the extrapolation-grid so slow? I assume this is for some Region = "Other" option, have you instead tried using a lower resolution extrapolation-grid, perhaps by supplying a shapefile using Region = [directory with shapefile]...?

Cole-Monnahan-NOAA commented 4 years ago

@bchasco I find that if I don't specify a working_dir in fit_model, it won't use the saved one b/c it's not in the same directory (a bug). Try that as a simple fix.
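Sketched below, Cole's workaround: pass an explicit working_dir so fit_model finds the Kmeans file saved by a previous run. The settings and dat objects are placeholders standing in for a real model, not code from this thread.

```r
## Hedged sketch of the working_dir workaround; 'settings' and 'dat'
## are placeholders for your own make_settings() output and data frame.
fit <- fit_model(settings = settings,
                 Lat_i = dat$Lat, Lon_i = dat$Lon, t_i = dat$Year,
                 b_i = dat$Catch_KG, a_i = dat$AreaSwept_km2,
                 working_dir = paste0(getwd(), "/"))  # explicit, so cached files are found
```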

bchasco commented 4 years ago

Thanks for the prompt response, Jim and Cole.

Here is my build (below). It seems simple enough, but it takes about five minutes to build the extrapolation grid after going through 100 iterations. When I run the VAST simple example for the Bering Sea "All_strata", it builds the extrapolation grid in seconds. I've also tried using a single stratum with no limits, but it still takes quite a while. Is there an INLA flag that I'm missing?

This is what happens when I use the "california_current" region with no strata limits.

############ For the UTM conversion, automatically detected zone 10.

Reducing extrapolation-grid from 45628 to 2000 cells for Region(s): california_current

Num=1 Current_Best=Inf New=448378.2 Num=2 Current_Best=448378.2 New=445883.9 ... ...

Based on the following code,

n_x <- 175
settings <- make_settings(n_x = n_x,
                          Region = "california_current",
                          purpose = "index2",
                          strata.limits = strata.limits,
                          FieldConfig = FieldConfig,
                          RhoConfig = RhoConfig,
                          # OverdispersionConfig = OverdispersionConfig,
                          ObsModel = ObsModel,
                          # knot_method = "samples",
                          bias.correct = FALSE
                          # Options = Options
                          )

fit <- tryCatch(fit_model(settings = settings,
                          Lat_i = raw$Lat,
                          Lon_i = raw$Lon,
                          t_i = raw$Year,
                          # c_i = c_iz,
                          b_i = b_i,  # number of squid captured
                          a_i = a_i,
                          # v_i = rep(0, nrow(raw)),
                          silent = TRUE,
                          Q_ik = as.matrix(Q_ik),
                          working_dir = getwd()
                          # formula = formula,
                          # covariate_data = covariate_data
                          ),
                error = function(e) NULL)

James-Thorson-NOAA commented 4 years ago

Sorry, I get what's happening now. You're using purpose = "index2", which by default uses n_g = 2000. This is a new feature designed to make the epsilon bias-correction more practical (because its run time scales more than linearly with n_g).

You could try setting n_g = Inf in make_settings(.), which then defaults to the full extrapolation-grid. This will skip the kmeans step during make_extrapolation_info, but might then make the model run slower elsewhere.

I haven't currently built in a backdoor to save time in that reduction of the extrapolation-grid. I'm happy to spend some time thinking about how to do so though.

bchasco commented 4 years ago

Thanks for catching that, Jim. No need to devote any bandwidth to it.

The warning from fit_model says something about "index2" being recommended, while "index" is retained for backward compatibility. Changing it back to "index" works, but n_g = Inf does not: the n_g argument does not appear to be implemented in FishStatsUtils 2.6.0.

For the time being, I am happy with the "index" solution. Once I update the versions of VAST and FishStatsUtils, I'm happy to revisit the issue. Feel free to close the issue if there's nothing else to add.

Thanks, Brandon

James-Thorson-NOAA commented 4 years ago

Sorry, just checked the docs and I meant max_cells in make_settings. Try max_cells = Inf, and feel free to close the issue if that's sufficient for now.
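The corrected call, sketched with the arguments from Brandon's earlier snippet (other settings omitted for brevity):

```r
## max_cells = Inf keeps the full extrapolation grid and skips the
## kmeans reduction step; make_settings is from FishStatsUtils.
settings <- make_settings(n_x = 175,
                          Region = "california_current",
                          purpose = "index2",
                          bias.correct = FALSE,
                          max_cells = Inf)
```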

bchasco commented 4 years ago

Thanks, B

Cole-Monnahan-NOAA commented 3 years ago

I was thinking about this issue a bit. From a user perspective, in simple cases it is often slower to calculate the extrapolation grid than it is to fit the model.

Why was max_cells=2000 chosen as the default? Doesn't that break backwards compatibility, where old calls would have used the full-resolution region?

Also, why do you need to run the kmeans algorithm again if max_cells is less than the number of extrapolation points? I'm surprised it's run twice, once in make_extrapolation_info and once in make_spatial_info. Aren't the same spatial cells used for the data and the extrapolation points?

James-Thorson-NOAA commented 3 years ago

There are a couple of pieces here.

But yes, it would be great to come up with more efficient ways of doing the extrapolation-grid resolution stage, which is currently super-slow!

Cole-Monnahan-NOAA commented 3 years ago

Hi Jim,

Good point on (1); I should have remembered that. Do you have it written out anywhere? The help file for make_settings doesn't include any details (and in fact is missing "index2" as an option).

(3) I was really confused by this line: https://github.com/James-Thorson-NOAA/FishStatsUtils/blob/development/R/make_extrapolation_info.R#L227 but now I see that the defaults come with a packaged extrapolation-region grid, so you don't need to recalculate it unless the user restricts the resolution via max_cells. The other piece I'm missing is where in the code you associate each extrapolation point with the knots from the model grid. Each call to kmeans returns an index within each of the two resolutions, but I can't find in the code how you link them.

James-Thorson-NOAA commented 3 years ago

> Good point at (1) I should have remembered that. Do you have it written out anywhere? The help file for make_settings doesn't include any details (and in fact is missing "Index2" as an option).

Probably not written anywhere, but IMO should be in the doxygen docs for make_settings. While you're editing that pull request, are you interested in spending a few minutes editing that doxygen docs too?

> The other piece I'm missing is where in the code do you associate each extrapolation point with the knots from the model grid? Each call to Kmeans will return an index within each of the two resolutions but I can't find in the code how you link them.

The linking is done via a projection matrix A. This is a sparse matrix which I pass to the CPP in triplet form. When fine_scale=TRUE, there are (at most) three knots associated with each extrapolation-grid point (via A_gz) or each observation (via A_iz). Those projection matrices are built, I think, in make_spatial_info and then loaded into the requested format in make_data.
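To make the triplet-form projection concrete, here is a toy, self-contained illustration using the Matrix package. This is not VAST's actual A_gz; the weights are made up, but it shows the shape of the idea: each grid point is a weighted combination of at most three knots.

```r
library(Matrix)

## Triplet form: row i (grid point), column j (knot), weight x.
i <- c(1, 1, 1, 2, 2, 3, 3, 3, 4)
j <- c(1, 2, 3, 2, 3, 1, 4, 5, 5)
x <- c(0.5, 0.3, 0.2, 0.6, 0.4, 0.2, 0.5, 0.3, 1.0)
A <- sparseMatrix(i = i, j = j, x = x, dims = c(4, 5))

omega_s <- c(1, 2, 3, 4, 5)          # a field evaluated at the 5 knots
omega_g <- as.vector(A %*% omega_s)  # projected to the 4 grid points
```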

Cole-Monnahan-NOAA commented 3 years ago

Sorry to belabor this.. I want to understand better and maybe think about how to improve this workflow.

So if fine_scale is off, none of this matters. However, with it on and max_cells=Inf, the A matrix projects all extrapolation points down onto the model grid via A_gs, calculated here: https://github.com/James-Thorson-NOAA/FishStatsUtils/blob/development/R/make_spatial_info.R#L202 where loc_g is the extrapolation points. Now, if max_cells is small enough, you downsample the extrapolation points by grouping them into nearest-neighbor cells, so there are max_cells extrapolation cells used in place of the individual points, reducing the dimensionality of A_gs from the number of points to max_cells. Basically you replace points with groups of points (cells), assuming the center location of each. Is it correct that a naive, overly simple alternative is to subsample the extrapolation points directly via something like sample(., size=max_cells)? Presumably your way keeps things more uniform or deals with cell areas better?
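A toy contrast of the two reduction strategies discussed above, using base R's kmeans on made-up points (this is illustrative, not FishStatsUtils' code):

```r
set.seed(1)
pts <- matrix(runif(2000), ncol = 2)   # stand-in for extrapolation points
max_cells <- 50

## Grouping into cells: each center represents a whole cluster of points.
cells <- stats::kmeans(pts, centers = max_cells, iter.max = 50, nstart = 5)$centers

## Naive subsample: a random subset, which ignores how many points
## each retained location actually stands for.
subs <- pts[sample(nrow(pts), max_cells), ]
```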

This NN call requires the extra call to make_kmeans, and it is what is slow and what the user will really notice when rerunning models without changing settings. The part that I don't get is that when plotting with, say, max_cells=2000, you can see the drop in resolution. So are you plotting the n_g extrapolation cells instead of the points?

If this is correct, how about adding an argument to make_kmeans that is like purpose="data" by default and optionally purpose="extrapolation". Then you can have two .Rdata files, one for the data (current, unchanged) and a new one (e.g., Kmeans_extrapolation-n_g.Rdata) which then gets tested here https://github.com/James-Thorson-NOAA/FishStatsUtils/blob/development/R/make_kmeans.R#L55 based on the input purpose. We'd have to change the save file too of course https://github.com/James-Thorson-NOAA/FishStatsUtils/blob/development/R/make_kmeans.R#L70. But that seems relatively simple to implement and would save a lot of time really.
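A minimal sketch of the purpose-keyed caching proposed above. The function name, file-name scheme, and purpose argument are hypothetical here, not FishStatsUtils' actual implementation:

```r
## Hypothetical cache wrapper: one saved file per (purpose, n) pair,
## so the data knots and the extrapolation cells get separate files.
cached_kmeans <- function(loc, n, purpose = c("data", "extrapolation"),
                          dir = getwd()) {
  purpose <- match.arg(purpose)
  fname <- file.path(dir, sprintf("Kmeans_%s-%d.RData", purpose, n))
  if (file.exists(fname)) {
    load(fname, envir = environment())  # restores 'Kmeans'
    message("Loaded from ", fname)
  } else {
    Kmeans <- stats::kmeans(loc, centers = n, nstart = 10)
    save(Kmeans, file = fname)
  }
  Kmeans
}
```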

Thoughts?

James-Thorson-NOAA commented 3 years ago

I'll think a bit; I agree that we should do something to speed up this step.

But you also asked why max_cells changes the plot resolution, which I'll answer now. The reason is that I use the raster package for plots, which takes care of screen-resolution and rotation issues when plotting a surface. But converting from the extrapolation grid to a raster requires specifying the raster resolution. This is done during plot calls and has no effect on model fitting. The argument is called 'n_cells', I think (away from laptop to double-check).
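Assuming the raster package (which VAST used for plotting at the time), the point-to-raster conversion Jim describes looks roughly like this; the data here are made up for illustration:

```r
library(raster)

## Regularly spaced points with a value at each; rasterFromXYZ infers
## the raster resolution from the point spacing.
xyz <- expand.grid(x = seq(0, 1, by = 0.1), y = seq(0, 1, by = 0.1))
xyz$z <- sin(xyz$x * pi) * cos(xyz$y * pi)
r <- rasterFromXYZ(xyz)
## plot(r)  # plot resolution is fixed here, independent of the model fit
```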


Cole-Monnahan-NOAA commented 3 years ago

OK I went ahead and implemented this change I proposed. See it here: https://github.com/James-Thorson-NOAA/FishStatsUtils/tree/streamline_kmeans_calls

The first time you run it, it calculates and saves the Kmeans file to the working directory with the name "Kmeans_extrapolation-.Rdata". Any subsequent run with the same max_cells value will then read that in, with console output like:

### Making extrapolation-grid
Using strata 1
For the UTM conversion, automatically detected zone 2.
# Reducing extrapolation-grid from 58049 to 2000 cells for Region(s): eastern_bering_sea
Loaded from c:/Users/cole.monnahan/Work/Research/VAST/extrap_test//Kmeans_extrapolation-2000.RData

### Making spatial information
Loaded from c:/Users/cole.monnahan/Work/Research/VAST/extrap_test//Kmeans-100.RData

I tested it with the following script and confirmed that MLEs are identical with max_cells = 2000 (so it calls make_kmeans) and with Inf (where it skips it). If this is something you're interested in, I can test it more thoroughly, but it appears to be working, should be backwards compatible, and only requires adding a single argument to make_kmeans (hidden from the user) and a few small tweaks in the functions that call it.

I think this would be particularly helpful for a simulation where the user could copy the Kmeans files into each replicate directory and skip this redundant step.
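The replicate-directory idea could look something like this; the directory layout and file names are hypothetical:

```r
## Copy cached Kmeans files into each replicate directory so each
## simulation fit reuses them instead of recomputing.
base_dir <- tempdir()
km_files <- list.files(base_dir, pattern = "^Kmeans.*\\.RData$",
                       full.names = TRUE)
rep_dirs <- file.path(base_dir, sprintf("rep_%02d", 1:3))
for (d in rep_dirs) {
  dir.create(d, showWarnings = FALSE)
  file.copy(km_files, d, overwrite = TRUE)
}
```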

As an aside, I would also like to clean up the console output of make_kmeans by printing fewer of the iterations. How about printing the first and last, plus maybe any that are an improvement?

### Test the new way of doing extrapolation kmeans calcs
### (skipping them)

remove.packages('FishStatsUtils')
## devtools::install("c:/Users/cole.monnahan/FishStatsUtils")
devtools::install_github('James-Thorson-NOAA/FishStatsUtils', ref='streamline_kmeans_calls')

## Set up a really simple example for testing
library(VAST)
example <- load_example( data_set="EBS_pollock" )
dat <- subset(example$sampling_data, Year==2013)
settings <- make_settings(n_x=100, Region=example$Region,
                          max_cells=2000,
                          purpose="index2", bias.correct=FALSE )
settings$FieldConfig[2, 1:2] <- 0

## Delete existing kmeans files
trash <- file.remove(list.files(getwd(), pattern='Kmeans'))
fit0 = fit_model(settings=settings,
                Lat_i=dat$Lat, Lon_i=dat$Lon, t_i=dat$Year,
                b_i=dat$Catch_KG, a_i=dat$AreaSwept_km2)
## Rerun which uses the newly produced Kmeans file for the
## extrapolation region
fit1 = fit_model(settings=settings,
                Lat_i=dat$Lat, Lon_i=dat$Lon, t_i=dat$Year,
                b_i=dat$Catch_KG, a_i=dat$AreaSwept_km2)

all.equal(fit0$ParHat, fit1$ParHat)

## If using max_cells=Inf it works like before
settings$max_cells <- Inf
fit2 = fit_model(settings=settings,
                Lat_i=dat$Lat, Lon_i=dat$Lon, t_i=dat$Year,
                b_i=dat$Catch_KG, a_i=dat$AreaSwept_km2)
## Rerun which uses the newly produced Kmeans file for the
## extrapolation region
fit3 = fit_model(settings=settings,
                Lat_i=dat$Lat, Lon_i=dat$Lon, t_i=dat$Year,
                b_i=dat$Catch_KG, a_i=dat$AreaSwept_km2)
all.equal(fit2$ParHat, fit3$ParHat)
James-Thorson-NOAA commented 3 years ago

Cole,

Thanks for looking into this! I like this approach, and really appreciate the help. A few thoughts:

If you do that PR, I'd be happy to merge it into the development branch and test it myself a bit prior to the next numeric release.

Cole-Monnahan-NOAA commented 3 years ago

OK, I renamed the files and the argument input. I also cleaned up the output and pushed to my branch. It now looks like this:

### Making extrapolation-grid
Using strata 1
For the UTM conversion, automatically detected zone 2.
# Reducing extrapolation-grid from 58049 to 2000 cells for Region(s): eastern_bering_sea
Using 100 iterations to find optimal extrapolation grid placement because no saved file found...
Iter=1: Current=1501553
Iter=10: Current=1491237 Proposed=1506299
Iter=20: Current=1491237 Proposed=1504962
Iter=30: Current=1491237 Proposed=1506741
Iter=40: Current=1491237 Proposed=1502818
Iter=50: Current=1491237 Proposed=1507757
Iter=60: Current=1491237 Proposed=1501056
Iter=70: Current=1491237 Proposed=1494576
Iter=80: Current=1491237 Proposed=1496737
Iter=90: Current=1491237 Proposed=1504365
Iter=100: Final=1491237 after 9 iterations
Results saved to c:/Users/cole.monnahan/Work/Research/VAST/extrap_test/Kmeans_extrapolation-2000.RData
 for subsequent runs by default (delete it to override)

### Making spatial information
Using 100 iterations to find optimal spatial knot placement because no saved file found...
Iter=1: Current=1698197
Iter=10: Current=1653616 Proposed=1667523
Iter=20: Current=1653616 Proposed=1669035
Iter=30: Current=1653616 Proposed=1679668
Iter=40: Current=1653616 Proposed=1722096
Iter=50: Current=1653616 Proposed=1656018
Iter=60: Current=1653616 Proposed=1673123
Iter=70: Current=1653616 Proposed=1690458
Iter=80: Current=1653616 Proposed=1685798
Iter=90: Current=1653616 Proposed=1673558
Iter=100: Final=1653616 after 2 iterations
Results saved to c:/Users/cole.monnahan/Work/Research/VAST/extrap_test/Kmeans_knots-100.RData
 for subsequent runs by default (delete it to override)

The last line tells us how many iterations it took to reach the optimum. My guess from watching the output scroll by matrix-style is that 100 is far too many. This gives a bit of feedback, but I can take it out if you don't like it. Here it's 2 and 9. This is only really an issue for very high resolution extrapolation regions, I suppose, but could be a way to further streamline (set it to 10 instead of 100?).
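Printing logic consistent with the console output above (first iteration, every 10th, and a final summary line) can be sketched like this; the objective values are simulated stand-ins for one kmeans restart each:

```r
set.seed(1)
n_iter <- 100
best <- Inf
best_iter <- 0
for (iter in seq_len(n_iter)) {
  proposed <- 1.5e6 + rexp(1, rate = 1 / 1e4)  # stand-in objective
  if (proposed < best) {
    best <- proposed
    best_iter <- iter
  }
  if (iter == 1) {
    message(sprintf("Iter=%d: Current=%.0f", iter, best))
  } else if (iter %% 10 == 0 && iter < n_iter) {
    message(sprintf("Iter=%d: Current=%.0f Proposed=%.0f",
                    iter, best, proposed))
  }
}
message(sprintf("Iter=%d: Final=%.0f after %d iterations",
                n_iter, best, best_iter))
```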

One thing I haven't fully thought through is when the extrapolation kmeans needs to be recalculated. Presumably the typical workflow is to start with fine_scale off and n_x small. Is there any combination of steps where the file should be deleted but won't be? The only thing I can think of is if the user is using their own Region and changes the resolution of the shapefile or something?