investigate speed issues

PNHP / Regional_SDM

Methods and collaboration for Species Distribution Modeling among Heritage Programs

investigate speed issues #24

Open ChristopherTracey opened 6 years ago

ChristopherTracey commented 6 years ago

The last run of a Pennsylvania-based model took approximately 20 hours. Rumor is that Virginia models run much faster. We should figure out what's going on.

dnbucklin commented 6 years ago

@ChristopherTracey Do you have an idea of the most time consuming steps? I assume most of the time is in the reach group jackknife procedure?

ChristopherTracey commented 6 years ago

Yes, it seems to take about 15 hours on that step for 62 reach groups.

ChristopherTracey commented 6 years ago

Ran it again yesterday and it took 16 hours start to finish for those 62 groups.

dnbucklin commented 6 years ago

I honestly hadn't run an aquatic model in a long time but just had a rerun to do today. It was a much smaller training set (7 groups, only 10 presence reaches), but it seemed to only take about 30s-1min per group.

If the majority of time is in the validation loop, the differences would likely be due to:

number of total reaches (I think we have about 65,000 in our dataset; PA probably has a bit more?)
more variables in your analysis(?) We usually end up with 40-50 after subsetting out correlated and unimportant variables

To speed it up we could:

sample an initial subset of background reaches instead of leaving them all in. This is probably necessary if we start doing analyses over larger areas
reduce the number of trees (1000 are used in the validation now)

The partial plots are also pretty slow (~3 min each), and would take even longer with more total reaches. These would also be faster if we initially sampled a background subset.

ChristopherTracey commented 6 years ago

The PA dataset (which includes everything in the watersheds draining into PA) has 149,277 reaches, which probably explains the difference.

I was about to open up another issue to make a change which will likely affect this. In order to prevent "overprediction" into major watersheds where a species does not occur, I would like to automatically subset the EnvVars by a HUC2 (or HUC4?) watershed based on the training data. This would drastically cut down on the number of background reaches and perhaps solve some of the problem.
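A minimal sketch of the subsetting idea, in Python rather than the repository's own code. It relies on the fact that HUC codes are hierarchical, so the first 2 digits of a longer code give the HUC2 region and the first 4 give the HUC4 subregion. The function names, the `huc12` field, and the sample reaches are all hypothetical, made up for illustration.

```python
def huc_prefixes(training_hucs, level=2):
    """Return the set of HUC prefixes (first `level` digits) that contain training reaches."""
    return {code[:level] for code in training_hucs}


def subset_background(reaches, training_hucs, level=2):
    """Keep only reaches whose HUC prefix matches a prefix seen in the training data."""
    keep = huc_prefixes(training_hucs, level)
    return [r for r in reaches if r["huc12"][:level] in keep]


# Hypothetical reaches; only the leading digits of each code matter here.
reaches = [
    {"reach_id": 1, "huc12": "020401010101"},  # HUC2 02 (Mid-Atlantic)
    {"reach_id": 2, "huc12": "020503050203"},  # HUC2 02, different HUC4
    {"reach_id": 3, "huc12": "050100010101"},  # HUC2 05 (Ohio)
    {"reach_id": 4, "huc12": "041201010101"},  # HUC2 04 (Great Lakes)
]
training_hucs = ["020401010101"]  # hypothetical: all presences fall in one watershed

by_huc2 = [r["reach_id"] for r in subset_background(reaches, training_hucs, level=2)]
by_huc4 = [r["reach_id"] for r in subset_background(reaches, training_hucs, level=4)]
print(by_huc2, by_huc4)
```

In this toy data the HUC2 subset keeps both reaches in region 02, while the HUC4 subset keeps only the reach that shares the training data's subregion, which shows the trade-off between the two levels.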

dnbucklin commented 6 years ago

That is a bigger difference than I expected, but it still seems a bit slow given it's about 2x our dataset.
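To put rough numbers on that, here is a back-of-the-envelope comparison using only the figures quoted in this thread. It is not a like-for-like benchmark: the Virginia rerun used a much smaller training set (7 groups, 10 presence reaches), and the variable names are just for illustration.

```python
# Figures quoted earlier in this thread.
pa_reaches = 149_277                 # PA dataset
va_reaches = 65_000                  # "about 65000" in the VA dataset
pa_hours = 15                        # PA jackknife step, first report
pa_groups = 62                       # PA reach groups
va_minutes_per_group = (0.5, 1.0)    # "30s-1min per group"

# How much larger the PA dataset is.
size_ratio = pa_reaches / va_reaches

# Average PA time per reach group, in minutes.
pa_minutes_per_group = pa_hours * 60 / pa_groups

# Per-group slowdown relative to the slow and fast ends of the VA range.
slowdown_low = pa_minutes_per_group / va_minutes_per_group[1]
slowdown_high = pa_minutes_per_group / va_minutes_per_group[0]

print(f"dataset size ratio: {size_ratio:.1f}x")
print(f"PA time per group: {pa_minutes_per_group:.1f} min")
print(f"per-group slowdown: {slowdown_low:.1f}x to {slowdown_high:.1f}x")
```

The dataset is about 2.3x larger, but the per-group time works out to roughly 14.5 minutes, which is about 15x to 29x the Virginia figure. That gap is much larger than the size difference alone would suggest.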

We're dealing with clipping out major watersheds post-model, but it's probably smarter to do it pre-model if we can come up with a failsafe way to automatically exclude certain HUC2/4s.

In our case, since our training data is limited to within-state boundaries instead of major watersheds, we couldn't really expand the modelling area beyond the state (e.g. to the actual major watershed boundaries). But using [major watershed boundaries clipped to state boundaries] makes sense.

ChristopherTracey commented 6 years ago

Based on the changes made in #29, it's now running slightly faster (~3 min per jackknife).

dnbucklin commented 5 years ago

In (500133fd) the partial plots now sample the background (10% or 10,000 reaches, whichever is smaller) rather than using the full background set to create the plot. This is tested and working in aqua_dev, and it improves the speed for this step without any noticeable changes to the plots.
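The sampling rule can be sketched as follows. This is a Python illustration of "10% or 10,000, whichever is smaller", not the code from the commit. The function name, the `seed` argument, and the choice to round down are assumptions.

```python
import random


def sample_background(background, fraction=0.10, cap=10_000, seed=None):
    """Randomly sample `fraction` of the background, but never more than `cap` items."""
    n = min(int(len(background) * fraction), cap)  # rounding down is an assumption
    rng = random.Random(seed)                      # seeded for reproducible plots
    return rng.sample(background, n)               # sampling without replacement


# Stand-in reach IDs sized like the two datasets mentioned in this thread.
pa_sample = sample_background(list(range(149_277)), seed=1)
va_sample = sample_background(list(range(65_000)), seed=1)
print(len(pa_sample), len(va_sample))
```

With these sizes the cap applies to the larger set (10,000 reaches), while the smaller set is sampled at 10% (6,500 reaches).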