PNHP / Regional_SDM

Methods and collaboration for Species Distribution Modeling among Heritage Programs

investigate speed issues #24

Open ChristopherTracey opened 6 years ago

ChristopherTracey commented 6 years ago

The last run of a Pennsylvania-based model took approximately 20 hours. Rumor is that Virginia models run much faster. We should figure out what's going on.

dnbucklin commented 6 years ago

@ChristopherTracey Do you have an idea of which steps are the most time-consuming? I assume most of the time is spent in the reach group jackknife procedure?

ChristopherTracey commented 6 years ago

Yes, it seems to take about 15 hours on that step for 62 reach groups.
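For anyone wanting to confirm where the time goes, here is a minimal sketch of a leave-one-group-out (jackknife) loop that records wall-clock time per held-out group. It is not the code from this repository: `leave_one_group_out`, `fit`, `score`, and the toy rows are all hypothetical names and data chosen for illustration, and the real model fitting step would replace the toy `fit`.

```python
import time


def leave_one_group_out(rows, fit, score):
    """Hold out each group in turn, fit on the rest, score on the held-out group.

    rows  : list of (group_id, feature, label) tuples (hypothetical layout)
    fit   : caller-supplied function, training rows -> model
    score : caller-supplied function, (model, test rows) -> result
    Returns (results, timings), both dicts keyed by group_id.
    """
    groups = sorted({group for group, _, _ in rows})
    results, timings = {}, {}
    for held_out in groups:
        start = time.perf_counter()
        train = [r for r in rows if r[0] != held_out]
        test = [r for r in rows if r[0] == held_out]
        model = fit(train)
        results[held_out] = score(model, test)
        # Seconds spent on this fold (fit + score)
        timings[held_out] = time.perf_counter() - start
    return results, timings


# Toy stand-ins: the "model" is just the training prevalence,
# and the "score" is the number of held-out rows.
def fit(train):
    return sum(label for _, _, label in train) / len(train)


def score(model, test):
    return len(test)


rows = [("A", 1.0, 1), ("A", 2.0, 0), ("B", 3.0, 1), ("C", 4.0, 0)]
results, timings = leave_one_group_out(rows, fit, score)
for group in results:
    print(group, results[group], f"{timings[group]:.6f}s")
```

Each fold refits on nearly the full training set, so the total cost grows with both the number of groups and the number of background rows in every fit.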

ChristopherTracey commented 6 years ago

Ran it again yesterday and it took 16 hours start to finish for those 62 groups.

dnbucklin commented 6 years ago

I honestly hadn't run an aquatic model in a long time, but I had a rerun to do today. It was a much smaller training set (7 groups, only 10 presence reaches), and it seemed to take only about 30 s to 1 min per group.

If the majority of time is in the validation loop, the differences would likely be due to:

To speed it up we could:

The partial plots are also pretty slow (~3 min each), and would take more time with more total reaches. These would also be faster if we initially sampled a subset of the background.

ChristopherTracey commented 6 years ago

The PA dataset (which includes everything in the watersheds draining into PA) has 149,277 reaches, which probably explains the difference.

I was about to open up another issue to make a change which will likely affect this. In order to prevent "overprediction" into major watersheds where a species does not occur, I would like to automatically subset the EnvVars by a HUC2 (or HUC4?) watershed based on the training data. This would drastically cut down on the number of background reaches and perhaps solve some of the problem.
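A rough sketch of that idea, assuming each reach carries a 12-digit HUC code stored as a string (HUC2 and HUC4 are then its first 2 and 4 digits). The function name `subset_by_huc`, the `reach_id` and `huc12` keys, and the sample codes are hypothetical and not taken from the EnvVars table.

```python
def subset_by_huc(reaches, presence_ids, level=2):
    """Keep only reaches whose HUC prefix matches a prefix seen in the training data.

    reaches      : list of dicts with 'reach_id' and 'huc12' (string) keys
    presence_ids : set of reach_ids that have presence (training) data
    level        : number of leading HUC digits to match on (2 for HUC2, 4 for HUC4)
    """
    # HUC prefixes that contain at least one presence reach
    keep = {r["huc12"][:level] for r in reaches if r["reach_id"] in presence_ids}
    return [r for r in reaches if r["huc12"][:level] in keep]


# Made-up reaches for illustration only
reaches = [
    {"reach_id": 1, "huc12": "020503010101"},
    {"reach_id": 2, "huc12": "020402030405"},
    {"reach_id": 3, "huc12": "050100010203"},
    {"reach_id": 4, "huc12": "041201010101"},
]
presence = {1}

huc2_subset = subset_by_huc(reaches, presence, level=2)
huc4_subset = subset_by_huc(reaches, presence, level=4)
print([r["reach_id"] for r in huc2_subset])
print([r["reach_id"] for r in huc4_subset])
```

Matching on HUC4 removes more background than HUC2, at the risk of excluding neighboring subregions where the species could plausibly occur, so the choice of level is a modeling decision rather than a purely technical one.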

dnbucklin commented 6 years ago

That is a bigger difference than I expected, but it still seems a bit slow given it's about 2x our dataset.

We're currently clipping out major watersheds post-model, but it's probably smarter to do it pre-model if we can come up with a failsafe way to automatically exclude certain HUC2/4s.

In our case, since our training data is limited to within-state boundaries instead of major watersheds, we couldn't really expand the modeling area beyond the state (e.g. to the actual major watershed boundaries). But using [major watershed boundaries clipped to state boundaries] makes sense.

ChristopherTracey commented 6 years ago

Based on the changes made in #29, it's now running slightly faster (~3 min per jackknife).

dnbucklin commented 5 years ago

In (500133fd) the partial plots now sample the background (10% or 10,000 reaches, whichever is smaller) rather than using the full background set to create the plot. This is tested and working in aqua_dev, and it improves the speed of this step without any noticeable changes to the plots.
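The sampling rule can be sketched as below. This is an illustration in Python, not the code from that commit: the function names, the integer rounding (floor) of the 10% figure, and the use of a fixed seed are assumptions.

```python
import random


def background_sample_size(n_background, percent=10, cap=10_000):
    """Return the smaller of `percent`% of the background (floored) and `cap`."""
    return min(n_background * percent // 100, cap)


def sample_background(reach_ids, percent=10, cap=10_000, seed=None):
    """Draw a random sample of background reach IDs without replacement."""
    rng = random.Random(seed)
    k = background_sample_size(len(reach_ids), percent, cap)
    return rng.sample(reach_ids, k)


# With 149,277 background reaches, 10% exceeds the cap, so the cap applies.
print(background_sample_size(149_277))
# With 50,000 background reaches, 10% is below the cap.
print(background_sample_size(50_000))

# Hypothetical reach IDs: a plain range of integers
subset = sample_background(list(range(149_277)), seed=1)
print(len(subset))
```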