Open ChristopherTracey opened 6 years ago
@ChristopherTracey Do you have an idea of the most time consuming steps? I assume most of the time is in the reach group jackknife procedure?
Yes, it seems to take about 15 hours on that step for 62 reach groups.
Ran it again yesterday and it took 16 hours start to finish for those 62 groups.
I honestly hadn't run an aquatic model in a long time but just had a rerun to do today. It was a much smaller training set (7 groups, only 10 presence reaches), but it seemed to only take about 30s-1min per group.
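For a rough sense of the gap, here is a back-of-the-envelope comparison of the per-group times reported above. It is only arithmetic on the numbers quoted in this thread (15 hours for 62 groups versus 30s-1min per group); the function name is made up for illustration.

```python
def minutes_per_group(total_hours, n_groups):
    """Average wall-clock minutes spent per reach group."""
    return total_hours * 60 / n_groups

# Large run reported above: ~15 hours for 62 reach groups
large_run = minutes_per_group(15, 62)

# Small run reported above: roughly 0.5 to 1 minute per group
small_run_low, small_run_high = 0.5, 1.0

print(round(large_run, 1))                   # 14.5 minutes per group
print(round(large_run / small_run_high, 1))  # slowdown vs. the 1 min estimate
print(round(large_run / small_run_low, 1))   # slowdown vs. the 30 s estimate
```

So each group in the large run is taking very roughly 15 to 30 times longer than in the small run.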
If the majority of time is in the validation loop, the differences would likely be due to:
To speed it up we could:
The partial plots are also pretty slow (~3 min each), and would take more time with more total reaches. These would also be faster if we initially sampled a subset of the background.
The PA dataset (which includes everything in the watersheds draining into PA) has 149,277 reaches, which probably explains the difference.
I was about to open up another issue to make a change which will likely affect this. In order to prevent "overprediction" into major watersheds where a species does not occur, I would like to automatically subset the EnvVars by a HUC2 (or HUC4?) watershed based on the training data. This would drastically cut down on the number of background reaches and perhaps solve some of the problem.
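A minimal sketch of what that subsetting could look like, written in Python purely for illustration (it is not the project's code). It assumes each reach carries a full HUC12 code in a hypothetical `huc12` field, and relies on HUC codes being hierarchical, so the first 2 digits give the HUC2 and the first 4 give the HUC4. The reach records and codes below are made-up examples.

```python
def subset_by_huc(background, presence, digits=2):
    """Keep only background reaches whose HUC prefix matches a presence reach.

    digits=2 subsets by HUC2, digits=4 by HUC4.
    The 'huc12' field name is an assumption for this sketch.
    """
    keep = {reach["huc12"][:digits] for reach in presence}
    return [reach for reach in background if reach["huc12"][:digits] in keep]

# Made-up example: presences only in HUC2 "02"
presence = [{"id": 1, "huc12": "020503010101"}]
background = [
    {"id": 10, "huc12": "020503010202"},  # same HUC2 and HUC4 as the presence
    {"id": 11, "huc12": "020401010101"},  # same HUC2, different HUC4
    {"id": 12, "huc12": "050100010101"},  # different HUC2
]

by_huc2 = subset_by_huc(background, presence, digits=2)
by_huc4 = subset_by_huc(background, presence, digits=4)
print([r["id"] for r in by_huc2])  # reaches 10 and 11
print([r["id"] for r in by_huc4])  # reach 10 only
```

HUC4 cuts the background harder but is more likely to exclude watersheds the species could plausibly occupy, so the choice of level is a judgment call.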
That is a bigger difference than I expected, but it still seems a bit slow given it's about 2x our dataset.
We're dealing with clipping out major watersheds post-model, but it's probably smarter to do it pre-model if we can come up with a failsafe way to automatically exclude certain HUC2/4s.
In our case, since our training data is limited to within-state boundaries instead of major watersheds, we couldn't really expand the modelling area beyond the state (e.g. to the actual major watershed boundaries). But using [major watershed boundaries clipped to state boundaries] makes sense.
Based on the changes made in #29, it's now running slightly faster (~3m per jackknife).
In (500133fd) the partial plots now sample the background (10% or 10,000 reaches, whichever is smaller) rather than using the full background set to create the plot. Tested and working in aqua_dev, improves the speed for this step without any noticeable changes to the plots.
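For anyone reading along, the sampling rule described above (10% or 10,000 reaches, whichever is smaller) can be sketched as follows. This is an illustrative Python version, not the code from the commit; the function name, the `seed` argument, and the use of integer reach IDs are assumptions.

```python
import random

def sample_background(reach_ids, fraction=0.10, cap=10_000, seed=None):
    """Randomly sample background reaches without replacement.

    Sample size is the smaller of fraction * N (rounded down) and cap.
    """
    n = min(int(len(reach_ids) * fraction), cap)
    rng = random.Random(seed)  # seeded so plots are reproducible
    return rng.sample(reach_ids, n)

# With a background the size of the PA dataset, the cap applies:
pa_sample = sample_background(list(range(149_277)), seed=1)
print(len(pa_sample))  # 10000

# With a smaller background, the 10% rule applies:
small_sample = sample_background(list(range(50_000)), seed=1)
print(len(small_sample))  # 5000
```

Fixing the seed keeps the partial plots identical between reruns, which makes it easier to confirm that sampling is not changing their shape.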
The last run of a Pennsylvania-based model took approximately 20 hours. Rumor is that Virginia models run much faster. We should figure out what's potentially going on.