NOAA-OWP / DMOD

Distributed Model on Demand infrastructure for OWP's Model as a Service

Benchmark and optimize IO with forcing datasets to ensure practical job execution times #655

Open robertbartel opened 2 weeks ago

robertbartel commented 2 weeks ago

As noted in #637, job execution times can be significantly lengthened by the IO characteristics of object-store-backed DMOD datasets and certain related implementation details. Many recent practical job tests have been performed by manually bypassing object-store-backed datasets for forcings, because they currently slow performance drastically. However, all of these recent tests used CSV forcing datasets, and it is not yet entirely clear how much of the slowdown comes from the total amount of forcing data versus the number of individual files.

Analysis is first needed on how significant total data size versus file count is when dealing with object-store-backed forcing datasets. From there, we need to optimize the DMOD workflow implementation enough to achieve reasonable job execution times. It is possible this will depend on #593, though hopefully, as with #654, a reasonable optimization can be introduced without a completely new dataset type.
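
For reference, a minimal benchmark sketch along these lines, assuming the object store is MinIO-compatible and reachable via the `minio` Python client; the endpoint, bucket, and prefix names are hypothetical placeholders. It reads the same total volume of forcing data once as many small per-catchment CSV objects and once as a single merged object, which should help separate per-object overhead from raw transfer cost.

```python
import os
import time

from minio import Minio

# Hypothetical endpoint/bucket names for illustration; real values would come
# from the DMOD deployment's object store configuration.
OBJ_STORE_ENDPOINT = os.getenv("OBJ_STORE_ENDPOINT", "localhost:9000")
BUCKET = "forcing-benchmark"

client = Minio(
    OBJ_STORE_ENDPOINT,
    access_key=os.getenv("OBJ_STORE_ACCESS_KEY"),
    secret_key=os.getenv("OBJ_STORE_SECRET_KEY"),
    secure=False,
)


def read_all(prefix: str) -> tuple[int, float]:
    """Read every object under ``prefix`` and return (total_bytes, elapsed_seconds)."""
    start = time.perf_counter()
    total_bytes = 0
    for obj in client.list_objects(BUCKET, prefix=prefix, recursive=True):
        resp = client.get_object(BUCKET, obj.object_name)
        try:
            total_bytes += len(resp.read())
        finally:
            resp.close()
            resp.release_conn()
    return total_bytes, time.perf_counter() - start


if __name__ == "__main__":
    # e.g., "per-catchment-csv/" holds one small CSV per catchment, while
    # "merged/" holds the same total data as a single large object.
    for prefix in ("per-catchment-csv/", "merged/"):
        nbytes, secs = read_all(prefix)
        print(f"{prefix}: {nbytes / 1e6:.1f} MB in {secs:.2f} s "
              f"({nbytes / 1e6 / secs:.1f} MB/s)")
```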

aaraney commented 2 weeks ago

Do we have a ballpark threshold for what is considered acceptable? For example, if running the same job with the data mounted from a non-network drive took time t, then 1.5t would be considered acceptable.

robertbartel commented 2 weeks ago

The initial number I had in my head was within 2x the running time when using local data, but this likely deserves more consideration and discussion.

christophertubbs commented 2 weeks ago

Would it be more appropriate to consider wall clock time rather than relative time? Maybe a 'time-to-ctrl+c' metric. 3x the running time isn't bad at all if something currently takes 2 minutes, but it wouldn't be reasonable if it takes an hour. Going from 9 hours to 17 isn't necessarily as bad as going from 3 to 8, since the 9-to-17 case will generally require operating out of hours anyway.

Then again, it may be reasonable if it's just a matter of physics. For instance, I've been doing some AI work on my own time, and a 5x-10x slowdown is acceptable on some machines compared to others because the others may have access to CUDA. Sometimes you're just up a creek and have to accept that it's going to be a lot slower, but at what degree of "a lot slower" does the needle move from "this is inconvenient" to "this isn't even worth doing"? I have models that generate excellent audio but take 15 minutes to generate sound for 4 paragraphs, where other models produce sound that is demonstrably worse and not preferred at all but only takes 45s. If the great model took 3-4 minutes, it'd still be worth using, but 8+ is a dealbreaker.

robertbartel commented 2 weeks ago

You raise good points, @christophertubbs. I think we are compelled to lean more toward relative times because the range of possible hardware on which DMOD may run is just too broad. A desktop/workstation is probably going to run more slowly than a datacenter server, and that's before even considering clustering. But there may not be a (practical) way to do this completely scientifically, so we also shouldn't ignore wall clock times. We'll just have to consider more of the context surrounding runs when we look at that, and that is a little harder to concisely summarize and ballpark.
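
As a rough way to combine the two criteria in this thread, a sketch of an acceptance check: the 2x relative factor comes from the earlier comment, while the absolute wall-clock budget is purely illustrative and would need discussion.

```python
def run_is_acceptable(run_seconds: float,
                      local_baseline_seconds: float,
                      relative_factor: float = 2.0,
                      absolute_budget_seconds: float = 8 * 3600) -> bool:
    """Accept a run if it is within ``relative_factor`` of the local-data baseline
    and also fits inside an absolute wall-clock budget (e.g., a working day).
    Both thresholds are placeholders, not agreed-upon values."""
    within_relative = run_seconds <= relative_factor * local_baseline_seconds
    within_absolute = run_seconds <= absolute_budget_seconds
    return within_relative and within_absolute
```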

aaraney commented 2 weeks ago

@christophertubbs, yeah, I think we are thinking of what you are calling wall clock time, just at a very crude level. Ideally we want a polynomial that describes the worst-case acceptable performance as we scale up the number of catchments, the number of modules per catchment, and the simulation time (we could also consider the amount of forcing or the shape of the forcing). We would hope that we don't scale linearly, but instead that once you pay some constant price, the runtime does not significantly increase with one more file read or file write.
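
A minimal sketch of that kind of scaling check, fitting measured run times to a fixed-cost-plus-per-file model; the benchmark points below are invented placeholders, not real measurements.

```python
import numpy as np

# Hypothetical benchmark points: run times observed at different forcing file counts.
file_counts = np.array([10, 100, 500, 1000, 5000])
runtimes_s = np.array([12.0, 15.0, 30.0, 48.0, 190.0])

# Linear model: runtime ~= fixed_cost + per_file_cost * n_files.
# A large fixed cost with a small slope would support the "pay a constant price
# once" hope; a steep slope would point at per-file/object overhead.
per_file_cost, fixed_cost = np.polyfit(file_counts, runtimes_s, deg=1)
print(f"fixed cost ~{fixed_cost:.1f} s, marginal cost ~{per_file_cost * 1000:.1f} ms per file")
```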