Open · wildintellect opened this issue 2 months ago
@jmar0715, I took a quick scan of https://github.com/jmar0715/FireSpreadForecasting/blob/dps/model/DataGenerator.py and see a few adjustments that might help reduce memory load and improve performance.
In general, the issues that I see upon my initial pass are as follows:
- Not closing the files opened with `xr.open_dataset`. This can lead to a memory leak due to unreleased resources (because the files are not closed), which can compound memory issues when opening numerous files.
- Supplying `cache_type="blockcache"` when opening the remote files (an fsspec/s3fs open option) will likely yield a significant read performance improvement.
- Calling `ds.load()` immediately after opening a dataset. This immediately loads the entire dataset into memory. Since xarray is lazy, only the necessary data should be loaded after selecting/filtering/processing. This can significantly reduce the volume of data loaded into memory.
- Using `np.isnan`, which creates a new array of the same size as the array passed to it, so when all you want to do is see whether all values in an array are `nan`, this is an unnecessary use of memory. Instead, `DataArray.count()` avoids the extra memory consumption by returning the number of non-`nan` values in the array; a value of `0` means all values are `nan`.
- Using `np.array` to construct an array from a `DataArray`. This makes a full copy (by default) of the data in the `DataArray`, whereas reading the `values` attribute of the `DataArray` returns the underlying ndarray without copying it.
- Supplying `chunks="auto"` to `xr.open_dataset`, which will use dask arrays to parallelize things.
- Trying the `"netcdf4"` engine first and falling back to `"h5netcdf"` only if that fails. You might get a minor performance boost by swapping the order, as the `"h5netcdf"` engine may perform a bit better.

A combined sketch pulling several of these suggestions together appears at the end of this comment.

Another thing I would suggest is that you factor out your common code into another function. I see that the logic used for the `netcdf4` engine is identical to the logic for the `h5netcdf` engine. That common code should go into a separate function to eliminate duplication. Although this won't have any impact on your memory usage or performance, it will eliminate the chance of the duplicate code blocks diverging from each other.
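To make that concrete, here is a minimal sketch of how several of these pieces might fit together. The bucket path, variable name, and time window are placeholders for illustration only, not names from the actual `DataGenerator.py`:

```python
import s3fs
import xarray as xr

# Hypothetical path -- a placeholder, not the project's real data location.
URL = "s3://example-bucket/fires/fire_2020.nc"

fs = s3fs.S3FileSystem()

# Block caching reads ahead in fixed-size blocks instead of issuing many tiny
# S3 range requests; cache_type and block_size are fsspec/s3fs open() options.
with fs.open(URL, mode="rb", cache_type="blockcache", block_size=16 * 1024 * 1024) as f:
    # Context managers ensure the file handles are released, and chunks="auto"
    # keeps variables as lazy dask arrays rather than loading everything
    # eagerly the way ds.load() would.
    with xr.open_dataset(f, engine="h5netcdf", chunks="auto") as ds:
        da = ds["fire_mask"]  # hypothetical variable name

        # count() returns the number of non-NaN values without allocating the
        # boolean array np.isnan() would; 0 means every value is NaN.
        if int(da.count()) == 0:
            raise ValueError("all values are NaN")

        # Select first, then materialize only the slice that is needed. For an
        # in-memory DataArray, .values also avoids the copy np.array(da) makes.
        window = da.sel(time=slice("2020-07-01", "2020-07-31"))
        batch = window.values
```

fsspec supports other cache types as well (e.g. `"readahead"`), and the best block size depends on how the files are chunked, so it's worth benchmarking a couple of settings rather than taking this configuration as given.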
> **Additional context**
> @sujen1412 @wildintellect @jmar0715 did a debugging session. While actively monitoring DPS jobs, we noticed that file IO from S3 was very high and seemed to be the bottleneck during model training, which relied on lazy loading via xarray. We attempted to change the looping structure to load a smaller amount of data by iterating year by year, building the model up over loops. This didn't seem to work either.
@wildintellect, I suspect the high S3 I/O might be due to what I mentioned above, regarding the use of `ds.load()`:

> Calling `ds.load()` immediately after opening a dataset. This immediately loads the entire dataset into memory. Since xarray is lazy, only the necessary data should be loaded after selecting/filtering/processing. This can significantly reduce the volume of data loaded into memory.
This looks like a fun model. A few thoughts:
1. `ds.load` is gone, but it appears to me as though the same performance penalty is going to be incurred by calling `compute` on `all_data_concat` here.
2. `concat` is going to be a noop there, because there's only going to be one dataset in `all_data` at this point. In xarray, `xr.concat(<list_of_datasets>, dim=<some_dimension>)` is going to try to take all the datasets in that list and append them together on the basis of the dimension you provide it (channels in this case, though I suspect that you'd want to concatenate these things by `time`, since you're using t+1 as ground truth); see the small sketch at the end of this comment.

3a. I'm thinking it might be wise to figure out how the data is currently chunked and build batch sizes around that (assuming they're convenient!), because if we ditch the `compute` and `load` calls, you're potentially going to be paying a significant price in IO times without (as @chuckwondo mentioned above) using a more aggressive read-ahead caching strategy.

3b. <heresy> It may sound counterintuitive given the love people have for cloud-optimized datasets in general, but if the goal here is going to be reading the entirety of these netcdfs anyway, it may well be worth verifying that there's actually a lot to gain over simply downloading things as needed. Requests have overhead in both time and money, so if we're going to be reading these things in their entirety anyway, how sure are we that there are significant performance benefits to cloud-optimized reading? Don't get me wrong: it rules, but it's a much more obvious win if we're only anticipating reading bits and pieces of very large files. Of course, if we're expecting vast quantities of data that won't fit on disk in future iterations, this may well be irrelevant. </heresy>
EDIT: disregard 3b. I spun up a workspace and did some digging, and cloud optimized is 100% the way to go if a single year is more than 100GB.

I'll continue to look at this and try to come up with some more thoughts tomorrow. Perhaps we could pair up at some point and run some simple experiments/talk through strategy.
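To illustrate point 2, here is a tiny, self-contained sketch of concatenating single-timestep datasets along `time`; the variable and dimension names are made up rather than taken from the project:

```python
import numpy as np
import xarray as xr

# Three hypothetical single-timestep datasets with identical variables/coords.
def make_step(t: int) -> xr.Dataset:
    return xr.Dataset(
        {"fire_mask": (("time", "y", "x"), np.zeros((1, 4, 4)))},
        coords={"time": [np.datetime64("2020-07-01") + np.timedelta64(t, "D")]},
    )

steps = [make_step(t) for t in range(3)]

# concat appends along the dimension you name, so this yields a (3, 4, 4)
# fire_mask with time as the leading axis -- convenient when t+1 is the label.
stacked = xr.concat(steps, dim="time")
assert stacked.sizes["time"] == 3
```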
Adding a note here on some work to document strategy and thoughts that came from some fruitful discussion with @jmar0715. This function more or less lays out a path for preprocessing tiled data for optimized cloud-native CNN training: https://gist.github.com/moradology/0b7aeed018d660a718045b961f93fe91
The function above pre-bakes tiles (which I think is a reasonable path), but that also means it is going to lock in a chip/tile size. That's perhaps not ideal for a couple of reasons: 1. tile overlap is baked in, so we're getting some duplication of bits; 2. tile sizes are pre-determined, whereas it may be desirable during the testing/experimenting phase of this work to play around with different sizes.
If that's not desirable, another path would simply be this: chunk the temporal dimension one step at a time, avoid chunking the spatial dimensions altogether, and use something like RasterVision (shameless plug for some open source work my company has done) to dynamically generate chips. That would mean a bit more flexibility for the reuse of these processed datasets: https://docs.rastervision.io/en/stable/api_reference/_generated/rastervision.core.data.raster_source.xarray_source.XarraySource.html#xarraysource
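If that route looks appealing, a rough sketch of the rechunking step might look like the following; the paths, dimension names, and the choice of zarr output are assumptions for illustration, not anything settled:

```python
import xarray as xr

# Hypothetical local paths and dimension names (time/y/x).
SRC = "fires_2020.nc"
DST = "fires_2020_time_chunked.zarr"

# chunks={} opens lazily with dask, using the file's own chunking.
with xr.open_dataset(SRC, chunks={}) as ds:
    # One timestep per chunk; -1 keeps each spatial dimension as a single
    # chunk, so a chipping tool (e.g. RasterVision's XarraySource) can window
    # it at whatever chip size an experiment calls for.
    rechunked = ds.chunk({"time": 1, "y": -1, "x": -1})
    rechunked.to_zarr(DST, mode="w")
```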
As always, tradeoffs are everywhere
**Is your feature request related to a problem? Please describe.**
https://github.com/jmar0715/FireSpreadForecasting is running really slowly; even with multiple CPUs, the job is unlikely to finish in a reasonable amount of time given the amount of data and processing involved.
**Describe the solution you'd like**
Find a solution that allows the algorithm to complete within the DPS time limits, if possible. @jmar0715 is trying to get outputs to present at AGU in December.
**Additional context**
@sujen1412, @wildintellect, and @jmar0715 did a debugging session. While actively monitoring DPS jobs, we noticed that file IO from S3 was very high and seemed to be the bottleneck during model training, which relied on lazy loading via xarray. We attempted to change the looping structure to load a smaller amount of data by iterating year by year, building the model up over loops. This didn't seem to work either.
Based on this we don't think that adding a GPU would benefit much at this point. cc: @weiji14 @omshinde
The most likely approach to pursue now is to look at the input netcdf files and find a better way to read only the portion of data needed for model training.
Ideas:
cc: @chuckwondo @jsignell