Open · wildintellect opened this issue 2 months ago
@jmar0715, I took a quick scan of https://github.com/jmar0715/FireSpreadForecasting/blob/dps/model/DataGenerator.py and see a few adjustments that might help reduce memory load and improve performance.
In general, the issues that I see upon my initial pass are as follows:
- Not closing the files opened with `xr.open_dataset`. This can lead to a memory leak due to unreleased resources (because the files are not closed), which can compound memory issues when opening numerous files.
- Supplying `cache_type="blockcache"` when opening the remote files (an fsspec/s3fs open option) will likely yield a significant read performance improvement.
- Calling `ds.load()` immediately after opening a dataset. This immediately loads the entire dataset into memory. Since xarray is lazy, only the necessary data should be loaded after selecting/filtering/processing. This can significantly reduce the volume of data loaded into memory.
- Using `np.isnan`, which creates a new array of the same size as the array passed to it, so when all you want to do is see whether all values in an array are `nan`, this is an unnecessary use of memory. Instead, `DataArray.count()` avoids the extra memory consumption by returning the number of non-`nan` values in the array; a value of `0` means all values are `nan`.
- Using `np.array` to construct an array from a `DataArray`. This makes a full copy (by default) of the data in the `DataArray`, whereas reading the `values` attribute of the `DataArray` returns the underlying ndarray without copying it.
- Supplying `chunks="auto"` to `xr.open_dataset`, which will use dask arrays to parallelize things.
- Trying the `"netcdf4"` engine first and falling back to `"h5netcdf"` only if that fails. You might get a minor performance boost by swapping the order, as the `"h5netcdf"` engine may perform a bit better.

A combined sketch pulling several of these suggestions together appears at the end of this comment.

Another thing I would suggest is that you factor out your common code into another function. I see that the logic used for the `netcdf4` engine is identical to the logic for the `h5netcdf` engine. That common code should go into a separate function to eliminate duplication. Although this won't have any impact on your memory usage or performance, it will eliminate the chance of the duplicate code blocks diverging from each other.
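To make that concrete, here is a minimal sketch of how several of these pieces might fit together. The bucket path, variable name, and time window are placeholders for illustration only, not names from the actual `DataGenerator.py`:

```python
import s3fs
import xarray as xr

# Hypothetical path -- a placeholder, not the project's real data location.
URL = "s3://example-bucket/fires/fire_2020.nc"

fs = s3fs.S3FileSystem()

# Block caching reads ahead in fixed-size blocks instead of issuing many tiny
# S3 range requests; cache_type and block_size are fsspec/s3fs open() options.
with fs.open(URL, mode="rb", cache_type="blockcache", block_size=16 * 1024 * 1024) as f:
    # Context managers ensure the file handles are released, and chunks="auto"
    # keeps variables as lazy dask arrays rather than loading everything
    # eagerly the way ds.load() would.
    with xr.open_dataset(f, engine="h5netcdf", chunks="auto") as ds:
        da = ds["fire_mask"]  # hypothetical variable name

        # count() returns the number of non-NaN values without allocating the
        # boolean array np.isnan() would; 0 means every value is NaN.
        if int(da.count()) == 0:
            raise ValueError("all values are NaN")

        # Select first, then materialize only the slice that is needed. For an
        # in-memory DataArray, .values also avoids the copy np.array(da) makes.
        window = da.sel(time=slice("2020-07-01", "2020-07-31"))
        batch = window.values
```

fsspec supports other cache types as well (e.g. `"readahead"`), and the best block size depends on how the files are chunked, so it's worth benchmarking a couple of settings rather than taking this configuration as given.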
> **Additional context**
> @sujen1412 @wildintellect @jmar0715 did a debugging session. While actively monitoring DPS jobs, we noticed that file IO from S3 was very high and seemed to be the bottleneck during model training, which relied on lazy loading via xarray. We attempted to change the looping structure to load a smaller amount of data by iterating year by year, building the model up over loops. This didn't seem to work either.
@wildintellect, I suspect the high S3 I/O might be due to what I mentioned above, regarding the use of `ds.load()`:

> Calling `ds.load()` immediately after opening a dataset. This immediately loads the entire dataset into memory. Since xarray is lazy, only the necessary data should be loaded after selecting/filtering/processing. This can significantly reduce the volume of data loaded into memory.
This looks like a fun model. A few thoughts:
1. `ds.load` is gone, but it appears to me as though the same performance penalty is going to be incurred by calling `compute` on `all_data_concat` here.
2. `concat` is going to be a noop there, because there's only going to be one dataset in `all_data` at this point. In xarray, `xr.concat(<list_of_datasets>, dim=<some_dimension>)` is going to try to take all the datasets in that list and append them together on the basis of the dimension you provide it (channels in this case, though I suspect that you'd want to concatenate these things by `time`, since you're using t+1 as ground truth); see the small sketch at the end of this comment.

3a. I'm thinking it might be wise to figure out how the data is currently chunked and build batch sizes around that (assuming they're convenient!), because if we ditch the `compute` and `load` calls, you're potentially going to be paying a significant price in IO times without (as @chuckwondo mentioned above) using a more aggressive read-ahead caching strategy.

3b. <heresy> It may sound counterintuitive given the love people have for cloud-optimized datasets in general, but if the goal here is going to be reading the entirety of these netcdfs anyway, it may well be worth verifying that there's actually a lot to gain over simply downloading things as needed. Requests have overhead in both time and money, so if we're going to be reading these things in their entirety anyway, how sure are we that there are significant performance benefits to cloud-optimized reading? Don't get me wrong: it rules, but it's a much more obvious win if we're only anticipating reading bits and pieces of very large files. Of course, if we're expecting vast quantities of data that won't fit on disk in future iterations, this may well be irrelevant. </heresy>
EDIT: disregard 3b. I spun up a workspace and did some digging, and cloud optimized is 100% the way to go if a single year is more than 100GB.

I'll continue to look at this and try to come up with some more thoughts tomorrow. Perhaps we could pair up at some point and run some simple experiments/talk through strategy.
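To illustrate point 2, here is a tiny, self-contained sketch of concatenating single-timestep datasets along `time`; the variable and dimension names are made up rather than taken from the project:

```python
import numpy as np
import xarray as xr

# Three hypothetical single-timestep datasets with identical variables/coords.
def make_step(t: int) -> xr.Dataset:
    return xr.Dataset(
        {"fire_mask": (("time", "y", "x"), np.zeros((1, 4, 4)))},
        coords={"time": [np.datetime64("2020-07-01") + np.timedelta64(t, "D")]},
    )

steps = [make_step(t) for t in range(3)]

# concat appends along the dimension you name, so this yields a (3, 4, 4)
# fire_mask with time as the leading axis -- convenient when t+1 is the label.
stacked = xr.concat(steps, dim="time")
assert stacked.sizes["time"] == 3
```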
Adding a note here on some work to document strategy and thoughts that came from some fruitful discussion with @jmar0715. This function more or less lays out a path for preprocessing tiled data for optimized cloud-native CNN training: https://gist.github.com/moradology/0b7aeed018d660a718045b961f93fe91
The function above pre-bakes tiles (which I think is a reasonable path), but that also means it is going to lock in a chip/tile size. That's perhaps not ideal for a couple of reasons: 1. tile overlap is baked in, so we're getting some duplication of bits; 2. tile sizes are pre-determined, whereas it may be desirable during the testing/experimenting phase of this work to play around with different sizes.
If that's not desirable, another path would simply be this: chunk the temporal dimension one step at a time, avoid chunking the spatial dimensions altogether, and use something like RasterVision (shameless plug for some open source work my company has done) to dynamically generate chips. That would mean a bit more flexibility for the reuse of these processed datasets: https://docs.rastervision.io/en/stable/api_reference/_generated/rastervision.core.data.raster_source.xarray_source.XarraySource.html#xarraysource
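If that route looks appealing, a rough sketch of the rechunking step might look like the following; the paths, dimension names, and the choice of zarr output are assumptions for illustration, not anything settled:

```python
import xarray as xr

# Hypothetical local paths and dimension names (time/y/x).
SRC = "fires_2020.nc"
DST = "fires_2020_time_chunked.zarr"

# chunks={} opens lazily with dask, using the file's own chunking.
with xr.open_dataset(SRC, chunks={}) as ds:
    # One timestep per chunk; -1 keeps each spatial dimension as a single
    # chunk, so a chipping tool (e.g. RasterVision's XarraySource) can window
    # it at whatever chip size an experiment calls for.
    rechunked = ds.chunk({"time": 1, "y": -1, "x": -1})
    rechunked.to_zarr(DST, mode="w")
```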
As always, tradeoffs are everywhere
**Is your feature request related to a problem? Please describe.**
https://github.com/jmar0715/FireSpreadForecasting is running really slowly; even with multiple CPUs, the job is unlikely to finish in a reasonable amount of time given the amount of data and processing involved.
**Describe the solution you'd like**
Find a solution that allows the algorithm to complete within the DPS time limits, if possible. @jmar0715 is trying to get outputs to present at AGU in December.
**Additional context**
@sujen1412, @wildintellect, and @jmar0715 did a debugging session. While actively monitoring DPS jobs, we noticed that file IO from S3 was very high and seemed to be the bottleneck during model training, which relied on lazy loading via xarray. We attempted to change the looping structure to load a smaller amount of data by iterating year by year, building the model up over loops. This didn't seem to work either.
Based on this we don't think that adding a GPU would benefit much at this point. cc: @weiji14 @omshinde
The most likely approach to pursue now is to look at the input netcdf files and find a better way to read only the portion of data needed for model training.
Ideas:
cc: @chuckwondo @jsignell