Closed · santisoler closed this issue 5 years ago
@leouieda I've been experimenting with reading Bedmap2 datasets as Dask arrays. This significantly reduces the memory consumed when fetching the dataset. For example, now I can load more than one dataset (e.g. [bed, geoid, surface, ice]) with minimal RAM requirements.
However, building the documentation still consumes a lot of memory. The memory spike happens when the gallery plot is generated. I don't know much about Dask or how to plot Dask arrays "by chunks", i.e. loading only one or a few chunks into memory at a time.
Do you have any idea how to tackle this?
I'm also wondering if we should include an example showing that, in order to perform computations involving Dask arrays, we must call the `.compute()` method.
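A minimal sketch of what such an example could show, using a plain Dask array as a stand-in for a Bedmap2 grid (the array here is synthetic, not real data):

```python
import dask.array as da

# A chunked, lazy array standing in for a Bedmap2 grid; in the package this
# would come from fetch_bedmap2 once it returns Dask-backed data.
thickness = da.ones((1000, 1000), chunks=(250, 250))

# Operations on Dask arrays are lazy: this only builds a task graph,
# nothing is loaded into memory yet.
mean_lazy = thickness.mean()

# .compute() executes the graph and returns a concrete NumPy result.
print(mean_lazy.compute())  # 1.0
```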
Lastly, I've added Dask as a dependency because the CIs were failing since the `dask` module wasn't found.
@santisoler I don't think there is a way around loading the entire dataset when plotting. What we can do is slice it to a smaller area and then plot it. Maybe just the peninsula?
I was thinking there might be a way for matplotlib to generate the plot by chunks, like a human would do it (read the value of the array at a pixel, color it, move to the next one, and so on), but I haven't found anything.
Plotting something smaller is a nice solution. What about reducing the resolution of the image and plotting the entire continent? In fact, we could generate both plots: one to show the whole dataset, and the peninsula one to show the resolution of the model.
> What about reducing the resolution of the image and plotting the entire continent?
It would be best to avoid this. Simply taking every other point is not really good practice in this case, since it might cause aliasing. We'd have to filter the dataset first, which I'm not keen on doing. But you could do it with a simple Gaussian filter if you want to go that route.
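For reference, a filter-then-decimate sketch along those lines (the grid is synthetic and the smoothing sigma is a rule-of-thumb choice, not a value from this thread):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Synthetic grid standing in for a high-resolution Bedmap2 array
grid = np.random.default_rng(0).normal(size=(1000, 1000))

factor = 4  # decimation factor
# Low-pass filter first so that frequencies above the new Nyquist limit
# don't alias, then take every `factor`-th sample in each direction.
smoothed = gaussian_filter(grid, sigma=factor / 2)
downsampled = smoothed[::factor, ::factor]
print(downsampled.shape)  # (250, 250)
```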
Ok. I don't really want to cut the grid (the plot looks very good haha), but filtering the grid is a little out of the scope of this issue.
I'm thinking that some of the original grids are given as integers, but when we read them with rasterio we get floats (64-bit floats, to be exact). What about converting these grids to integers? For some of them we don't need 64 bits. For example, for the thickness, 16-bit integers are enough: 2^16 = 65536, while the maximum thickness value is 4621.
This fix would not only make the doc build faster, but would also ease the memory consumption for every user. What do you think?
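As a back-of-the-envelope check of the savings, assuming the 1 km Bedmap2 grids of roughly 6667 × 6667 cells:

```python
import numpy as np

# Approximate per-grid memory for a ~6667 x 6667 Bedmap2 array
cells = 6667 * 6667
print(cells * np.dtype("float64").itemsize / 1e6)  # ~356 MB as float64
print(cells * np.dtype("int16").itemsize / 1e6)    # ~89 MB as int16

# int16 holds values up to 32767, comfortably above the 4621 maximum thickness
print(np.iinfo(np.int16).max)  # 32767
```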
> I'm thinking that some of the original grids are given as integers, but when we read them with rasterio we get floats (64-bit floats, to be exact).
I was just checking this, and the conversion happens on the `.where` call, since `np.nan` only works on floats and floats are 64-bit by default. There is no way of having integers for this since you can't have NaNs with integers. It would also cause some problems with divisions if users aren't careful. What we can do is use `float32` instead to cut the memory down by half (or even `float16`).
This should be mentioned in the docstring somewhere.
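The promotion is easy to demonstrate with a tiny array (the nodata value of 32767 here is just for illustration):

```python
import numpy as np
import xarray as xr

# Integer grid with an illustrative nodata value of 32767
grid = xr.DataArray(np.array([[1, 2], [3, 32767]], dtype="int64"))

# .where fills masked cells with np.nan, which promotes the data to float64
masked = grid.where(grid != 32767)
print(masked.dtype)  # float64

# Casting to float32 afterwards halves the memory footprint
print(masked.astype("float32").dtype)  # float32
```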
You are right @leouieda. After writing my last comment, I was wondering whether changing the dtype to ints could be problematic when performing mathematical operations on the dataset.
I like the `float32` approach. In fact, I'm thinking we could add an optional argument for this.
Then we can add a warning in the documentation detailing what this `dtype` optional argument means, which will make it very clear what we are doing under the hood.
> In fact I'm thinking we could add an optional argument for this.
That would be great! Whenever we set the dtype of something, it should always be an optional argument that can be controlled by the user.
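A rough sketch of what a user-controllable `dtype` argument could look like (the function name, nodata value, and signature here are all hypothetical, not the package's actual API):

```python
import numpy as np

NODATA = 32767  # hypothetical nodata value, for illustration only

def fetch_grid(raw, dtype="float64"):
    """Sketch of a fetch function with a user-controllable dtype.

    NaNs require a float dtype, so integer requests are rejected here.
    """
    dtype = np.dtype(dtype)
    if dtype.kind != "f":
        raise ValueError("dtype must be a float type to hold NaN nodata values")
    data = raw.astype(dtype)
    # Replace nodata cells with NaN, keeping the requested float dtype
    return np.where(raw == NODATA, np.nan, data)

raw = np.array([[1, 2], [3, NODATA]], dtype="int64")
print(fetch_grid(raw, dtype="float32").dtype)  # float32
```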
Ok. After re-reading this sentence:
> I was just checking this and the conversion happens on the `.where` call since `np.nan` only works on floats and floats are 64-bit by default.
I noticed that we are in fact changing the data type when we replace the `nodatavals` with NaNs.
When loading a dataset with `xr.open_rasterio` I get the original data type (e.g. `np.int64` for surface and bed, but `np.float64` for geoid). It's on the next line where we force them all to be `np.float64`.
This changes my thoughts about this. I think we should avoid converting the `nodatavals` to NaNs in order to keep the original data type.
The `nodatavals` is stored as metadata on each DataArray of the Dataset, so we could let users change it themselves if necessary.
What do you think @leouieda?
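A sketch of what that workflow could look like from the user's side (the values and the nodata number are illustrative):

```python
import numpy as np
import xarray as xr

# Grid kept in its original integer dtype, with nodata stored as metadata
grid = xr.DataArray(
    np.array([[100, 32767], [250, 300]], dtype="int64"),
    attrs={"nodatavals": (32767,)},
)

# Users who need NaNs can opt in to the float conversion explicitly
nodata = grid.attrs["nodatavals"][0]
masked = grid.where(grid != nodata)
print(grid.dtype, masked.dtype)  # int64 float64
```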
PS: I want this little PR to be merged. We could leave the discussion about the memory consumption of building the bedmap2 example for another issue.
@santisoler I agree that we should get this merged first and then consider what to do about the dtypes.
But I don't like the idea of giving users weird nodatavals instead of NaNs. We can't even plot the data properly without converting to NaNs, so I imagine everyone will always convert, which defeats the purpose of not converting in the first place.
@santisoler please feel free to merge away :+1:
Ok @leouieda! Thanks for the feedback! I'll merge and then open an issue regarding the dtypes and the memory consumption of the bedmap2 example.
Add a `chunks` argument to `fetch_bedmap2` in order to create a Dask array instead of loading the entire dataset into memory. Also, add keyword arguments to be passed to the `xarray.open_rasterio` function in case any other parameter of the array creation needs to be changed.

Fixes #44
Reminders:

- Run `make format` and `make check` to make sure the code follows the style guide.
- Add any new public functions/classes to `doc/api/index.rst`.