Closed · santisoler closed this issue 5 years ago
@leouieda I've been experimenting with reading Bedmap2 datasets as Dask arrays. This significantly reduces the memory consumed when fetching the dataset. For example, now I can load more than one dataset (e.g. [bed, geoid, surface, ice]) with minimal RAM requirements.
However, building the documentation still consumes a lot of memory. The memory spike happens when the gallery plot is generated. I don't know much about Dask or how to plot Dask arrays "by chunks", i.e. loading only one or a few chunks into memory at a time.
Do you have any idea how to tackle this?
I'm also wondering if we should include an example showing that, in order to perform computations involving Dask arrays, we must call the `.compute()` method.
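A minimal sketch of what such an example could show, using a plain Dask array as a stand-in for a Bedmap2 grid (the array here is synthetic, not real data):

```python
import dask.array as da

# A chunked, lazy array standing in for a Bedmap2 grid; in the package this
# would come from fetch_bedmap2 once it returns Dask-backed data.
thickness = da.ones((1000, 1000), chunks=(250, 250))

# Operations on Dask arrays are lazy: this only builds a task graph,
# nothing is loaded into memory yet.
mean_lazy = thickness.mean()

# .compute() executes the graph and returns a concrete NumPy result.
print(mean_lazy.compute())  # 1.0
```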
Lastly, I've added Dask as a dependency because the CIs were failing since the `dask` module wasn't found.
@santisoler I don't think there is a way around loading the entire dataset when plotting. What we can do is slice it to a smaller area and then plot it. Maybe just the peninsula?
I was thinking there might be a way for matplotlib to generate the plot by chunks, like a human would do it (read the value of the array at a pixel, color it, move to the next one, and so on), but I haven't found anything.
Plotting something smaller is a nice solution. What about reducing the resolution of the image and plotting the entire continent? In fact, we could generate both plots: one to show the whole dataset, and the peninsula one to show the resolution of the model.
> What about reducing the resolution of the image and plotting the entire continent?
It would be best to avoid this. Simply taking every other point is not really good practice in this case, since it might cause aliasing. We'd have to filter the dataset first, which I'm not keen on doing. But you could do it with a simple Gaussian filter if you want to go that route.
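For reference, a filter-then-decimate sketch along those lines (the grid is synthetic and the smoothing sigma is a rule-of-thumb choice, not a value from this thread):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Synthetic grid standing in for a high-resolution Bedmap2 array
grid = np.random.default_rng(0).normal(size=(1000, 1000))

factor = 4  # decimation factor
# Low-pass filter first so that frequencies above the new Nyquist limit
# don't alias, then take every `factor`-th sample in each direction.
smoothed = gaussian_filter(grid, sigma=factor / 2)
downsampled = smoothed[::factor, ::factor]
print(downsampled.shape)  # (250, 250)
```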
Ok. I don't really want to cut the grid (the plot looks very good haha), but filtering the grid is a little out of the scope of this issue.
I'm thinking that some of the original grids are given as integers, but when we read them with rasterio we get floats (64-bit floats, to be exact). What about converting these grids to integers? For some of them we don't need 64 bits. For example, for the thickness, 16-bit integers are enough: 2^16 = 65536, while the maximum thickness value is 4621.
This fix would not only make the doc build faster, but would also ease the memory consumption for every user. What do you think?
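As a back-of-the-envelope check of the savings, assuming the 1 km Bedmap2 grids of roughly 6667 × 6667 cells:

```python
import numpy as np

# Approximate per-grid memory for a ~6667 x 6667 Bedmap2 array
cells = 6667 * 6667
print(cells * np.dtype("float64").itemsize / 1e6)  # ~356 MB as float64
print(cells * np.dtype("int16").itemsize / 1e6)    # ~89 MB as int16

# int16 holds values up to 32767, comfortably above the 4621 maximum thickness
print(np.iinfo(np.int16).max)  # 32767
```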
> I'm thinking that some of the original grids are given as integers, but when we read them with rasterio we get floats (64-bit floats, to be exact).
I was just checking this, and the conversion happens on the `.where` call, since `np.nan` only works on floats and floats are 64-bit by default. There is no way of having integers for this since you can't have NaNs with integers. It would also cause some problems with divisions if users aren't careful. What we can do is use `float32` instead to cut the memory down by half (or even `float16`).
This should be mentioned in the docstring somewhere.
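The promotion is easy to demonstrate with a tiny array (the nodata value of 32767 here is just for illustration):

```python
import numpy as np
import xarray as xr

# Integer grid with an illustrative nodata value of 32767
grid = xr.DataArray(np.array([[1, 2], [3, 32767]], dtype="int64"))

# .where fills masked cells with np.nan, which promotes the data to float64
masked = grid.where(grid != 32767)
print(masked.dtype)  # float64

# Casting to float32 afterwards halves the memory footprint
print(masked.astype("float32").dtype)  # float32
```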
You are right @leouieda. After writing my last comment, I was wondering whether changing the dtype to ints could be problematic when performing mathematical operations on the dataset.
I like the `float32` approach. In fact, I'm thinking we could add an optional argument for this.
Then we can add a warning in the documentation detailing what this `dtype` optional argument means, which will make it very clear what we are doing under the hood.
> In fact I'm thinking we could add an optional argument for this.
That would be great! Whenever we set the dtype of something, it should always be an optional argument that can be controlled by the user.
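A rough sketch of what a user-controllable `dtype` argument could look like (the function name, nodata value, and signature here are all hypothetical, not the package's actual API):

```python
import numpy as np

NODATA = 32767  # hypothetical nodata value, for illustration only

def fetch_grid(raw, dtype="float64"):
    """Sketch of a fetch function with a user-controllable dtype.

    NaNs require a float dtype, so integer requests are rejected here.
    """
    dtype = np.dtype(dtype)
    if dtype.kind != "f":
        raise ValueError("dtype must be a float type to hold NaN nodata values")
    data = raw.astype(dtype)
    # Replace nodata cells with NaN, keeping the requested float dtype
    return np.where(raw == NODATA, np.nan, data)

raw = np.array([[1, 2], [3, NODATA]], dtype="int64")
print(fetch_grid(raw, dtype="float32").dtype)  # float32
```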
Ok. After re-reading this sentence:
> I was just checking this and the conversion happens on the `.where` call since `np.nan` only works on floats and floats are 64-bit by default.
I noticed that we are in fact changing the data type when we replace the `nodatavals` with NaNs.
When loading a dataset with `xr.open_rasterio` I get the original data type (e.g. `np.int64` for surface and bed, but `np.float64` for geoid). It's on the next line where we force them all to be `np.float64`.
This changes my thoughts about this. I think we should avoid converting the `nodatavals` to NaNs in order to keep the original data type.
The `nodatavals` is stored as metadata on each DataArray of the Dataset, so we could let users change it themselves if necessary.
What do you think @leouieda?
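A sketch of what that workflow could look like from the user's side (the values and the nodata number are illustrative):

```python
import numpy as np
import xarray as xr

# Grid kept in its original integer dtype, with nodata stored as metadata
grid = xr.DataArray(
    np.array([[100, 32767], [250, 300]], dtype="int64"),
    attrs={"nodatavals": (32767,)},
)

# Users who need NaNs can opt in to the float conversion explicitly
nodata = grid.attrs["nodatavals"][0]
masked = grid.where(grid != nodata)
print(grid.dtype, masked.dtype)  # int64 float64
```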
PS: I want this little PR to be merged. We could leave the discussion about the memory consumption of building the bedmap2 example for another issue.
@santisoler I agree that we should get this merged first and then consider what to do about the dtypes.
But I don't like the idea of giving users weird nodatavals instead of NaNs. We can't even plot the data properly without converting to NaNs, so I imagine everyone will always convert, which defeats the purpose of not converting in the first place.
@santisoler please feel free to merge away :+1:
Ok @leouieda! Thanks for the feedback! I'll merge and then open an issue regarding the dtypes and the memory consumption of the bedmap2 example.
Add a `chunks` argument to `fetch_bedmap2` in order to create a Dask array instead of loading the entire dataset into memory. Also, add keyword arguments to be passed to the `xarray.open_rasterio` function in case any other parameter of the array creation needs to be changed.

Fixes #44
Reminders:

- Run `make format` and `make check` to make sure the code follows the style guide.
- Add any new public functions/classes to `doc/api/index.rst`.