ContinuumIO / earthio

Data reader utilities for machine learning on satellite imagery and Earth science data

Documentation on earthio.load_layers vs xr.open_dataset vs xr.open_mfdataset #40

Closed gbrener closed 6 years ago

gbrener commented 7 years ago

@PeterDSteinberg's discussion in PR https://github.com/ContinuumIO/xarray_filters/pull/44#discussion_r150029450:

We just have a documentation burden on how to explain load_layers vs open_dataset vs open_mfdataset and how MLDataset.load/Dataset.load are typically not called but can be used to convert from lazy to eager.

I took some notes that could be the start of documentation on file loading choices (a bit of thread drift here into documentation needs):

- I have one file, so I use load_layers for one file:
  - NetCDF
  - HDF4 / HDF5
  - GeoTiff
  - Grib
  - Questions:
    - What if my "one file" for a NetCDF is actually a URL for an OpenDAP endpoint? Is load_layers broken in this case? We should have this working as in xarray.open_dataset with a NetCDF URL.
- I have about 8 or 15 GeoTiffs where each is a separate satellite band:
  - Pass the directory of them as "filename" to load_layers, use LayerSpec to define which bands you want in that directory
- I have many files that can be loaded with xarray.open_mfdataset:
  - Just use xarray.open_mfdataset. This supports NetCDF, Grib. Not sure about HDF5?
  - Then call MLDataset(dset)

Here's the signature for open_dataset. Maybe we want to walk through how some of the arguments compare to arguments of load_layers:

open_dataset(filename_or_obj, group=None, decode_cf=True, mask_and_scale=True, decode_times=True, autoclose=False, concat_characters=True, decode_coords=True, engine=None, chunks=None, lock=None, cache=None, drop_variables=None)
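
Before comparing arguments, here is a rough sketch of the file-loading choices from the notes, written as a small decision helper. `suggest_loader` is a hypothetical name and is not part of earthio or xarray; it only returns the name of the loader the notes point to and never opens a file. The URL branch assumes the OpenDAP question is still open, so it falls back to xarray.open_dataset.

```python
from pathlib import Path
from urllib.parse import urlparse


def suggest_loader(paths):
    """Return the name of the loader suggested by the notes (hypothetical helper)."""
    # Accept a single path or a list of paths.
    if isinstance(paths, (str, Path)):
        paths = [paths]
    paths = [str(p) for p in paths]
    if not paths:
        raise ValueError("need at least one path")

    # Many files (or a glob pattern): the notes say to use open_mfdataset,
    # then wrap the result with MLDataset(dset).
    if len(paths) > 1 or "*" in paths[0]:
        return "xarray.open_mfdataset"

    # A remote NetCDF / OpenDAP URL: assumed unsupported by load_layers for now.
    if urlparse(paths[0]).scheme in ("http", "https"):
        return "xarray.open_dataset"

    # One local file, or one directory of per-band GeoTiffs: load_layers,
    # with LayerSpec objects selecting the bands in the directory case.
    return "earthio.load_layers"


print(suggest_loader("scene.nc"))
print(suggest_loader(["jan.nc", "feb.nc"]))
print(suggest_loader("https://example.org/dods/scene.nc"))
```

A list of GeoTiff band files would land in the open_mfdataset branch here, whereas the notes suggest passing their directory to load_layers instead, so real documentation would need to spell out that case.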

filename_or_obj: Same in load_layers but it may be a directory in the case of GeoTiffs for load_layers

group : This group argument for open_dataset is called layer_spec in load_layers. A list of LayerSpec objects can control which groups are loaded and the load_layers style of LayerSpec applies to all the file types, not just NetCDF.

decode_cf : We should make sure load_layers decodes according to CF conventions by default, equivalent to passing decode_cf=True to open_dataset.

mask_and_scale : This would be nice to support, but not critical. Here's the help:

If True, replace array values equal to _FillValue with NA and scale values according to the formula original_values * scale_factor + add_offset, where _FillValue, scale_factor and add_offset are taken from variable attributes (if they exist). If the _FillValue or missing_value attribute contains multiple values a warning will be issued and all array values matching one of the multiple values will be replaced by NA.
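
To make the mask_and_scale help text concrete, here is a minimal NumPy sketch of the formula it describes. It is not xarray's or earthio's implementation; the function name `mask_and_scale`, its keyword arguments, and the sample values are assumptions made up for illustration, and it handles only a single fill value.

```python
import numpy as np


def mask_and_scale(raw, fill_value=None, scale_factor=1.0, add_offset=0.0):
    """Illustrative only: replace fill_value with NaN, then apply scale and offset."""
    raw = np.asarray(raw)
    # Work in float so that NaN can represent missing values.
    data = raw.astype("float64")
    if fill_value is not None:
        # Compare against the raw (unscaled) values, as the help text describes.
        data[raw == fill_value] = np.nan
    # original_values * scale_factor + add_offset
    return data * scale_factor + add_offset


# Hypothetical packed int16 values with -9999 marking missing data.
raw = np.array([0, 100, -9999, 250], dtype="int16")
decoded = mask_and_scale(raw, fill_value=-9999, scale_factor=0.01, add_offset=273.15)
print(decoded)
```

The masking happens before scaling, so the fill value is matched against the stored integers and not against the decoded floats.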