OpenDataAnalytics / gaia

Gaia is a geospatial analysis library jointly developed by Kitware and Epidemico.

Raster to NumPy Array Support #84

Closed geordgez closed 7 years ago

geordgez commented 7 years ago

Short description

Added option to output data as NumPy array in RasterFileIO in gaia/geo/geo_inputs.py.

Example Usage

Default functionality matches the NumPy array calls in the docs/examples/gaia_processes.ipynb notebook (example below):

import numpy as np
from gaia.geo.geo_inputs import RasterFileIO

globaltemp = RasterFileIO(uri='../../tests/data/globalairtemp.tif')
temparray = np.array(globaltemp.read().GetRasterBand(1).ReadAsArray())

With the new option, the second line can be rewritten as:

temparray = globaltemp.read(as_numpy_array=True)

Parameter defaults & descriptions

read(self, as_numpy_array=False, as_single_band=True, new_nodata=None, epsg=None)

as_single_band parameter supports output of 3D NumPy array for multidimensional raster datasets (default is a 2D slice of the first raster band).
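
To illustrate the 2D versus 3D output shapes, here is a minimal NumPy-only sketch. It is not the code from this PR: `bands_to_array` is a hypothetical helper, the input is assumed to be a list of equally shaped 2D arrays (one per raster band), and the band-first axis order is an assumption.

```python
import numpy as np


def bands_to_array(bands, as_single_band=True):
    """Hypothetical helper: `bands` is a list of equally shaped 2D arrays."""
    if as_single_band:
        # 2D result: only the first band, shape (rows, cols)
        return np.asarray(bands[0])
    # 3D result: all bands stacked, shape (bands, rows, cols) -- assumed layout
    return np.stack(bands, axis=0)


red = np.zeros((2, 3))
green = np.ones((2, 3))

print(bands_to_array([red, green]).shape)                        # (2, 3)
print(bands_to_array([red, green], as_single_band=False).shape)  # (2, 2, 3)
```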

new_nodata parameter supports customization of NoData values:

mbertrand commented 7 years ago

Thanks @geordgez, looks good to me. I've added @aashish24 as a reviewer for final approval.

geordgez commented 7 years ago

@aashish24: @mbertrand and I just had a discussion about how to handle NoData values within raster band layers. How should we fill out NoData in the NumPy array output, i.e., should we use NaN, 0, the original NoData value, or some other value?

aashish24 commented 7 years ago

NoData value

I would go in this order: fill with the original NoData value; if none is found or none is given by the user, use NaN as the value. I would avoid 0, since 0 could be a valid value.
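
One possible reading of that order, as a small sketch rather than the PR's implementation. `choose_fill_value` and its parameter names are hypothetical, and the precedence of the band's own value over a user-supplied one is an assumption.

```python
import math


def choose_fill_value(band_nodata, user_nodata=None):
    """Hypothetical helper: pick the value used to mark NoData cells."""
    # Compare against None explicitly so that 0 is treated as a real value.
    if band_nodata is not None:
        return band_nodata   # original NoData value stored with the band
    if user_nodata is not None:
        return user_nodata   # value supplied by the caller (assumed second)
    return math.nan          # nothing known: fall back to NaN


print(choose_fill_value(-9999.0))    # -9999.0
print(choose_fill_value(None, 255))  # 255
print(choose_fill_value(None))       # nan
```

Note that NaN can only be stored in floating-point arrays, so an integer raster would need a cast before this fallback is usable.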

aashish24 commented 7 years ago

@geordgez @mbertrand see this issue https://github.com/OpenDataAnalytics/gaia/issues/60

geordgez commented 7 years ago

Thanks for the info @aashish24! I just checked to make sure that the most recent commit (54535f1) includes the new parameters old_nodata and new_nodata for the read function with the functionality below in each situation (8 unique cases are covered).

Is this consistent with the desired NoData functionality?

Situations:

Full function signature:

read(
  self, 
  as_numpy_array=False, 
  as_single_band=True, 
  old_nodata=None, 
  new_nodata=None, 
  epsg=None
)

aashish24 commented 7 years ago

@geordgez thanks for putting it together very well. I do not quite understand this:

(old_nodata is not None) and (new_nodata is not None) (2 cases)

How can we have two possibilities?

aashish24 commented 7 years ago

@geordgez also any idea on how we can define a structure behind conversions?

geordgez commented 7 years ago

@aashish24 Sorry, I should clarify: the 8 cases are the combinations of old_nodata, new_nodata, and srcband.GetNoDataValue() each being None or not None.

So the combination of (old_nodata is not None) and (new_nodata is not None) covers 2 cases: given that both old_nodata and new_nodata are not None, srcband.GetNoDataValue() can either be None or not None.
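
The counting can be checked with a short enumeration. This only lists the None / not None combinations; it does not reproduce what read() does in each case. `band_nodata` is a stand-in name for the result of srcband.GetNoDataValue().

```python
from itertools import product

NAMES = ("old_nodata", "new_nodata", "band_nodata")

# True means "is not None", False means "is None".
cases = [dict(zip(NAMES, flags)) for flags in product((False, True), repeat=3)]

# Cases where the caller supplied both old_nodata and new_nodata.
both_given = [c for c in cases if c["old_nodata"] and c["new_nodata"]]

print(len(cases))       # 8
print(len(both_given))  # 2
```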

In terms of defining a structure behind conversions, do you mean for going both ways between raster Tiff and NumPy?

aashish24 commented 7 years ago

In terms of defining a structure behind conversions, do you mean for going both ways between raster Tiff and NumPy?

Sorry what I meant is if we should define a base class that defines an API for conversions? Since currently we have a module level function but in the future we may need conversion to more types (pandas df for example)
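
A rough sketch of what such a base class could look like. Every name here (`Converter`, `REGISTRY`, `register`, `convert`) is hypothetical and nothing below exists in gaia; plain Python lists stand in for real raster data.

```python
from abc import ABC, abstractmethod


class Converter(ABC):
    """Hypothetical base class: one subclass per (source, target) pair."""

    source_type = None
    target_type = None

    @abstractmethod
    def convert(self, data):
        """Return `data` converted from source_type to target_type."""


class NestedListToFlat(Converter):
    """Toy converter used only to exercise the interface."""

    source_type = "nested_list"
    target_type = "flat_list"

    def convert(self, data):
        return [value for row in data for value in row]


REGISTRY = {}


def register(converter):
    # Index converters by the pair of types they translate between.
    REGISTRY[(converter.source_type, converter.target_type)] = converter


def convert(data, source_type, target_type):
    # Raises KeyError if no converter is registered for the pair.
    return REGISTRY[(source_type, target_type)].convert(data)


register(NestedListToFlat())
print(convert([[1, 2], [3, 4]], "nested_list", "flat_list"))  # [1, 2, 3, 4]
```

A new target type (a pandas DataFrame, say) would then be a new subclass plus one register() call, without touching existing converters.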

geordgez commented 7 years ago

Sorry what I meant is if we should define a base class that defines an API for conversions? Since currently we have a module level function but in the future we may need conversion to more types (pandas df for example)

Understood--I think having a conversion class may be a good idea depending on the types of possible conversions within the program. On my end, I need to familiarize myself with the file formats and conversions that exist (or that we may want in the future).

On the one hand, I think it will be helpful to consolidate all the conversions. On the other hand, I want to avoid an ambiguity where a user wants to make a standard conversion (e.g., from NumPy to Pandas) and is unsure whether to use the converter API or to use the standard Pandas function calls.

aashish24 commented 7 years ago

On the one hand, I think it will be helpful to consolidate all the conversions. On the other hand, I want to avoid an ambiguity where a user wants to make a standard conversion (e.g., from NumPy to Pandas) and is unsure whether to use the converter API or to use the standard Pandas function calls.

Sure. Did you get a chance to think more about it? I think having a converter API would be nice, since I expect we will need it quite a bit in the future.

geordgez commented 7 years ago

@aashish24 Having given it some thought, I think the converter API would be a good idea, although I'd want to verify some aspects with you and the team. My initial questions/thoughts:

geordgez commented 7 years ago

See #92 and #60

aashish24 commented 7 years ago

@geordgez will reply later today.

aashish24 commented 7 years ago

@geordgez thinking some more on this, let's get this one merged and then we can make another pass on it.

aashish24 commented 7 years ago

Which formats would the API cover? Would it mainly be the raster formats found in gaia.formats?

For now raster; in the future, we should also cover vector (for example, GeoTIFF to a vector format).

aashish24 commented 7 years ago

Based on what I've seen so far and with the functionality added by @andrenguyen-bah and @chuehlien, NumPy seems to be a "universal" intermediate format for conversion of typical image files since it can support all of the output image types. Would the API be built around intermediate stages with NumPy?

I think that should be fine. I think it is easy to convert from NumPy to other types such as pandas DF.
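
A sketch of the intermediate-format idea: each format only needs a converter to and from NumPy, and any pair is reached by chaining the two. The format names and the `TO_NUMPY`, `FROM_NUMPY`, and `convert` names are made up for illustration and are not gaia's API.

```python
import numpy as np

# One reader per format: format -> NumPy array.
TO_NUMPY = {
    "nested_list": np.asarray,
    "flat": lambda d: np.asarray(d["values"]).reshape(d["shape"]),
}

# One writer per format: NumPy array -> format.
FROM_NUMPY = {
    "nested_list": lambda a: a.tolist(),
    "flat": lambda a: {"values": a.ravel().tolist(), "shape": a.shape},
}


def convert(data, src, dst):
    """Convert between any two registered formats via a NumPy array."""
    return FROM_NUMPY[dst](TO_NUMPY[src](data))


flat = convert([[1, 2], [3, 4]], "nested_list", "flat")
print(flat)                                   # {'values': [1, 2, 3, 4], 'shape': (2, 2)}
print(convert(flat, "flat", "nested_list"))   # [[1, 2], [3, 4]]
```

With N formats this needs 2N small converters instead of one per ordered pair, at the cost of always materializing the array in memory.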

aashish24 commented 7 years ago

Sorry if this is a dumb question: are there any instances where we would be converting between Raster, Feature, or Vector classes? Or are conversions always between formats within each class?

We should be able to go between types. See here: https://docs.qgis.org/2.6/en/docs/training_manual/complete_analysis/raster_to_vector.html

aashish24 commented 7 years ago

@geordgez thoughts?

aashish24 commented 7 years ago

Also, can you squash commits please?

Thanks,

geordgez commented 7 years ago

@aashish24 Sorry for the delay, I got dragged away and likely won't be available for a few weeks--I definitely think going between types should be a capability of the API. Let me know if you need more commits squashed or any commit messages amended.

My main concern going forward is how we'd be expanding the API for objects in memory vs objects on disk and the relationship between the two object types, especially with how they interact with NumPy. Converting to raster formats through NumPy should be easy--correct me if I'm wrong--but I think we need to be careful with the process for converting to vector formats.