Implement distance_to_anomaly (Closes #209)

nialov commented 9 months ago

Performance issues are a problem which can be solved by using the alternative functionality that uses gdal_proximity function instead of existing distance_computation implementation. The GDAL implementation is orders of magnitude faster.

Draft for now until I clean up the implementation.

@tomironkko made an initial draft of this function which might have been better than the current shapely distance calculation that uses the existing distance_computation one but more than likely much worse than the gdal_proximity method.

nialov commented 9 months ago

Ready for review. The main function, distance_to_anomaly, works on all platforms while distance_to_anomaly_gdal only works on Linux and MacOS. Based on initial tests I thought the GDAL utility functions would work on all platforms but found out later they are inconsistent on Windows. The main problem on Windows is the lack of osgeo_utils module in the conda environment.

Anyway, thought it best to keep the GDAL method for later reference. I just marked the corresponding test as an xfail when run on Windows. It is orders of magnitude faster and at least somewhat more memory-efficient (filtering to anomalous values is still done in-memory).

If distance_computation, which the main function relies on, requires optimization later, you should check out the original branch by @tomironkko here: https://github.com/GispoCoding/eis_toolkit/compare/master...209-add-distance-to-anomaly. The "raster" distance math approach is more efficient but it can obviously only handle raster data while the current distance_computation handles vector data as well. You could rasterize first and then use raster math. Alternatively the missing osgeo_utils could be investigated.

em-t commented 9 months ago

I did some testing with the non-gdal version using various raster files and using a randomly generated anomaly profile. Everything has looked good so far :+1: I'm wondering, though, if the function should handle nodata values in some more sophisticated way, or is it ok to assume that the user has replaced nodata values with nan for the input raster?

nialov commented 9 months ago

I did some testing with the non-gdal version using various raster files and using a randomly generated anomaly profile. Everything has looked good so far 👍 I'm wondering, though, if the function should handle nodata values in some more sophisticated way, or is it ok to assume that the user has replaced nodata values with nan for the input raster?

Yes, you are correct that if np.nan is not the true nodata value, then the nodata values will be left among the possible anomaly values. I have personally tried to advocate for a centralized approach to handling of nodata but if there are no current plans for such approach, I can implement nodata handling in this function as well (@nmaarnio)?

nialov commented 9 months ago

Ready for review, let me know about the nodata handling!

nmaarnio commented 8 months ago

Let's implement nodata handling in this tool now. However, if you have time and interest to suggest/implement a centralized approach for nodata handling across the toolkit in the upcoming weeks/months, please go ahead.

nialov commented 8 months ago

Nodata handling implemented and ready for review again. Nodata only affects the definition of anomalous values in _fits_criteria. The output will never have nodata values in it.

nialov commented 8 months ago

Thanks for the review, @nmaarnio! Ready for review again 👍

nialov commented 8 months ago

Ready for review/merging 👍 @nmaarnio, @em-t

em-t commented 8 months ago

Looks good!

nmaarnio commented 8 months ago

Merging!

GispoCoding / eis_toolkit

Implement distance_to_anomaly (Closes #209) #324