esa-esdl / esdl-core

ESDL Cube Generation and Access API
GNU General Public License v3.0
15 stars 6 forks source link

Aggregate in space by respecting missing data and image resizing #3

Closed forman closed 8 years ago

forman commented 9 years ago

Currently, our source data providers can only deal with equal image sizes and integer multiples. When downsampling we have to aggregate data and handle missing data correctly (see #2).

Downsampling implementation idea: First rescale the source image so that it its size becomes an integer multiple of the original size using nearest neighbor resampling (= pixel duplication). Then combine pixels by aggregation using an averaging method that respects masked arrays.

forman commented 9 years ago
hans-permana commented 9 years ago

I tried using block_reduce to downsample a numpy masked array, but it seemed like the function just ignored any masks associated with it. I tried passing both np.mean and np.ma.mean as the func parameter. Have any of you tried this?

meggart commented 9 years ago

Yes I just tried this, too and can confirm this. I looked a bit into the source of block_reduce and in this line https://github.com/scikit-image/scikit-image/blob/953fbfc539217ad4c92fcf14a9d6dfe1b4b1b793/skimage/util/shape.py#L97 which is part of the function that creates the block views for the matrix, the MaskedArray is converted to a normal Numpy Array and masked values thrown away. Try:

import numpy as np
import numpy.ma
xma=np.ma.MaskedArray(np.zeros((10,10)));
type(xma)
type(np.ascontiguousarray(xma))

One might try to modify this function to do its stride magic on the data and the mask separately. However, since this is all in python, it is probably still quite slow. Should we call external tools like cdo for re-gridding? https://www.rsmas.miami.edu/users/rajib/cdo.pdf They might be less flexible but could produce us results much faster right now.

gunbra32 commented 9 years ago

If you consider working with cdo tools, the up-to-date docimentation might be useful: https://code.zmaw.de/projects/cdo/embedded/cdo.pdf

Cheers,

Gunnar

2015-11-16 16:05 GMT+01:00 Fabian Gans notifications@github.com:

Yes I just tried this, too and can confirm this. I looked a bit into the source of block_reduce and in this line https://github.com/scikit-image/scikit-image/blob/953fbfc539217ad4c92fcf14a9d6dfe1b4b1b793/skimage/util/shape.py#L97 which is part of the function that creates the block views for the matrix, the MaskedArray is converted to a normal Numpy Array and masked values thrown away. Try:

import numpy as npimport numpy.ma xma=np.ma.MaskedArray(np.zeros((10,10)));type(xma)type(np.ascontiguousarray(xma))

One might try to modify this function to do its stride magic on the data and the mask separately. However, since this is all in python, it is probably still quite slow. Should we call external tools like cdo for re-gridding? https://www.rsmas.miami.edu/users/rajib/cdo.pdf They might be less flexible but could produce us results much faster right now.

— Reply to this email directly or view it on GitHub https://github.com/CAB-LAB/cablab-core/issues/3#issuecomment-157060745.

hans-permana commented 9 years ago

I found this code snippet to do the aggregation without ignoring the mask array.

def rebin(a, shape):
    sh = shape[0],a.shape[0]//shape[0],shape[1],a.shape[1]//shape[1]
    return a.reshape(sh).mean(-1).mean(1)

It works as expected when the scaling factor is 0.5. However, when I tried using scaling factor of 0.25, the result is not as expected. It may also be that my understanding of aggregation of data with missing value is not 100% accurate. Could you try this?

meggart commented 9 years ago

I tested it and it worked for me, also for scaling factors of 0.25. I think the main problem with this function is that it will not work for the median. It is calculating the mean for each sub-block first over the columns and then over rows, which works for mean, max, min etc. but will not work non-linear operators.
For example:

x=np.array([[0.0,1.0,1.0],[1.0,1.0,4.0],[1.0,1.0,1.0]])
rebin(x,(1,1))

but the median of these 9 number should be 1.

meggart commented 9 years ago

I did some small changes to the skimage package to make it work for masked arrays. However, I had to delete padding, so it only works if the reduce factor is divisible by the grid size. I made changes to:

skimage/measure/block.py skimage/util/shape.py

see here: https://gist.github.com/meggart/4b3a1f7f23619fe45436

Can you give it a try?

meggart commented 9 years ago

Sorry if my median example was not too clear, please let me know when I write nonsense. I did not include the new definition of the rebin function for using the median, so the full example would be:

def rebin(a, shape):
    sh = shape[0],a.shape[0]//shape[0],shape[1],a.shape[1]//shape[1]
    return np.median(np.median(a.reshape(sh),-1),1) #The equivalent to your example, but for median

x=np.array([[0.0,0.0,1.0],[0.0,0.0,1.0],[1.0,1.0,1.0]])
rebin(x,(1,1))

This returns 0 because it takes the median of each column first [0,0,1], but the median should be 1. This is why I proposed using a modified copy-paste version of the skimage block-reduce function, because it works for more reduction functions.

hans-permana commented 9 years ago

No, no problem at all. Your example was clear enough, it's just that I did not touch Cab-Lab stuff at all yesterday. And today I am back at it. I just pushed a new test class (test_aggregation.py), which is intended for recording all aggregation libraries that we have tried so far. At the moment, I have put the scikit.block_reduce and the rebin functions. Feel free to add more stuff there. I will test your new changes and see if I can add it to the class as well.

forman commented 8 years ago

Fixed by using gridtools

guziy commented 7 years ago

Hi @forman: I had a similar problem recently, just wanted to share my pandas-based way of solving it (in case anyoune is interested).

https://github.com/guziy/RPN/blob/master/src/util/agg_blocks.py#L7

I still get my masked array spoiled by .values, it is converted to nans, not sure if there is a better way to get the initial masked array from the pd.Series...

Cheers

forman commented 7 years ago

@guziy Thanks for the hint! In case you are interested: gridtools is here and it is as fast as comparable C-code (really) thanks to numba.

guziy commented 7 years ago

Thanks @forman:

Yes, mine appears to be slow, and adding @jit decorator makes it fail. I'll try to see If I can use gridtools.

Cheers

forman commented 7 years ago

@guziy Install it like so

$ conda install -c forman gridtools

or just copy the module's source code. It is not well documented yet, just the README. Feel free to fork and post PRs.

guziy commented 7 years ago

Thanks, @forman:

I think I've managed to improve the performance of my function, I have opted for skimage's approach and jitting only the block function, seems to be working ok.

https://github.com/guziy/RPN/blob/8d01a5cc56807e72d67c9d92d76923cc37f68267/src/util/agg_blocks.py#L11

Cheers