GlacioHack / xdem

Analysis of digital elevation models (DEMs)
https://xdem.readthedocs.io/
Apache License 2.0

[POC] : subdivision by bloc size #601

Open adebardo opened 1 month ago

adebardo commented 1 month ago

Context

The CNES QI team is interested in the possibility of performing block-based coregistration, which is why studies are currently being conducted to test this functionality, referred to as blockwise. At the moment, users must specify the number of subdivisions they wish to apply. However, the QI team would prefer to specify block sizes instead. The goal of this ticket is to propose an implementation of this capability.

We will also start managing the size of the overlap between tiles. In this ticket it will have a default value; a follow-up ticket will be opened later to manage this overlap more finely.

Proposed Code

Currently, subdivision is handled by the subdivide_array function, so we propose managing block sizes at this level.

  1. Extend Subdivision: Currently, users input a number of subdivisions. Instead, we could allow them to input a dictionary or list containing two integers for the block size in rows and columns, and then calculate the number of subdivisions from these values.
    For example, for an image of size 985x1332:

    • Scenario 1: subdivision of 64 → 8 rows / 8 columns of blocks, i.e. a block size of about 123 rows by 166 columns.
    • Scenario 2: block size of 100x100 → 10 row blocks and 14 column blocks, i.e. 10*14 = 140 blocks.
  2. Update the Following:

    • [ ] Update https://github.com/GlacioHack/xdem/blob/main/xdem/coreg/base.py#L3197 so that, based on row and column information, the function calculates the number of subdivisions when the user provides a list [row, col]; here is an example:

      def subdivide_array(self, shape: tuple[int, ...]) -> NDArrayf:
          """
          Return the grid subdivision for a given DEM shape.

          :param shape: The shape of the input DEM.
          :returns: An array of shape `shape` with `self.subdivision` unique indices.
          """
          if len(shape) == 3 and shape[0] == 1:  # Account for (1, row, col) shapes
              shape = (shape[1], shape[2])

          # If a [row, col] block size was given, derive the number of subdivisions from it
          if isinstance(self.subdivision, list):
              nb_bloc_row = np.ceil(shape[0] / self.subdivision[0])
              nb_bloc_col = np.ceil(shape[1] / self.subdivision[1])
              self.subdivision = int(nb_bloc_row * nb_bloc_col)

          return subdivide_array(shape, count=self.subdivision)
    • [ ] Update the associated docstring.

    • [ ] Copy a function that allows blocks to overlap:

    def patchify(arr, nblocks, overlap):
        # Manually split a 2D array into overlapping patches for parallel computing
        overlap = int(np.floor(overlap))
        patches = []
        nx, ny = np.shape(arr)

        # Block size along each axis (the +1 in the ranges below covers any remainder)
        nx_sub = nx // nblocks[0]
        ny_sub = ny // nblocks[1]

        # Exact (non-overlapping) extents of each block
        split = [[nx_sub * i, min(nx_sub * (i + 1), nx), ny_sub * j, min(ny_sub * (j + 1), ny)]
                 for i in range(nblocks[0] + 1) for j in range(nblocks[1] + 1)]
        # Same extents, padded by the overlap and clipped to the array bounds
        over = [[max(0, s[0] - overlap), min(nx, s[1] + overlap), max(0, s[2] - overlap), min(s[3] + overlap, ny)]
                for s in split]
        # Extract the padded patches and record, for each, the indices of the
        # exact block within its patch (needed to undo the overlap later)
        inv = []
        for k in range(len(split)):
            x0, x1, y0, y1 = split[k]
            i0, i1, j0, j1 = over[k]
            patches.append(arr[i0:i1, j0:j1])
            inv.append([x0 - i0, x1 - i0, y0 - j0, y1 - j0])

        return patches, inv, split
    • Replace the groups variable here by calling the patchify function.
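As a rough illustration of the block-size logic described above, here is a minimal standalone sketch (the helper name `blocks_from_size` is hypothetical, not existing xdem code):

```python
import numpy as np

def blocks_from_size(shape: tuple[int, int], block_size: tuple[int, int]) -> int:
    """Number of subdivisions for a DEM of `shape` split into blocks of `block_size`."""
    nb_bloc_row = int(np.ceil(shape[0] / block_size[0]))
    nb_bloc_col = int(np.ceil(shape[1] / block_size[1]))
    return nb_bloc_row * nb_bloc_col

# Scenario 2 above: a 985x1332 image with 100x100 blocks gives 10 * 14 = 140 blocks
print(blocks_from_size((985, 1332), (100, 100)))  # → 140
```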

Tests

Documentation

/ estimate 5d

adebardo commented 1 month ago

@adehecq, @rhugonnet Currently, internal work is being done to test the blockwise method (and its current bug), but we would like to add this specific functionality. What do you think? @duboise-cnes :)

rhugonnet commented 1 month ago

This is a great idea! Only being able to specify a "number of blocks" is definitely a limitation. The option of specifying a "number of blocks" could be retained just for that specific use case (immediately translating it into a size).

For the subdivision of the array: To avoid edge effects during transformations of the chunks, it is usually necessary to use an overlap. I have some old code that does this here (it does not have proper tests written, but was used a lot and checked qualitatively), in case it can be useful: https://github.com/iamdonovan/pyddem/blob/main/pyddem/fit_tools.py#L1390 (the splitter/stitcher are for subdividing without overlap, and patchify/unpatchify with overlap; all work on a 3D data cube, but should also be applicable to 2D).

The problem here is slightly different from a classic split/stitch, as we don't know in advance the horizontal shift that the coregistration will find for each chunk. So we'd need some kind of splitter (for the fit()) followed by a patchify/unpatchify (for the apply()) that adjusts the size of the overlap based on the transformation, to avoid introducing voids. (This could be a way of running apply() that would solve the current bug.)
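To make that idea concrete, the overlap required by such an adaptive patchify could be derived from the estimated shift. This is a hedged sketch only: `overlap_for_shift`, its parameters, and the safety margin are all assumptions, not existing xdem code.

```python
import numpy as np

def overlap_for_shift(shift_x: float, shift_y: float, res: float, margin: int = 1) -> int:
    """Hypothetical helper: pixels of overlap needed so that applying a
    horizontal shift (in georeferenced units) to a chunk introduces no voids
    at its edges. `res` is the pixel resolution; `margin` adds a safety border."""
    return int(np.ceil(max(abs(shift_x), abs(shift_y)) / res)) + margin

# A coregistration shift of 12.3 m on a 10 m DEM needs at least 3 pixels of overlap
print(overlap_for_shift(12.3, -4.0, res=10.0))  # → 3
```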

@erikmannerfelt could have more ideas/insights as he implemented the original functionality :slightly_smiling_face:.

rhugonnet commented 3 weeks ago

After more thinking, I'm wondering if directly using https://docs.dask.org/en/latest/generated/dask.array.overlap.map_overlap.html would make more sense, even if for now our input arrays are not chunked in-memory.

It would:

The only downside is that Dask is more complex. For instance, we likely won't be able to use map_overlap directly, because the fit() function is quite a specific operation and spits out metadata instead of another array. We'll probably have to map the chunks with overlap in a custom delayed function. But we should be able to re-use existing code for that; for instance, we implemented delayed chunking with overlap and an atypical output for interp_points here (which is fully tested): https://github.com/GlacioHack/geoutils/blob/e7538d1cd9bec79508fe1452991cbef8c5078bb9/geoutils/raster/delayed.py#L348. And several of us have accumulated enough experience with it to help! Could be worth it, what do you think? :smile:
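For reference, a minimal map_overlap sketch on an in-memory array (the per-chunk function is a placeholder; a real coregistration apply() would go there):

```python
import numpy as np
import dask.array as da

# Chunk an in-memory DEM-sized array into 100x100 blocks
dem = da.from_array(np.random.rand(985, 1332), chunks=(100, 100))

def per_chunk(block: np.ndarray) -> np.ndarray:
    # Placeholder for a per-chunk operation needing neighbouring pixels
    # (e.g. applying a shift); it sees each chunk padded by `depth` pixels.
    return block

# Each chunk is processed with a 16-pixel halo of neighbouring data; with the
# default trim=True, the halo is cut off again so the output shape is unchanged.
out = da.map_overlap(per_chunk, dem, depth=16, boundary=np.nan).compute()
print(out.shape)  # → (985, 1332)
```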

duboise-cnes commented 2 weeks ago

After a quick read, I am not sure I completely understand everything. Maybe in two phases:

I have discussed with @guillaumeeb from CNES, who has more knowledge of Dask usage than me. Do you have some insights on the Dask impact proposed by @rhugonnet? A good way to be compatible with Dask would benefit the project, I agree.

rhugonnet commented 1 week ago

An additional point I just thought of: If resolving the bug involves removing the warp_dem function, that's for the best. It is likely responsible for the artefacts seen in #584, and it would allow us to remove scikit-image as a dependency altogether (I opened #654).

Agree on the two phases. My main point was that we maybe shouldn't lose too much time implementing and testing a custom patchify function like the one exemplified in the description above, when we can use the one existing in Dask (independently of actually having Dask support in terms of memory usage/parallelization for now). So in that sense, Dask would also be slightly part of the first phase, but simply as a replacement for the patchify function (which also runs on normal arrays). Then, in a second phase, we could refine and test out-of-memory behaviour, parallelization, and other Dask parameters (for which we would already have the right code structure).