danlooo / DGGS.jl

Discrete Global Grid System for Julia
GNU Affero General Public License v3.0
11 stars 2 forks source link

Add hierarchical spatial index #37

Closed danlooo closed 1 year ago

danlooo commented 1 year ago

This PR addresses issue #36, #28, #25, #13. This package aims to provide a data structure to allow convolutional neural networks over equal area cells.

The closed existing solution seems to be Uber H3. However, U3 uses a gnomonic projection which comes with high area distortions. On the other side, DGGRID grids solve this issue by using a Snyder Equal Area projection.

This PR aims to add an Uber H3 like index to DGGRID grids combining the advantages of both packages. Cell center points were stored in a ball tree at all resolutions. Parent cells were calculated based on nearest neighbor search using geodesical Vincenty's distance. Cell indicies were retrieved by combining the cell index of the parent with the bearing of the child center to its parent. This is the hierarchical index will be used as a primary key for the cell ids in storing and parent/neighbor retrieval.

We can not use the original Uber H3 index, because the polyhedron orientation to the earth is different in DGGRID. Furthermore, cells in DGGRID are centered arround polyhedron verticies, whereas cells in Uber H3 are centered across icosahedron face centers.

image Cells being a child of either one hexagon or pentagon cell at DGGRID resolution 3. Child cell index parts in bold are calculated based on the child cells ordered by azimuth from the child cell to its parent. Indicies were composed by concatenating the index of each parent cell at each coarser resolution (Here: DGGRID resolutions 0 bis 3).

danlooo commented 1 year ago

The current hierarchical index and U3 encode the child index as a 3bit number. This makes 8 possible values. However, there are only 7 child hexagons and only 4 child diamonds. If we use the integer representation of the bits directly as numeric indices in YAXArrays, we'd let every 8th cell empty (there is no 8th children). This is unused space in the chunks scattered evenly acrosss the datacube. We can also Just sort the cells by this binary index and then just use the row number as the id. But then we'd lose the direct encoding of parents, children / neighbors...

Adding a meta data table with bits for each resolution and row rumber as the new id would be nice. A complete DGGS can have > 50*10^9 cells (i.e. ISEA4H). We need 12 bytes to encode the hierarchical index. This would be 600 GB just for the index. But can be massively reduced with run length encoding and lazy cartesian products, because the data is sorted.

TLDR; Cache geo lon/lat -> cell id in a ball tree, no meta data table, and write a closed formula to go from int64 running number -> res0 digit. res1 digit . ... Need to handle pentagons as well...

danlooo commented 1 year ago

There is a closed formula to get hierarchical cell index tuple from the linear cell id in memory space:


# get hierarchical index: int to tuple at resolutions
# n is the row number of the sorted table of cartesean product of cells
function hindex(vectors, n)
    vectors_lengths = length.(vectors)
    n_sets = length(vectors)

    res = map(1:4) do set
        if set == n_sets
            n_reps = 1
        else
            n_reps = *(vectors_lengths[(set+1):n_sets]...)
        end
        pos = n |> n -> n - 1 |> n -> (n / n_reps) % vectors_lengths[set] |> floor |> x -> x + 1 |> Int

        vectors[set][pos]
    end
    res
end

vectors = [
    [1, 2, 3, 4], # resolution 3
    [1, 2, 3], # resolution 2
    [1, 2], # resolution 1
    [1] # resolution 0 (base cells)
]

hindex(vectors, 5)
hindex.(Ref(vectors), 1:24)

Indeed, (1,3,1,1) is the 5th tuple. Storing the binarized tuple as int instead would result in 1/8 of cells being empty (3 bits vs 7 children)

danlooo commented 1 year ago

Definition:

An index is a bijective function $f: R^2 \rightarrow R, f(lat,lon)=cellID$ mapping locations x = (lon, lat) into a single dimensional memory space.

Requirements for the index:

  1. Locality: Cells nearby in geographical space should be also nearby in memory space
  2. Hierarchy: All parent cell ids should be directly encoded in the index for fast retrieval of parents and neighbors. This is trivial for rectangular grids and doable for aperture 7 hexagonal grids.
  3. Compactness Every possible address should be used. We want to avoid empty memory space.
  4. Speed: Fast conversion between Int64 memory adress and human readable address vector. E.g. 5 <=> (1,3,1,1)
  5. Unidimensional: Memory is organized in one dimensional blocks. This determines chunking (tiling in 2D), compression and thus loading speed.

Options to encode a hierarchical spatial cell index:

  1. Like H3: 3 bits per resolution as in Uber H3. Every 8th address does not exist and would waste space if directly mapped to memory address i.e. YAXArray index.
  2. Lazy cross product: The cell index is the nth element of the cross product of all cells at all resolutions. (See R function expand_grid)
  3. Heptagonal index Just use one base 7 number for each resolution. Only one empty slot for the pentagons.
danlooo commented 1 year ago

Start again using lookup tables