Nowosad / supercells

The goal of supercells is to utilize the concept of superpixels to a variety of spatial data.
https://jakubnowosad.com/supercells/
GNU General Public License v3.0

How to choose the k parameter? #21

Closed ManuelSpinola closed 1 year ago

ManuelSpinola commented 1 year ago

The example of New Guinea has a k = 2000. Is there any reason to choose that value?

Nowosad commented 1 year ago

Hi @ManuelSpinola, in short, there is no universally optimal way to choose k. (As a side note, the same is true for compactness.)

You just need to know that supercells() has two alternative arguments that control the resulting number of supercells. The first, k, is the number of supercells desired by the user. The second, step, is the distance, in number of cells, between the initial supercell centers (in other words, the initial size of a supercell).
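
To make the link between the two arguments concrete, here is a minimal base R sketch. It assumes the usual SLIC initialization, where centers are placed on a regular grid, so that step is roughly the square root of the number of cells per supercell. The helper names (`step_from_k`, `k_from_step`, `step_from_size`) and the raster dimensions are hypothetical, and the package's own rounding may differ slightly.

```r
# Approximate relation between k and step on a regular grid
# (assumption: standard SLIC initialization; exact rounding in the package may differ)
step_from_k = function(n_rows, n_cols, k) {
  sqrt((n_rows * n_cols) / k)
}
k_from_step = function(n_rows, n_cols, step) {
  (n_rows * n_cols) / step^2
}

# Hypothetical raster of 1000 x 2000 cells
step_from_k(1000, 2000, k = 2000)    # about 31.6 cells between initial centers
k_from_step(1000, 2000, step = 20)   # 5000 initial supercells

# step is expressed in cells, so a target size in map units
# has to be divided by the cell size first
step_from_size = function(target_size, cell_size) {
  max(1, round(target_size / cell_size))
}
step_from_size(target_size = 6000, cell_size = 300)  # 20 cells
```

With a real raster, the dimensions would come from `nrow()` and `ncol()` and the cell size from `res()` in terra.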

Let's start by reproducing the code from https://jakubnowosad.com/supercells/articles/motifels.html.

library(supercells)    # superpixels for spatial data
library(terra)         # spatial raster data reading and handling
library(sf)            # spatial vector data reading and handling
library(motif)         # spatial pattern signatures (lsp_* functions)
landcover = rast(system.file("raster/landcover2015.tif", package = "motif"))
plot(landcover)
comp_output = lsp_signature(landcover, type = "composition", window = 20,
                             normalization = "pdf", ordered = FALSE)
comp_output = lsp_restructure(comp_output)
comp_output = lsp_add_terra(comp_output)
comp_output2 = subset(comp_output, 3:9)
plot(comp_output2)

I can think of four possible approaches to choosing k.

  1. Create supercells as small as possible to detect a pattern (depending on the input data, it can be as small as step = 3). Then you may merge similar supercells using a clustering method (for examples, see https://doi.org/10.1016/j.jag.2022.102935).
  2. Create supercells based on existing knowledge of the size of patterns/processes you are studying.
  3. Create supercells based on the spatial scale of interest (e.g., what is the size of regions you want to analyze).
  4. Create supercells by testing different parameters, and visually deciding on the optimal ones. For this approach, I would suggest disabling the additional process of connectivity enforcement (clean = FALSE), then try a few sizes and compare results.

slic1000 = supercells(comp_output2, k = 1000, compactness = 0.1, dist_fun = "jsd", clean = FALSE)
slic2000 = supercells(comp_output2, k = 2000, compactness = 0.1, dist_fun = "jsd", clean = FALSE)
slic4000 = supercells(comp_output2, k = 4000, compactness = 0.1, dist_fun = "jsd", clean = FALSE)
# visualize only the first three raster layers
library(tmap)
tmap_mode("view")
tm_shape(comp_output2) +
  tm_raster() +
  tm_facets(as.layers = TRUE) +
  tm_shape(slic1000) +
  tm_borders(col = "#7553DB") +
  tm_shape(slic2000) +
  tm_borders(col = "#F2506E") +
  tm_shape(slic4000) + 
  tm_borders(col = "#EBB364")
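
The first approach in the list above (many small supercells, then merging similar ones) can be sketched in base R. This is only an illustration under stated assumptions: the per-supercell values are synthetic, the helper `make_group` is hypothetical, and Euclidean distance with Ward clustering stands in for whatever distance and clustering method suit the real data.

```r
# Sketch: group small supercells by clustering their attribute values.
# Synthetic data; with real output, use the value columns of the
# sf object returned by supercells() instead.
set.seed(1)

# hypothetical helper: n supercells whose compositions scatter around `center`
make_group = function(center, n) {
  m = matrix(abs(rnorm(n * length(center),
                       mean = rep(center, each = n), sd = 0.03)),
             nrow = n)
  m / rowSums(m)  # each row sums to 1, like a composition signature
}

vals = rbind(
  make_group(c(0.8, 0.1, 0.1), 10),
  make_group(c(0.1, 0.8, 0.1), 10),
  make_group(c(0.1, 0.1, 0.8), 10)
)

# hierarchical clustering on distances between supercells
hc = hclust(dist(vals), method = "ward.D2")
cluster_id = cutree(hc, k = 3)
table(cluster_id)
```

Note that plain hierarchical clustering ignores adjacency, so supercells with the same `cluster_id` are not guaranteed to form contiguous regions. With an sf object, polygons sharing a cluster id could then be dissolved in a separate step.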

ManuelSpinola commented 1 year ago

Thank you very much Jakub. I will try that.

Manuel

ailich commented 10 months ago

@Nowosad, would these same parameters work for tuning compactness? If so, could you provide some guidance on how to choose the range of values to test? Adjusting the formula from your 2021 paper to be in terms of supercells parameters, I believe the distance equation should be

$$D = \sqrt{\left(\frac{d_\text{spectral}}{\text{compactness}}\right)^2 + \left(\frac{d_\text{spatial}}{\text{step}}\right)^2}$$

(though I'm unsure whether step should be converted from cell units to map units).

From this equation, I can see that if the same spectral data were measured in different units and run through the SLIC algorithm, the compactness parameter would need to change to give an equivalent result. From reading and from the equation, I understand that larger values emphasize space and bring the result closer to k-means clustering of the coordinates, whereas smaller values emphasize spectral characteristics, and that compactness depends on the range of the input cell values and the selected distance measure. That said, given the range of my data and the selected distance measure (Euclidean in my case), I'm unsure how to tell what counts as a small value for compactness, what counts as a large one, and what value would give approximately equal weight to both terms. Do you have any guidance on that?
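
The equation above can be checked numerically. Below is a minimal base R sketch that takes the formula exactly as written in this comment; it has not been verified against the package source. The function names and the distances, step and compactness values are hypothetical.

```r
# Combined distance as written in the equation above
# (assumption: this mirrors the formula in the comment, not the package code)
slic_distance = function(d_spectral, d_spatial, compactness, step) {
  sqrt((d_spectral / compactness)^2 + (d_spatial / step)^2)
}

# hypothetical distances between one cell and one supercell center
d_spectral = 12
d_spatial = 8
step = 20

# rescaling the spectral units and compactness by the same factor leaves D unchanged
d1 = slic_distance(d_spectral, d_spatial, compactness = 5, step = step)
d2 = slic_distance(d_spectral * 100, d_spatial, compactness = 5 * 100, step = step)
all.equal(d1, d2)  # TRUE

# share of the squared distance contributed by the spectral term
spectral_share = function(d_spectral, d_spatial, compactness, step) {
  s = (d_spectral / compactness)^2
  s / (s + (d_spatial / step)^2)
}

# small, balanced, and large compactness for this particular cell
sapply(c(1, 30, 1000), function(cmp) {
  spectral_share(d_spectral, d_spatial, cmp, step)
})
```

Under this formula, the two terms contribute equally for a given cell when compactness equals d_spectral * step / d_spatial (30 in this example). That balance point is specific to a cell and center pair, so one possible way to bracket a test range would be to look at typical spectral distances in the data relative to step.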