Nowosad opened this issue 9 months ago
I actually kind of like it as its own function, run_ce, because supercells requires you to specify a compactness. meta_dists (or maybe include_distances) could still be useful, though, for providing additional quality metrics on your final result and an estimate of the heterogeneity in multivariate space within each superpixel.
Also, just a note on run_ce: we departed slightly from the original method I proposed, which strictly included only distances between each pixel and its closest center. That would have required creating Voronoi polygons, extracting the pixel values within those polygons, and comparing them to the center; instead, we compared centers to all pixels within the 2S x 2S search window around them, since that is what SLIC already does. I think that should be good, and it is likely more appropriate for the method anyway.
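For illustration, a minimal sketch of the 2S x 2S window idea (window_2S is a hypothetical helper, not the package internals): take all cells within S rows/columns of a center (r, c), clipped at the matrix edges.
window_2S <- function(m, r, c, S) {
  rows <- max(1, r - S):min(nrow(m), r + S)
  cols <- max(1, c - S):min(ncol(m), c + S)
  m[rows, cols]
}
m <- matrix(1:100, nrow = 10)
dim(window_2S(m, r = 5, c = 5, S = 2))
#> [1] 5 5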
There seems to be an issue with run_ce and NA values, though. Instead of being removed, as in the output from supercells, those clusters are assigned a distance of 0, probably because I had the distances initialized at 0 here. I thought that if all cells were NA the cluster would be skipped because of this code, but apparently that's not the case. The code needs to be able to differentiate between false zeros (all cells in the search window are NA) and true zeros (all values in the search window are the same); see the sketch after the reprex below.
library(supercells)
library(raster)
#> Warning: package 'raster' was built under R version 4.3.1
#> Loading required package: sp
library(sf)
#> Linking to GEOS 3.11.1, GDAL 3.6.2, PROJ 9.1.1; sf_use_s2() is TRUE
vol = terra::rast(system.file("raster/volcano.tif", package = "supercells"))
vol_NA = vol
vol_NA[1:50,1:50]<- NA
plot(vol_NA)
s<- supercells(vol, k = 50, compactness = 1, clean=FALSE)
s_NA<- supercells(vol_NA, k = 50, compactness = 1, clean=FALSE)
plot(st_geometry(s_NA))
cluster_dist<- run_ce(vol_NA, k = 50)
cluster_dist
#> [1] 0 0 0 0 0 53 69 42 0 0 0 0 0 71 48 68 0 0 0 0 0 38 69 59 0
#> [26] 0 0 0 0 43 69 71 0 0 0 0 0 42 59 76 22 19 21 20 43 29 33 28
length(cluster_dist)
#> [1] 48
nrow(s_NA)
#> [1] 23
nrow(s)
#> [1] 48
Created on 2024-01-31 with reprex v2.0.2
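A minimal sketch of the fix described above (max_window_dist is a hypothetical helper, not the package code), assuming distances are initialized to NA_real_ instead of 0 so that all-NA windows stay distinguishable from genuinely constant ones:
max_window_dist <- function(center_value, window_values) {
  window_values <- window_values[!is.na(window_values)]
  if (length(window_values) == 0 || is.na(center_value)) {
    return(NA_real_) # all-NA window: no distance defined, not a false zero
  }
  max(abs(window_values - center_value)) # 0 here is a true zero (constant window)
}
max_window_dist(5, c(NA, NA, NA))
#> [1] NA
max_window_dist(5, c(5, 5, 5))
#> [1] 0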
What I like about run_ce is that you can do something like this:
library(supercells)
library(raster)
#> Warning: package 'raster' was built under R version 4.3.1
#> Loading required package: sp
library(sf)
#> Linking to GEOS 3.11.1, GDAL 3.6.2, PROJ 9.1.1; sf_use_s2() is TRUE
vol = terra::rast(system.file("raster/volcano.tif", package = "supercells"))
cluster_dist<- run_ce(vol, k=50)
summary(cluster_dist)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 5.00 27.75 36.00 39.71 49.25 76.00
v1<- supercells(vol, k=50, compactness = median(cluster_dist)) #Approximately equal weight to spatial and feature distance
v2<- supercells(vol, k=50, compactness = median(cluster_dist)/10) #Feature space weighted approximately 10 times greater than spatial distance
plot(st_geometry(v1))
plot(st_geometry(v2))
Created on 2024-01-31 with reprex v2.0.2
Hi @ailich -- I hope the code in the branch is mostly fine for you (except for the NAs). I've been working on an alternative implementation of this today. I found one obstacle, and I will inform you when a new version is ready.
Thanks for the data examples, and for letting me know that having a new function is a better approach. Do you have any suggestions on how it should be named?
Thanks @Nowosad. Other than the NA issue, it's all good with me. I think of run_ce as "run compactness estimation," so the name is fine. The function does need some documentation, though. Perhaps something such as:
"This function returns the maximum distance in feature space between each initial cluster center and all pixels within its surrounding search window. The mean or median of these distances can be a good starting point for the testing compactness values that provide approximately equal weight to geographic and feature space distance in the supercells function for your data and chosen distance measure."
Additionally, the documentation for supercells says "A compactness value depends on the range of input cell values and selected distance measure." Perhaps at the end of that, add something such as "(see the run_ce function)" to point people in that direction.
Hi Alex -- see the attached code. In short:
- the new function is called get_initial_distances(), and it returns supercells after only 1 iteration -- this way you can visualize the initial results
- also, there should not be any problems with missing values
# remotes::install_github("nowosad/supercells@comp2")
library(supercells)
# set initial params
my_step = 15
my_compactness = 15
# read data and calc initial max distances
vol = terra::rast(system.file("raster/volcano.tif", package = "supercells"))
vol_dists = get_initial_distances(vol, step = my_step, compactness = my_compactness)
vol_dists
#> Simple feature collection with 24 features and 7 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: 2667400 ymin: 6478705 xmax: 2668010 ymax: 6479575
#> Projected CRS: NZGD49 / New Zealand Map Grid
#> First 10 features:
#> supercells x y elevation max_value_dist max_spatial_dist
#> 1 1 2667465 6479500 95.50213 27 19.79899
#> 2 2 2667475 6479350 102.84000 37 20.51828
#> 3 3 2667465 6479190 111.66284 47 20.51828
#> 4 4 2667475 6479070 128.45963 58 20.51828
#> 5 5 2667475 6478910 127.97794 60 20.51828
#> 6 6 2667465 6478790 109.64032 80 20.51828
#> 7 7 2667605 6479500 101.16667 41 20.51828
#> 8 8 2667645 6479360 125.74419 39 21.21320
#> 9 9 2667625 6479210 141.43529 41 21.21320
#> 10 10 2667605 6479020 164.58084 58 21.21320
#> max_total_dist geometry
#> 1 2.232089 MULTIPOLYGON (((2667400 647...
#> 2 2.720294 MULTIPOLYGON (((2667540 647...
#> 3 3.350622 MULTIPOLYGON (((2667530 647...
#> 4 4.085748 MULTIPOLYGON (((2667550 647...
#> 5 4.109609 MULTIPOLYGON (((2667400 647...
#> 6 5.505956 MULTIPOLYGON (((2667400 647...
#> 7 3.035347 MULTIPOLYGON (((2667540 647...
#> 8 2.915857 MULTIPOLYGON (((2667700 647...
#> 9 3.056505 MULTIPOLYGON (((2667700 647...
#> 10 4.117173 MULTIPOLYGON (((2667670 647...
plot(vol_dists)
# perform basic calculations
spatial_part = (mean(vol_dists$max_spatial_dist) / my_step)^2
value_part = (mean(vol_dists$max_value_dist) / my_compactness)^2
sqrt(spatial_part + value_part) #total
#> [1] 3.943791
# create supercells with equal spatial/value weights
equal_parts = sqrt(value_part / spatial_part)
my_compactness2 = my_compactness * equal_parts
spatial_part2 = (mean(vol_dists$max_spatial_dist) / my_step)^2
value_part2 = (mean(vol_dists$max_value_dist) / my_compactness2)^2
vol_sp = supercells(vol, step = my_step, compactness = my_compactness2, clean = FALSE)
plot(vol_sp["elevation"])
# create supercells with values 10 times more important
my_compactness3 = my_compactness2/sqrt(10)
spatial_part3 = (mean(vol_dists$max_spatial_dist) / my_step)^2
value_part3 = (mean(vol_dists$max_value_dist) / my_compactness3)^2
vol_sp2 = supercells(vol, step = my_step, compactness = my_compactness3, clean = FALSE)
plot(vol_sp2["elevation"])
@Nowosad, since this is meant to estimate the compactness factor, it shouldn't require a specification of compactness in the function itself. That being said, you mentioned that the weighting between geographic and feature space is more complex than in my example, and I am realizing that my initial logic likely needs to be tweaked at least slightly. The initial thought process was based on the fact that since step is used as a divisor to normalize the geographic distance to 0-1, a compactness factor that normalizes the feature-space distance to 0-1 would accomplish equal weighting. The difficulty is that this distance cannot be determined ahead of time and will differ from cluster to cluster. The original SLIC paper mentions this and says that compactness can instead be set to a constant and used to control the weighting between geographic and feature space. My idea was that setting this constant to the median (or mean) of the maximum distance between all pixels in the search window and the value at the centroid would find a compactness value that would, on average, normalize the feature-space distance to 1, therefore providing equal weighting. What I am realizing, though, is that this compares the feature space to the feature-space values at the centroid in geographic space. It would probably be more appropriate to calculate the centroid in feature space across the search window and then measure the distance of each pixel with respect to that.
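For reference, here is a minimal sketch of the SLIC-style combined distance this reasoning rests on (slic_distance is illustrative, not the package internals); with m set near the typical maximum feature distance, the two terms carry roughly equal weight:
slic_distance <- function(d_c, d_s, m, S) {
  # d_c: feature-space distance; d_s: geographic distance
  # m: compactness constant; S: step
  sqrt((d_c / m)^2 + (d_s / S)^2)
}
slic_distance(d_c = 20, d_s = 10, m = 20, S = 10) # both terms equal 1
#> [1] 1.414214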
Alternatively, like you suggested earlier, perhaps a meta_dists argument could be used to return some additional distances for assessing the result. For example, the maximum geographic distance between polygon nodes and the final centroids could be returned as a column, as well as the maximum feature-space distance of each pixel in the cluster from that cluster's centroid in feature space. Then geographic distance/step can be compared to feature-space distance/compactness to see what the weighting is in the map. Rather than estimating this before running the tool to get a starting point for compactness, this would assess the final result and show whether the weightings are reasonable, or whether you should change your compactness because one side is weighted too heavily.
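A minimal sketch of that post-hoc check (weighting_ratio and its inputs are hypothetical names for the distances described above):
weighting_ratio <- function(max_geo_dist, max_feat_dist, step, compactness) {
  # > 1: feature space dominates; < 1: geographic space dominates
  (max_feat_dist / compactness) / (max_geo_dist / step)
}
weighting_ratio(max_geo_dist = 9, max_feat_dist = 30, step = 10, compactness = 20)
#> [1] 1.666667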
Hi @ailich:
I think there are three topics to discuss.
since this is meant to estimate the compactness factor, it shouldn't require a specification for compactness in the function itself.
There are three options I can see:
Please let me know what the best option is for you.
It would probably be more appropriate to calculate the centroid in feature space across the search window and then measure the distance between each pixel with respect to that.
That's an interesting idea! But I cannot think of how to implement it without a major effort. I do not have the capacity for that, but please let me know if you want to try to write something like this...
Alternatively, like you suggested earlier, perhaps a meta_dists argument could be used and return some additional distances that could be used to assess the result.
I dropped this idea (at least for now). Why? It only works with clean = FALSE. If cleaning is on, the connections between the clusters are improved, but the distances are not recalculated; I would need to write another code layer to recalculate them. I do not think this is worth it.
Sorry for the late reply. I had a couple of other things I was working on and wanted to make sure I had the time to write a thoughtful reply.
Just some background on what the goal is. From the paper that proposed the method: "The maximum spatial distance expected within a given cluster should correspond to the sampling interval (step). Determining the maximum color distance Nc is not so straightforward, as color distances can vary significantly from cluster to cluster and image to image. This problem can be avoided by fixing Nc to a constant m (compactness)." The goal is therefore to have a function that tells you how much m (the compactness) weights feature space relative to geographic space.
Probably the most informative output would be an sf object with the corresponding distance columns, but as mentioned in a previous comment, it's probably more appropriate to compare to the centroid in feature space rather than to the feature values of the point located at the centroid in geographic space.
Here's some code to essentially accomplish that. The slight difference is that the distance is calculated in sf, so it uses geographic units rather than numbers of pixels and may differ if the x and y raster resolutions are unequal. Also, it's coded in R and may redo calculations already done by supercells, so it's not necessarily efficient.
The procedure here is to run supercells with iter = 0 to get the initial cluster centers. compactness needs to be specified for the function to run, but with iter = 0 it has no impact on the results.
#############
library(terra)
#> Warning: package 'terra' was built under R version 4.3.2
#> terra 1.7.71
library(supercells)
library(sf)
#> Linking to GEOS 3.11.1, GDAL 3.6.2, PROJ 9.1.1; sf_use_s2() is TRUE
library(tidyverse)
# Helper: create Voronoi polygons from POINT geometries
st_voronoi_point <- function(points){
## points must be POINT geometry
# check for point geometry and execute if true
if(!all(st_geometry_type(points) == "POINT")){
stop("Input not POINT geometries")
}
g = st_combine(st_geometry(points)) # make multipoint
v = st_voronoi(g)
v = st_collection_extract(v)
return(v[unlist(st_intersects(points, v))])
} # https://gis.stackexchange.com/questions/362134/i-want-to-create-a-voronoi-diagram-while-retaining-the-data-in-the-data-frame
# Load data
vol = terra::rast(system.file("raster/volcano.tif", package = "supercells"))
vol<- c(vol, terrain(vol, "slope")) #add second layer
bbox<- st_as_sf(vect(ext(vol))) #extent
distance_method<- "euclidean"
initial_centers<- supercells(vol, step=10, compactness = 1, iter=0) #compactness does not affect initial centers
voronoi_polygons <- st_voronoi_point(initial_centers) #create polygons around initial centers
voronoi_polygons<- st_crop(voronoi_polygons, bbox)
# Plot the Voronoi polygons
plot(st_geometry(voronoi_polygons), main = "Voronoi Polygons")
plot(initial_centers, add = TRUE, pch = 20, col = "red")
vals<- terra::extract(vol, vect(voronoi_polygons)) # Extract values within each polygon
feature_centroids<- vals %>% group_by(ID) %>% summarize(across(everything(), function(x) median(x, na.rm = TRUE))) #Centroids in feature space per polygon
max_feature_dist<- rep(NA_real_, length(voronoi_polygons)) # maximum distance in feature space within a given polygon
for (i in 1:nrow(vals)) {
ID<- vals$ID[i]
curr_vals<- vals[i,-1]
if(any(is.na(curr_vals))){next()}
curr_dist<- dist(rbind(curr_vals, feature_centroids[ID,-1]), method = distance_method) #distance from centroid in feature space
if(is.na(max_feature_dist[ID]) | curr_dist > max_feature_dist[ID]){
max_feature_dist[ID]<- curr_dist #Update distance
}
}
s1<- supercells(vol, step=10, compactness = median(max_feature_dist), iter=10)
plot(st_geometry(s1))
Created on 2024-02-28 with reprex v2.0.2
The code from topic 2 could be edited and used with clean = TRUE. I'd just need to add some lines to calculate the distance from each node of the polygon to the polygon's center. This could be used afterwards to evaluate how compactness affected the final result, by directly comparing the maximum geographic distance between pixels and their corresponding cluster center in geographic space with the maximum distance between pixels and their corresponding cluster center in feature space.
Hi @ailich -- thanks for the detailed message. I will try this approach in the next few days and give my feedback.
(One thing -- the code would probably need to use some custom dist function, as the package allows all the distances from the philentropy package (and a few more), plus custom ones.)
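A minimal sketch of such a wrapper (vals_dist is a hypothetical name; the philentropy::distance call mirrors the one used in the code above): dispatch named methods to philentropy and fall back to a user-supplied function.
vals_dist <- function(x, y, dist_fun) {
  if (is.function(dist_fun)) {
    dist_fun(x, y) # custom distance supplied by the user
  } else {
    philentropy::distance(rbind(x, y), method = dist_fun, mute.message = TRUE)
  }
}
vals_dist(1:3, 4:6, "euclidean") # sqrt(27) = 5.196152
vals_dist(1:3, 4:6, function(x, y) sum(abs(x - y))) # custom: 9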
Hi @ailich -- I checked the code and it makes sense (I think). Feel free to wrap it up as a function; it would be great if you could: a) avoid adding new dependencies (e.g., dplyr), b) use the distance function from philentropy instead of dist.
Best, Jakub
@Nowosad, I wrapped the previous code in a function called est_fsd, for estimating feature-space distance. I now use philentropy for the distance calculations and removed the need for dplyr. How can I call your distance functions so I can add those in?
library(sf)
#> Linking to GEOS 3.11.1, GDAL 3.6.2, PROJ 9.1.1; sf_use_s2() is TRUE
library(terra)
#> Warning: package 'terra' was built under R version 4.3.2
#> terra 1.7.71
library(supercells)
library(philentropy)
#> Warning: package 'philentropy' was built under R version 4.3.3
#>
#> Attaching package: 'philentropy'
#> The following object is masked from 'package:terra':
#>
#> distance
# Helper: create Voronoi polygons from POINT geometries
st_voronoi_point <- function(points){
## points must be POINT geometry
# check for point geometry and execute if true
if(!all(st_geometry_type(points) == "POINT")){
stop("Input not POINT geometries")
}
g = st_combine(st_geometry(points)) # make multipoint
v = st_voronoi(g)
v = st_collection_extract(v)
return(v[unlist(st_intersects(points, v))])
} # https://gis.stackexchange.com/questions/362134/i-want-to-create-a-voronoi-diagram-while-retaining-the-data-in-the-data-frame
est_fsd<- function(x, k, dist_fun="euclidean", avg_fun="mean", step, sf=TRUE){
bbox<- st_as_sf(vect(ext(x))) #extent
initial_centers<- supercells(x, k=k,step=step, compactness = 1, iter=0) #compactness does not affect initial centers
voronoi_polygons <- st_voronoi_point(initial_centers) #create polygons around initial centers
voronoi_polygons<- st_crop(voronoi_polygons, bbox)
vals<- terra::extract(x, vect(voronoi_polygons)) # Extract values within each polygon
unique_IDs<- unique(vals$ID)
feature_centroids<- as.data.frame(matrix(data=NA_real_, nrow = length(unique_IDs), ncol = ncol(vals)))
names(feature_centroids)<- names(vals)
feature_centroids$ID<- unique_IDs
for (i in 1:length(unique_IDs)) {
curr_vals<- vals[vals$ID==unique_IDs[i],] #subset rows to polygon ID
feature_centroids[i,-1]<- sapply(curr_vals[,-1], avg_fun, na.rm=TRUE)
}
rm(curr_vals,i)
max_feature_dist<- rep(NA_real_, length(voronoi_polygons)) # maximum distance in feature space within a given polygon
for (i in 1:nrow(vals)) {
ID<- vals$ID[i]
curr_vals<- vals[i,-1]
if(any(is.na(curr_vals))){next()}
curr_dist<- philentropy::distance(rbind(curr_vals, feature_centroids[ID,-1]), method = dist_fun, mute.message = TRUE) #distance from centroid in feature space
if(is.na(max_feature_dist[ID]) | curr_dist > max_feature_dist[ID]){
max_feature_dist[ID]<- curr_dist #Update distance
}
}
rm(ID, curr_vals, curr_dist)
if(sf){
max_feature_dist<- data.frame(ID= unique_IDs, MaxDist= max_feature_dist)
st_geometry(max_feature_dist)<- st_geometry(voronoi_polygons)
}
return(max_feature_dist)
}
vol = terra::rast(system.file("raster/volcano.tif", package = "supercells"))
vol<- c(vol, terrain(vol, "slope")) #add second layer
y<- est_fsd(vol, dist_fun = "euclidean", avg_fun = "median", step = 10)
plot(y[,"MaxDist"])
Created on 2024-03-26 with reprex v2.0.2
@Nowosad, I'll get started on the function for evaluating the final result, which involves getting the distance between each node and the polygon center in geographic space. While starting on that, I noticed that the x and y columns of a supercells object don't match the result of st_centroid. Do you know why that is the case, even when the data have equal x/y pixel resolution?
library(terra)
#> Warning: package 'terra' was built under R version 4.3.2
#> terra 1.7.71
library(sf)
#> Linking to GEOS 3.11.1, GDAL 3.6.2, PROJ 9.1.1; sf_use_s2() is TRUE
library(supercells)
vol = terra::rast(system.file("raster/volcano.tif", package = "supercells"))
vol<- c(vol, terrain(vol, "slope")) #add second layer
out<- supercells(vol, compactness = 1, step = 10)
centers1<- data.frame(X=out$x, Y=out$y)
centers2<- as.data.frame(st_coordinates(st_centroid(out)))
#> Warning: st_centroid assumes attributes are constant over geometries
print(centers1-centers2)
#> X Y
#> 1 -8.7058824 5.4117647
#> 2 -7.6428571 1.3571429
#> 3 -3.0681818 2.8409091
#> 4 -2.3170732 4.7560976
#> 5 -0.6896552 2.9885057
#> 6 -2.1052632 6.0526316
#> 7 -5.6962025 4.4303797
#> 8 -2.5454545 9.6363636
#> 9 -8.6440678 3.3898305
#> 10 -4.8809524 0.1190476
#> 11 -1.5151515 6.3636364
#> 12 -1.4772727 4.2045455
#> 13 -1.3793103 1.2643678
#> 14 -8.5714286 8.9285714
#> 15 -6.5116279 3.9534884
#> 16 -8.5714286 2.8571429
#> 17 -4.5588235 5.0000000
#> 18 -9.4736842 8.4210526
#> 19 -5.3731343 0.4477612
#> 20 -0.4819277 9.8795181
#> 21 -2.0661157 2.6446281
#> 22 -2.3437500 9.2187500
#> 23 -9.3181818 7.2727273
#> 24 -9.8148148 1.1111111
#> 25 -2.4657534 5.6164384
#> 26 -0.4065041 6.2601626
#> 27 -5.7272727 2.3636364
#> 28 -0.8571429 7.8571429
#> 29 -8.8461538 1.7307692
#> 30 -5.7000000 8.7000000
#> 31 -1.2195122 0.4878049
#> 32 -0.5714286 3.1428571
#> 33 -8.3561644 3.8356164
#> 34 -8.2758621 1.7241379
#> 35 -9.1764706 6.5882353
#> 36 -5.8653846 4.4230769
#> 37 -9.6103896 6.1038961
#> 38 -2.9411765 0.2941176
#> 39 -9.7674419 7.9069767
#> 40 -8.9024390 6.9512195
#> 41 -5.4285714 7.4285714
#> 42 -8.3333333 2.2222222
#> 43 -2.9702970 3.3663366
#> 44 -3.2038835 7.2815534
#> 45 -8.4285714 5.4285714
#> 46 -3.3913043 2.0869565
#> 47 -0.7920792 1.2871287
#> 48 -1.0169492 4.9152542
#> 49 -4.5535714 3.0357143
#> 50 -3.7391304 2.2608696
#> 51 -9.3150685 2.1917808
#> 52 -2.9090909 6.2727273
#> 53 -2.6984127 3.6507937
#> 54 -5.8536585 4.8780488
#> 55 -8.9908257 3.0275229
#> 56 -3.6619718 6.0563380
#> 57 -9.6153846 0.2564103
#> 58 -1.3385827 2.5196850
#> 59 -5.5483871 1.7419355
Created on 2024-03-26 with reprex v2.0.2
@ailich
How can I call your distance functions so I can add those in?
What do you mean by "your distance functions"?
Regarding the second issue -- I will try to investigate this...
@Nowosad, by "your distance functions" I meant the ones implemented outside of philentropy. The documentation specifically mentions "euclidean", "jsd", and "dtw" (dynamic time warping), and they seem to be implemented here. Can I call the get_vals_dist function from R?
@ailich thanks for letting me know about the centroids issue. See now:
library(terra)
#> terra 1.7.73
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.7.3, PROJ 9.2.1; sf_use_s2() is TRUE
devtools::load_all()
#> ℹ Loading supercells
vol = terra::rast(system.file("raster/volcano.tif", package = "supercells"))
vol<- c(vol, terrain(vol, "slope")) #add second layer
out<- supercells(vol, compactness = 1, step = 57, iter = 1, clean = FALSE)
centers1<- data.frame(X=out$x, Y=out$y)
centers2<- as.data.frame(st_coordinates(st_centroid(out)))
#> Warning: st_centroid assumes attributes are constant over geometries
centers1
#> X Y
#> 1 2667710 6479203
#> 2 2667691 6478974
centers2
#> X Y
#> 1 2667710 6479203
#> 2 2667691 6478974
In short, the cpp code was using integers to represent supercell centers, thus rounding the intermediate values. When I switched the code to use doubles, it became consistent with sf. See https://github.com/Nowosad/supercells/pull/37.
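A small R analogue of that rounding (illustrative only; the actual truncation happened in the C++ code):
centers <- c(10.5, 11.9, 13.9)
mean(centers) # true center as a double
#> [1] 12.1
as.integer(mean(centers)) # what an integer representation keeps
#> [1] 12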
Regarding your other question -- I think the solution would be to export some of the C++ code (as it is currently invisible to R). I am going to visit my family for a few days, so I will try to do it next week -- I hope that is fine with you.
Sounds good. Thanks for the quick fix on the centroids, and enjoy spending time with your family!
Hi @ailich, see the new branch -- https://github.com/Nowosad/supercells/pull/38:
devtools::load_all()
#> ℹ Loading supercells
x = 1.1:10.1
y = 2.1:11.1
dist_fun = function(){}
supercells:::get_vals_dist_cpp11(x, y, "euclidean", dist_fun)
#> [1] 3.162278
supercells:::get_vals_dist_cpp11(x, y, "jsd", dist_fun)
#> [1] 0.4140878
supercells:::get_vals_dist_cpp11(x, y, "dtw", dist_fun)
#> [1] 10
supercells:::get_vals_dist_cpp11(x, y, "dtw2d", dist_fun)
#> [1] 2.828427
supercells:::get_vals_dist_cpp11(x, y, "jensen-shannon", dist_fun) #philentropy
#> [1] 0.4140878
supercells:::get_vals_dist_cpp11(x, y, "", dist_fun = function(x, y){sum(c(x, y))})
#> [1] 122
@Nowosad, I still have to add supercells:::get_vals_dist_cpp11, but here's what I have so far.
library(terra)
#> Warning: package 'terra' was built under R version 4.3.2
#> terra 1.7.71
library(supercells)
library(sf)
#> Linking to GEOS 3.11.1, GDAL 3.6.2, PROJ 9.1.1; sf_use_s2() is TRUE
library(tidyverse)
library(philentropy)
#> Warning: package 'philentropy' was built under R version 4.3.3
#>
#> Attaching package: 'philentropy'
#> The following object is masked from 'package:terra':
#>
#> distance
# Helper: create Voronoi polygons from POINT geometries
st_voronoi_point <- function(points){
## points must be POINT geometry
# check for point geometry and execute if true
if(!all(st_geometry_type(points) == "POINT")){
stop("Input not POINT geometries")
}
g = st_combine(st_geometry(points)) # make multipoint
v = st_voronoi(g)
v = st_collection_extract(v)
return(v[unlist(st_intersects(points, v))])
} # https://gis.stackexchange.com/questions/362134/i-want-to-create-a-voronoi-diagram-while-retaining-the-data-in-the-data-frame
est_fsd<- function(x, k, dist_fun="euclidean", avg_fun="mean", step, sf=TRUE){
bbox<- st_as_sf(vect(ext(x))) #extent
initial_centers<- supercells(x, k=k,step=step, compactness = 1, iter=0) #compactness does not affect initial centers
voronoi_polygons <- st_voronoi_point(initial_centers) #create polygons around initial centers
voronoi_polygons<- st_crop(voronoi_polygons, bbox)
vals<- terra::extract(x, vect(voronoi_polygons)) # Extract values within each polygon
unique_IDs<- unique(vals$ID)
feature_centroids<- as.data.frame(matrix(data=NA_real_, nrow = length(unique_IDs), ncol = ncol(vals)))
names(feature_centroids)<- names(vals)
feature_centroids$ID<- unique_IDs
for (i in 1:length(unique_IDs)) {
curr_vals<- vals[vals$ID==unique_IDs[i],] #subset rows to polygon ID
feature_centroids[i,-1]<- sapply(curr_vals[,-1], avg_fun, na.rm=TRUE)
}
rm(curr_vals,i)
max_feature_dist<- rep(NA_real_, length(voronoi_polygons)) # maximum distance in feature space within a given polygon
for (i in 1:nrow(vals)) {
ID<- vals$ID[i]
curr_vals<- vals[i,-1]
if(any(is.na(curr_vals))){next()}
curr_dist<- philentropy::distance(rbind(curr_vals, feature_centroids[ID,-1]), method = dist_fun, mute.message = TRUE) #distance from centroid in feature space
if(is.na(max_feature_dist[ID]) | curr_dist > max_feature_dist[ID]){
max_feature_dist[ID]<- curr_dist #Update distance
}
}
rm(ID, curr_vals, curr_dist)
if(sf){
max_feature_dist<- data.frame(ID= unique_IDs, MaxDist= max_feature_dist)
st_geometry(max_feature_dist)<- st_geometry(voronoi_polygons)
}
return(max_feature_dist)
}
eval_compactness<- function(x, y, dist_fun, avg_fun){
vals<- terra::extract(x, vect(y)) # Extract values within each polygon
unique_IDs<- unique(vals$ID)
feature_centroids<- as.data.frame(matrix(data=NA_real_, nrow = length(unique_IDs), ncol = ncol(vals)))
names(feature_centroids)<- names(vals)
feature_centroids$ID<- unique_IDs
for (i in 1:length(unique_IDs)) {
curr_vals<- vals[vals$ID==unique_IDs[i],] #subset rows to polygon ID
feature_centroids[i,-1]<- sapply(curr_vals[,-1], avg_fun, na.rm=TRUE)
}
rm(curr_vals,i)
max_feature_dist<- rep(NA_real_, nrow(y)) # maximum distance in feature space within a given polygon
for (i in 1:nrow(vals)) {
ID<- vals$ID[i]
curr_vals<- vals[i,-1]
if(any(is.na(curr_vals))){next()}
curr_dist<- philentropy::distance(rbind(curr_vals, feature_centroids[ID,-1]), method = dist_fun, mute.message = TRUE) #distance from centroid in feature space
if(is.na(max_feature_dist[ID]) | curr_dist > max_feature_dist[ID]){
max_feature_dist[ID]<- curr_dist #Update distance
}
}
rm(ID, curr_vals, curr_dist)
max_poly_dist<- rep(NA_real_, nrow(y))
for (i in 1:nrow(y)) {
polygon_nodes<- st_coordinates(st_geometry(y[i,]))[,1:2]
polygon_nodes<- polygon_nodes[-nrow(polygon_nodes),]
for (j in 1:nrow(polygon_nodes)) {
curr_dist<- sqrt((polygon_nodes[j,1]-y$x[i])^2 + (polygon_nodes[j,2]-y$y[i])^2)
if(is.na(max_poly_dist[i]) | curr_dist > max_poly_dist[i]){
max_poly_dist[i]<- curr_dist #Update distance
}
}
}
y$MaxFeatDist<- max_feature_dist
y$MaxPolyDist<- max_poly_dist/mean(res(x))
return(y)
}
# Load data
vol = terra::rast(system.file("raster/volcano.tif", package = "supercells"))
vol<- c(vol, terrain(vol, "slope")) #add second layer
step<-10
Use the est_fsd function to estimate an initial compactness value that balances geographic and feature space for your specific data and distance measure. We will also try a lower and a higher value to see how this changes the results. We can then use eval_compactness on the final result. If there is a large difference between the chosen compactness and the average maximum feature distance across clusters, that may indicate that you should change the compactness value.
compactness_medium<- est_fsd(vol, step=step, dist_fun = "euclidean", avg_fun = "median", sf = FALSE) |> median()
compactness_medium
#> [1] 20.42507
compactness_high<- compactness_medium*100 #Increase importance of geographic space and decrease importance of feature space
compactness_high
#> [1] 2042.507
compactness_low<- compactness_medium/100 #Decrease importance of geographic space and increase importance of feature space
compactness_low
#> [1] 0.2042507
High compactness weights geographic space more heavily and feature space less.
supercells_high<- supercells(vol, step=step, compactness = compactness_high, iter=10,clean=FALSE)
plot(st_geometry(supercells_high))
high_eval<- eval_compactness(vol, supercells_high, dist_fun = "euclidean", avg_fun = "median")
median(high_eval$MaxFeatDist/compactness_high)
#> [1] 0.009430206
median(high_eval$MaxPolyDist/step)
#> [1] 0.7106335
Medium compactness balances weighting between geographic and feature space.
supercells_medium<- supercells(vol, step=step, compactness = compactness_medium, iter=10,clean=FALSE)
plot(st_geometry(supercells_medium))
medium_eval<- eval_compactness(vol, supercells_medium, dist_fun = "euclidean", avg_fun = "median")
median(medium_eval$MaxFeatDist/compactness_medium)
#> [1] 0.7645443
median(medium_eval$MaxPolyDist/step)
#> [1] 0.8860023
Low compactness weights geographic space less and feature space more heavily.
supercells_low<- supercells(vol, step=step, compactness = compactness_low, iter=10,clean=FALSE)
plot(st_geometry(supercells_low))
low_eval<- eval_compactness(vol, supercells_low, dist_fun = "euclidean", avg_fun = "median")
median(low_eval$MaxFeatDist/compactness_low)
#> [1] 55.70342
median(low_eval$MaxPolyDist/step)
#> [1] 1.146726
Created on 2024-04-05 with reprex v2.0.2
Hi @ailich, see https://github.com/Nowosad/supercells/tree/estimate_compactness.
I took your code, cleaned it up a bit, decided to return all the values (and not just summaries), and added an R function run_ce() -- try its examples. It seems to work fine.
That being said, I started thinking about the whole process, and have an idea:
- supercells() gets a new argument called meta_dists (better name needed!); if this argument is TRUE, then three new columns are added to the final supercells: value_distance, spatial_distance, and total_distance.
- The distances would be calculated only when meta_dists = TRUE, and only for the last iteration.
- This would match run_ce() when meta_dists = TRUE and iter = 1; however, if you keep the default iter = 10 and set meta_dists = TRUE, then your resulting supercells will have the three new columns.

@ailich, what do you think about this idea?