JosiahParry / sfdep

A tidy interface for spatial dependence
https://sfdep.josiahparry.com/
GNU General Public License v3.0

inconsistent results between emerging_hotspot_analysis() and manually calculating hotspots step-by-step #41

Closed johnForne closed 7 months ago

johnForne commented 7 months ago

Kia ora Josiah

Thanks for your work on sfdep:: I was happy to see an sf-friendly package.

However, I've discovered some inconsistencies that I can't explain between the results I get from emerging_hotspot_analysis() and from doing the same thing step by step.

I note that the "include_gi = TRUE" argument doesn't seem to work, so I wasn't able to really dig into where/how the inconsistencies came in...

The following example uses the guerry dataset and shows that I get a different picture depending on whether I use emerging_hotspot_analysis() (top map) vs doing it step-by-step (lower map).

[screenshot: classification map from emerging_hotspot_analysis()]

[screenshot: classification map from the step-by-step calculation]

Can you please resolve the issue with the argument include_gi?

Can you also please clarify why my results are so different? I'm looking to use the function in some published indicators I'm working on - but I need to have confidence that the results match what I get through a step-by-step calculation.

Thanks heaps,

John

library(sf)
library(sfdep)
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)      # rownames_to_column()
library(modifiedmk)  # mmkh()
library(simplevis)

# replicate the guerry dataset 10 times
x <- purrr::map_dfr(1:10, ~guerry)  %>%  
  dplyr::select(code_dept, crime_pers) %>% 
  # create an indicator for time period
  mutate(time_period = sort(rep(1:10, 85)), 
         # add some noise 
         crime_pers = crime_pers * runif(850, max = 2))

x

# check what this looks like for one time period
x %>% 
  filter(time_period == 1) %>% 
  st_set_crs(27572) %>% 
  simplevis::gg_sf_col(col_var = crime_pers)

# test using emerging_hotspot_analysis() which first requires creating a spacetime object...
spt <- as_spacetime(x, "code_dept", "time_period")

ehsa_guerry <-
  emerging_hotspot_analysis(spt,
                            "crime_pers",
                            include_gi = TRUE, # note the include_gi argument doesn't seem to work.
                            threshold = 0.05)  %>% 
  mutate(classification = factor(
    classification,
    levels = c(
      "consecutive hotspot",
      "sporadic hotspot",
      "new hotspot",
      "new coldspot",
      "sporadic coldspot",
      "consecutive coldspot",
      "persistent coldspot",
      "no pattern detected"
      ))) 

rcartocolor::carto_pal(n = 9, name = "Prism") %>% 
  scales::show_col()

p <- c(rev(c("#CC503E", "#E17C05", "#EDAD08", "#73AF48", "#0F8554", "#38A6A5", "#1D6996")),"grey")

guerry %>% 
  left_join(ehsa_guerry, by = c("code_dept" = "location")) %>% 
  st_set_crs(27572) %>% 
  simplevis::gg_sf_col(col_var = classification,
                       pal = p,
                       title = "Emerging hotspots - using sfdep::emerging_hotspot_analysis()")

# compare what the output from emerging_hotspot_analysis() looks like relative to doing it step-by-step----
# identify neighbours and assign weights----
x_ <- x %>%
  mutate(nb = include_self(st_contiguity(geometry)),
         wt = st_weights(nb)) %>% 
  split(list(.$time_period))

# calculate local_gstar----
guerry_gi_star <- x_ %>% 
  map(. %>% mutate(nb = include_self(st_contiguity(geometry)),
                   wt = st_weights(nb)) %>% 
        transmute(gi_star = local_gstar(crime_pers, nb, wt))) %>% 
  map(. %>% tidyr::unnest(gi_star)) %>% 
  map2(names(.), ~ mutate(.x, time_period = .y)) %>% 
  map(. %>% mutate(code_dept = x_[[1]]$code_dept)) %>% 
  map(. %>% relocate(code_dept, everything())) %>% 
  bind_rows()

# map it for one time period----
guerry_gi_star %>% 
  st_set_crs(27572) %>% 
  filter(time_period == "1") %>% 
  simplevis::gg_sf_col(col_var = gi_star)

# and now calculate the trends----
guerry_gi_star_l <- guerry_gi_star %>% 
  st_drop_geometry() %>% 
  split(list(.$code_dept))

mk <- guerry_gi_star_l %>%
  furrr::future_map(~ .x %>%
                      pull(gi_star) %>%
                      mmkh() %>%
                      as.data.frame() %>%
                      rownames_to_column(var = "output") %>%
                      pivot_wider(names_from = "output", values_from = "."))

mk_ <- mk %>%
  bind_rows(.id = "code_dept") %>%
  dplyr::select(-c(`Original Z`, `old P.value`, old.variance)) %>%
  janitor::clean_names() %>% 
  mutate(code_dept = as.factor(code_dept))

mk_

guerry_gi_star_mk <- guerry_gi_star %>%
  left_join(mk_, by = c("code_dept"))

# categories adapted from https://pro.arcgis.com/en/pro-app/latest/tool-reference/space-time-pattern-mining/learnmoreemerging.htm----
guerry_gi_star_mk_cat <- guerry_gi_star_mk %>%
  mutate(
    hotspot = case_when(
      gi_star <= -2.58 ~ "Cold spot 99%",
      gi_star > -2.58 & gi_star <= -1.645 ~ "Cold spot 90%",
      gi_star > -1.645 & gi_star <= -0.9542 ~ "Cold spot 66%",
      gi_star >= 0.9542 & gi_star < 1.645 ~ "Hot spot 66%",
      gi_star >= 1.645 & gi_star < 2.58 ~ "Hot spot 90%",
      gi_star >= 2.58 ~ "Hot spot 99%",
      TRUE ~ "Not Significant"
    )
  ) %>% # adapted from https://crd230.github.io/lab7.html#Local_spatial_autocorrelation and adapted to align this with IPCC thresholds
  mutate(hotspot = factor(
    hotspot,
    levels = c(
      "Cold spot 99%",
      "Cold spot 90%",
      "Cold spot 66%",
      "Not Significant",
      "Hot spot 66%",
      "Hot spot 90%",
      "Hot spot 99%"
    )
  )) %>%
  group_by(code_dept) %>%
  arrange(code_dept, time_period) %>%
  mutate(
    hot = case_when(hotspot %in% c(# "Hot spot 66%", # removed to align with sfdep::emerging_hotspot_analysis(), which seems to use a default threshold of 0.01.
      "Hot spot 90%",
      "Hot spot 99%") ~ 1,
      TRUE ~ 0),
    cold = case_when(hotspot %in% c(# "Cold spot 66%", # removed to align with sfdep::emerging_hotspot_analysis(), which seems to use a default threshold of 0.01.
      "Cold spot 90%",
      "Cold spot 99%") ~ 1,
      TRUE ~ 0),
    prev_hot = lag(hot),
    prev_cold = lag(cold),
    n_hot = sum(hot),
    p_hot = n_hot / length(hot),
    n_cold = sum(cold),
    p_cold = n_cold / length(cold),
    next_h = case_when(hot != lag(hot) ~ 1),
    n_changes_hot = sum(next_h, na.rm = TRUE),
    next_c = case_when(cold != lag(cold) ~ 1),
    n_changes_cold = sum(next_c, na.rm = TRUE),
    cat = case_when(
      last(hot) == 1 &
        n_hot == 1 ~ "new hotspot",
      # A location that is a statistically significant hot spot for the final time step and has never been a statistically significant hot spot before.
      last(hot) == 1 &
        n_hot == 2 &
      p_hot < 0.9 ~ "consecutive hotspot",
      # A location with a single uninterrupted run of at least two statistically significant hot spot bins in the final time-step intervals. The location has never been a statistically significant hot spot prior to the final hot spot run and less than 90 percent of all bins are statistically significant hot spots.
      last(hot) == 1 &
        n_changes_hot >= 3 &
        p_hot < 0.9 &
        n_cold == 0 ~ "sporadic hotspot",
      # A statistically significant hot spot for the final time-step interval with a history of also being an on-again and off-again hot spot (i.e. there are not more than two consecutive hotspots). Less than 90 percent of the time-step intervals have been statistically significant hot spots and none of the time-step intervals have been statistically significant cold spots.
      p_hot >= 0.9 &
        last(hot) == 1 &
        tau > 0 &
        new_p_value <= 0.333 ~ "intensifying hotspot",
      # A location that has been a statistically significant hot spot for 90 percent of the time-step intervals, including the final time step. In addition, the intensity of clustering of high counts in each time step is increasing overall and that increase is statistically significant.
      p_hot >= 0.9 &
        new_p_value > 0.333 ~ "persistent hotspot",
      # A location that has been a statistically significant hot spot for 90 percent of the time-step intervals with no discernible trend in the intensity of clustering over time.
      p_hot >= 0.9 &
        new_p_value <= 0.333 &
        tau < 0 &
        last(hot == 1) ~ "diminishing hotspot",
      # A location that has been a statistically significant hot spot for 90 percent of the time-step intervals, including the final time step. In addition, the intensity of clustering in each time step is decreasing overall and that decrease is statistically significant.
      last(hot == 0) & p_hot >= 0.9 ~ "historic hotspot",
      # The most recent time period is not hot, but at least 90 percent of the time-step intervals have been statistically significant hot spots.
      last(cold) == 1 &
        n_cold == 1 ~ "new coldspot",
      # A location that is a statistically significant cold spot for the final time step and has never been a statistically significant cold spot before.
      last(cold) == 1 &
        n_cold == 2 &
        p_cold < 0.9 ~ "consecutive coldspot",
      # A location with a single uninterrupted run of at least two statistically significant cold spot bins in the final time-step intervals. The location has never been a statistically significant cold spot prior to the final cold spot run and less than 90 percent of all bins are statistically significant cold spots.
      last(cold) == 1 &
        n_changes_cold >= 3 &
        p_cold < 0.9 &
        n_hot == 0 ~ "sporadic coldspot",
      # A statistically significant cold spot for the final time-step interval with a history of also being an on-again and off-again cold spot (i.e. there are not more than two consecutive coldspots). Less than 90 percent of the time-step intervals have been statistically significant cold spots and none of the time-step intervals have been statistically significant hot spots.
      p_cold >= 0.9 &
        last(cold) == 1 &
        tau < 0 &
        new_p_value <= 0.333 ~ "intensifying coldspot",
      # A location that has been a statistically significant cold spot for 90 percent of the time-step intervals, including the final time step. In addition, the intensity of clustering of low counts in each time step is increasing overall and that increase is statistically significant.
      p_cold >= 0.9 &
        new_p_value > 0.333 ~ "persistent coldspot",
      # A location that has been a statistically significant cold spot for 90 percent of the time-step intervals with no discernible trend in the intensity of clustering over time.
      p_cold >= 0.9 &
        new_p_value <= 0.333 &
        tau < 0 &
        last(cold == 1) ~ "diminishing coldspot",
      # A location that has been a statistically significant cold spot for 90 percent of the time-step intervals, including the final time step. In addition, the intensity of clustering in each time step is decreasing overall and that decrease is statistically significant.
      last(cold == 0) & p_cold >= 0.9 ~ "historic coldspot"
    ),
    # The most recent time period is not cold, but at least 90 percent of the time-step intervals have been statistically significant cold spots.
  ) %>%
  mutate(cat = replace_na(cat, "no pattern detected")) %>%
  ungroup() %>%
  mutate(cat = factor(
    cat,
    levels = c(
      "consecutive hotspot",
      "sporadic hotspot",
      "new hotspot",
      "new coldspot",
      "sporadic coldspot",
      "consecutive coldspot",
      "persistent coldspot",
      "no pattern detected"
    )
  )) 

# map it----
guerry_gi_star_mk_cat %>% 
  group_by(code_dept) %>% 
  slice(1) %>% 
  st_set_crs(27572) %>% 
  simplevis::gg_sf_col(col_var = cat,
                       pal = p,
                       title = "Emerging hotspots - manually using sfdep::local_gstar, mann_kendall, etc.")

# mmm... so there appear to be lots of differences between the two methods... Though it is unclear how emerging_hotspot_analysis() categorises the data, so a more robust comparison is to look at the tau and p-values (noting that the include_gi = TRUE argument doesn't seem to be working, so we can't compare gi_star values)

mk <- guerry_gi_star_l %>%
  furrr::future_map(~ .x %>%
                      mutate(mk = mann_kendall(gi_star, alternative = "greater")))

mk_ <- mk %>%
  bind_rows(.id = "code_dept") %>%
  janitor::clean_names() %>% 
  mutate(code_dept = as.factor(code_dept))

mk_ %>% 
  group_by(code_dept) %>% 
  slice(1) %>% 
  unnest() %>% 
  dplyr::select(code_dept, gi_star, p_value, tau) %>% 
  left_join(ehsa_guerry, by = c("code_dept" = "location")) %>% # note I'm pretty sure that emerging_hotspot_analysis() calls MannKendall() which returns a two-sided p-value.
  ungroup() %>% 
  summarise(sbs_n_p_less_05 = sum(p_value.x < 0.05),
            ehsa_n_p_less_05 = sum((p_value.y/2) < 0.05), # I'm dividing by 2 to turn a two-sided p-value into a one-sided p-value
            dif_tau = sum(tau.x != tau.y),
            n_sbs_positive_tau = sum(tau.x > 0),
            n_sbs_negative_tau = sum(tau.x < 0),
            n_ehsa_positive_tau = sum(tau.y > 0),
            n_ehsa_negative_tau = sum(tau.y < 0),
            n_diff_pos_neg = sum(tau.x > 0 & tau.y < 0 | tau.x < 0 & tau.y > 0 ),
            n_dif_ehsa_sig = sum((p_value.y/2) < 0.05 & (tau.x > 0 & tau.y < 0 | tau.x < 0 & tau.y > 0 )))

# note it is unclear whether the ehsa_guerry p_value relates to the gi_star or the MannKendall result... I suspect it relates to the gi_star...

# So there are 16 code_dept(s) that ehsa finds have a significant p-value (< 0.05) but that have different tau values. This seems to suggest that over half the estimated slopes might be wrong??? That said, it is apparent from below that there are only five polygons/code_dept with a statistically significant (p-value <= 0.1) trend, and of these five none had more than 0.5 of the time series as hotspots (they needed at least 0.9 of the time steps as hotspots). Hence none of them were intensifying or diminishing.

guerry_gi_star_mk_cat %>% 
  filter(new_p_value <= 0.05) %>% 
  # distinct(code_dept)
  View()

# ok so there are only five polygons/code_dept that have a statistically significant (p-value <= 0.1) trend, and of these five none had more than 0.5 of the time series as hotspots. Hence none of them were intensifying or diminishing.
JosiahParry commented 7 months ago

Thanks @johnForne! Unfortunately I cannot replicate your code here.
A few things here. Firstly, your code does not replicate emerging hot spot analysis as it should be done: emerging hot spot analysis includes neighbors from different time lags. That means for code_dept 1, its neighbors at the present time and the neighbors at a previous time are included in the calculation of the Gi*. By default k = 1, not 0.

I'm also not so confident that your classification scheme is entirely accurate. These are fairly complex and you can find them at https://github.com/JosiahParry/sfdep/blob/main/R/emerging-hostpot-classifications.R

I also had to write a specific version of the Gi* to be able to include time neighbors as well, so that's definitely a place where things can become different. https://github.com/JosiahParry/sfdep/blob/main/R/spacetime-local-gistar-impl.R

I also think you're using a different version of the MK statistic than I (you're using, I think, https://rdrr.io/cran/modifiedmk/)
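
For what it's worth, here's a minimal sketch (assuming both the Kendall and modifiedmk packages are installed) showing how differently the two report their results:

set.seed(1)
z <- cumsum(rnorm(10))

Kendall::MannKendall(z)  # reports tau and a two-sided p-value ("sl")
modifiedmk::mmkh(z)      # bias-corrected variant reporting "Corrected Zc", "new P-value", etc.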

To use the output of include_gi you need to get it from an attribute. You can retrieve it from the result like so:

library(sfdep)

df_fp <- system.file("extdata", "bos-ecometric.csv", package = "sfdep")
geo_fp <- system.file("extdata", "bos-ecometric.geojson", package = "sfdep")

# read in data
df <- read.csv(df_fp, colClasses = c("character", "character", "integer", "double", "Date"))
geo <- sf::st_read(geo_fp)
#> Reading layer `bos-ecometric' from data source 
#>   `/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/sfdep/extdata/bos-ecometric.geojson' 
#>   using driver `GeoJSON'
#> Simple feature collection with 168 features and 1 field
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -71.19115 ymin: 42.22788 xmax: -70.99445 ymax: 42.3974
#> Geodetic CRS:  NAD83

# Create spacetime object called `bos`
bos <- spacetime(df, geo,
                 .loc_col = ".region_id",
                 .time_col = "time_period")

# conduct EHSA
ehsa <- emerging_hotspot_analysis(
  x = bos,
  .var = "value",
  k = 1,
  nsim = 9,
  include_gi = TRUE
)

attr(ehsa, "gi_star")
#>            gi_star p_sim
#> 1    -0.9851338521   0.1
#> 2    -1.1651720095   0.1
#> 3    -1.2901364459   0.1
#> 4    -0.9406102825   0.1
#> 5    -1.1077628305   0.1

Created on 2023-11-28 with reprex v2.0.2

johnForne commented 7 months ago

Kia ora Josiah

Thanks for getting back to me so quickly - this is very valuable and is helping me understand what's happening under the hood.

Apologies if I'm missing something as I'm relatively new to this analysis, but I'd like to better understand why you seem to be suggesting that emerging hotspot analysis requires including neighbours from different time lags. As far as I understand, what we're trying to do is calculate where hot/cold spots are in a single time slice and then analyse how the localg (z-score) for that geography changes over time. By comparing localg values over time, doesn't this effectively compare, say, polygon x (and its neighbours) in time slice 1 with polygon x (and its neighbours) in time slice 2, etc.?

Or are you saying that emerging_hotspot_analysis(k = 1) works by smoothing both in space and time? This seems a novel idea - do you know how many other people introduce a k-type argument? E.g. does the ArcGIS emerging hotspot tool do this? Or is this something that you've developed? Are there any papers on this?

From https://github.com/JosiahParry/sfdep/blob/main/R/emerging-hostpot-analysis.R I've read that "In addition to identifying neighbors in space, emerging hotspot analysis also incorporates the same observations from k periods ago-called a time lag. If the time lag k is 1 and the unit of time is month, the neighbors for the calculation of Gi* would include the spatial neighbors' values at time t and the same spatial neighbors' values at time t-1. If k = 2, it would include t, t-1, and t-2."
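
If I've understood that correctly, a rough conceptual sketch of the indexing (my own illustration, not sfdep's internals) would be something like:

# with locations stacked by time period, location i's Gi* neighbourhood at time t
# with k = 1 pools the spatial neighbours' rows at t and the same neighbours' rows at t - 1
nb_spacetime <- function(nb_i, t, n_locs, k = 1) {
  lags <- 0:min(k, t - 1)  # can't look back before the first period
  unlist(lapply(lags, function(l) nb_i + (t - 1 - l) * n_locs))
}

# e.g. a location whose slice-level neighbours are rows 1, 2 and 5,
# evaluated in time period 3 with 85 locations per period
nb_spacetime(c(1, 2, 5), t = 3, n_locs = 85, k = 1)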

I've tested adding the argument "k = 0" to emerging_hotspot_analysis() and can see that the results differ from when I use the default "k = 1". However, they're still not the same.

You're right, my code used a different version of the MK statistic - I did use mmkh(). However, I've revised the code to use MannKendall() (alternative = "greater").

However, even if I set k = 0 and use the standard MannKendall() my figures/maps still don't match (see below).

I then tried using classify_hotspots(threshold = 0.1) from https://github.com/JosiahParry/sfdep/blob/main/R/emerging-hostpot-classifications.R - I still can't get my maps/figures to match! In order to use classify_hotspots() I found that I needed to use local_gstar_perm() rather than local_gstar() as I had been... I noticed that there is a p_sim variable that seems to be an output from local_gstar_perm() but not from local_gstar() - and that this p_sim was needed for classify_hotspots(). I'm not sure how to determine what to set the nsim argument to - but I saw that you use 499 in one of your examples, so I followed this. I also noticed that the gi_star value differed depending on whether I used local_gstar() or local_gstar_perm().

I would like to understand when I should use local_gstar_perm() and when local_gstar()?

Also I'd be interested in your thoughts as to why I still can't get matching figures/maps even if

1. I set both to k = 0 (for comparability I think it makes sense to set k = 0 (even though you seem to be saying that k = 1 is better) because I don't think there is a k argument when I calculate manually/step-by-step)
2. I use local_gstar_perm() when I run manually/step-by-step
3. I use the standard MannKendall() (not mmkh()).

Please let me know why you can't replicate my code as this might help us get on the same page.

Thanks again,

John

[attached: three maps/figures comparing the revised step-by-step results with emerging_hotspot_analysis()]



JosiahParry commented 7 months ago

When creating a reproducible example, I'd recommend using reprex so that you can ensure that your code is reproducible. Your code uses a version of simplevis that has something called gg_sf_col() which is not available in the current release. It also does not load the library that mmkh() is from (or call it by namespace), among other things. Again, reprex is super handy here!

I took no creative direction with this implementation of Emerging Hot Spot Analysis and, having since joined the team, I have confirmed that the way neighbors and weights are used is correct. Neighbors are identified and weighted based on space. Then neighbors from k time lags are included in the calculation of the statistic.

A couple of reasons why your code is likely not reproducing the results:

You can fork the repository and explore the local_g_spt_calc() and functions like that. The package puts a lot of work into making sure that neighbors from the right time windows are fetched and used. The differences you are seeing are from a very slight difference in the calculation of the local G. The values are quite similar but slightly different. The p-values are also going to be different each run because of conditional permutation.
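
As an aside (not part of the comparison above): because the pseudo p-values come from random conditional permutations, you can fix the seed if you want run-to-run comparability, e.g. with the guerry data bundled with sfdep:

library(sfdep)

nb <- include_self(st_contiguity(guerry$geometry))
wt <- st_weights(nb)

set.seed(42)
p1 <- local_gstar_perm(guerry$crime_pers, nb, wt)$p_sim
set.seed(42)
p2 <- local_gstar_perm(guerry$crime_pers, nb, wt)$p_sim
all.equal(p1, p2)  # identical with the same seed; generally not without one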

You can read my blog post on complete spatial randomness if you'd like (ugh looks like quarto broke my images!!! )

library(sfdep)
library(dplyr)

x <- purrr::map_dfr(1:10, ~guerry)  %>%  
  dplyr::select(code_dept, crime_pers) %>% 
  # create an indicator for time period
  mutate(
    time_period = sort(rep(1:10, 85)),
    crime_pers = crime_pers * runif(850, max = 2)
  )

xnb <- x |> 
  group_by(time_period) |> 
  mutate(
    nb = include_self(st_contiguity(geometry)),
    wt = st_weights(nb)
  ) |> 
  ungroup()

all_gis <- xnb |> 
  sf::st_drop_geometry() |> 
  group_by(time_period) |> 
  mutate(
    gi_star = local_gstar_perm(crime_pers, nb, wt)
  ) |> 
  tidyr::unnest_wider(gi_star) 

ehsa <- all_gis |> 
  group_by(code_dept) |> 
  summarise(mk = as.data.frame(unclass(Kendall::MannKendall(gi_star))))

ehsa
#> # A tibble: 85 × 2
#>    code_dept  mk$tau    $sl    $S    $D $varS
#>    <fct>       <dbl>  <dbl> <dbl> <dbl> <dbl>
#>  1 01         0.511  0.0491    23    45   125
#>  2 02         0.244  0.371     11    45   125
#>  3 03         0.0222 1          1    45   125
#>  4 04        -0.111  0.721     -5    45   125
#>  5 05        -0.200  0.474     -9    45   125
#>  6 07         0.289  0.283     13    45   125
#>  7 08         0.111  0.721      5    45   125
#>  8 09        -0.244  0.371    -11    45   125
#>  9 10        -0.111  0.721     -5    45   125
#> 10 11        -0.200  0.474     -9    45   125
#> # ℹ 75 more rows

# EHSA --------------------------------------------------------------------

spt <- as_spacetime(x, "code_dept", "time_period")

# use sfdep
ehsa_guerry <-
  emerging_hotspot_analysis(
    spt,
    "crime_pers",
    include_gi = TRUE,
    threshold = 0.05,
    nsim = 99
  ) 

ehsa_guerry
#> # A tibble: 85 × 4
#>    location     tau p_value classification     
#>    <fct>      <dbl>   <dbl> <chr>              
#>  1 01        0.467   0.0736 sporadic hotspot   
#>  2 02        0.467   0.0736 sporadic hotspot   
#>  3 03        0.289   0.283  sporadic hotspot   
#>  4 04       -0.244   0.371  sporadic coldspot  
#>  5 05       -0.244   0.371  no pattern detected
#>  6 07        0.333   0.210  no pattern detected
#>  7 08        0.378   0.152  sporadic hotspot   
#>  8 09       -0.422   0.107  sporadic coldspot  
#>  9 10       -0.0222  1      sporadic hotspot   
#> 10 11       -0.378   0.152  sporadic coldspot  
#> # ℹ 75 more rows

Created on 2023-11-29 with reprex v2.0.2

johnForne commented 7 months ago

Kia ora Josiah

Thanks heaps for your thorough and informative answers. I've shared them with the team and we're learning and figuring out how to use sfdep:: nicely.

Apologies for the delay in replying and for not providing a reproducible example. I'll take up your suggestion and look into using reprex. simplevis:: was developed by someone who used to be on our team and we use an old version of it. Anyhow...

We're looking to analyse emerging hotspots using local G*. We're still not 100% sure whether we'll simply use sfdep::emerging_hotspot_analysis(k = 1) or whether we'll use sfdep:: functions more manually. The reason for doing the process more manually is that we're interested in using a Bayesian model to analyse the trends (rather than a Mann-Kendall test). However, following your comments below we've discovered that it is potentially complicated to do it step-by-step...
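
Just so I'm clear on the shape of that step, here's a rough sketch of what I have in mind, using a plain per-location lm() purely as a placeholder for the Bayesian model, and assuming a data frame like guerry_gi_star from my first script (columns code_dept, time_period, gi_star):

library(dplyr)

# placeholder only: per-location linear trend on the gi_star series,
# standing in for the Bayesian trend model (instead of a Mann-Kendall test)
trend_by_location <- guerry_gi_star %>%
  sf::st_drop_geometry() %>%
  mutate(time_period = as.numeric(time_period)) %>%
  group_by(code_dept) %>%
  summarise(slope = coef(lm(gi_star ~ time_period))[["time_period"]])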

I suspect that we should be using local_g_spt() rather than local_gstar_perm(). Like local_gstar_perm(), it seems to account for the fact that the assumption of normality with spatial data is often not met, but in addition it:

1. works on a space time object (not just a single time slice) - one of the arguments in local_g_spt() seems to be time-lagged spatial neighbours from spt_nb(). Though I'm unclear how this actually works: spt_nb() seems to include k = 1, which I understand means that gi_star is calculated for a neighbourhood both in space and time. What I'm unclear about is how it calculates the "global" mean (whether it's all hexagons from time = t, hexagons from time = t and t-1, or hexagons from all time points, i.e. the whole dataset)? Lauren Scott (https://www.youtube.com/watch?v=9VDRYBvOoDI&list=PLGZUzt4E4O2LuV0vuH74WN6j9nxv0jUty&index=10, from around 5'38'') seems to suggest that when defining the global window there are three options (neighbourhoods are compared with: i. the entire cube; ii. ). Option one (i) seems to smooth too much and doesn't seem preferable. Instinctively, it seems to make more sense to compare with bins in the same time period (including t-1) only. The issue is that it is not clear to me, at least from https://github.com/JosiahParry/sfdep/blob/main/R/emerging-hostpot-analysis.R, what global window is used... Can you please clarify?

I'm also a bit unclear about how to use the gi_star value from local_g_spt(). From your video (https://www.youtube.com/watch?v=OnMNZwJywjs) it is apparent that I should use both the z-score (gi_star) values and the p_folded_sim values to categorise the hotspots (into those that are statistically significant vs those that aren't). I think I would then calculate trends on these gi_star values (that are smoothed in space and time (k = 1)) using a Bayesian model. And finally we could then use classify_hotspot() to categorise.

Thanks again for your help - much appreciated,

John



JosiahParry commented 7 months ago

The tool uses the Individual time step windowing method and does not calculate Gi* in the context of the whole cube.

local_gstar_perm() also handles the lack of a normality assumption. This is done through the conditional permutation approach to calculating pseudo p-values. local_g_spt() is not exported (I don't think) and is intended strictly for the use case of spacetime objects.
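
To see concretely what the permutation version adds, here's a small sketch using the guerry data bundled with sfdep (exact columns may differ a little between versions):

library(sfdep)

nb <- include_self(st_contiguity(guerry$geometry))
wt <- st_weights(nb)

names(local_gstar(guerry$crime_pers, nb, wt))       # analytical Gi*
names(local_gstar_perm(guerry$crime_pers, nb, wt))  # adds simulation-based columns such as p_sim and p_folded_sim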

The pseudo p-values (simulated) are used to classify the hotspots (e.g. was it significant 5 times or fewer, that type of thing). The Gi* value determines the type of hotspot (negative cold, positive hot). You calculate the trend on the Gi* values themselves. So you chunk your data into k time steps, calculate your statistic, and then do a trend test on the statistic itself. :)

I hope that helps :)

johnForne commented 7 months ago

Amazing 😁 thank you so much! This is very helpful. To be clear, can you confirm whether local_gstar_perm() takes a k argument and accepts space-time neighbours (i.e. those across year t and year t-1 when k = 1), or whether I have to create a space-time neighbourhood using spt_nb() that can then be passed to local_gstar_perm()?

Also, when you say the Individual time step windowing method, can you please confirm that this calculates gi_star for year t and year t-1?

Thanks heaps,

John


johnForne commented 7 months ago

Kia ora Josiah

Apologies for all my emails asking for help.

In addition to my previous message (asking for confirmation that the Individual time step windowing method you mention means local_gstar_perm() calculates gi_star for year i and year i-1), can you please help me with the following question?

Thanks again!

John

  1. How should I categorise p_folded_sim and gi_star values?

I think it seems to make the most sense to use local_gstar_perm() - rather than local_gstar(), which I was using initially. I gather local_gstar_perm() seems (https://sfdep.josiahparry.com/articles/conditional-permutation) to account for the fact that the assumption of normality with spatial data is often not met. However, local_gstar_perm() returns both a gi_star value and other parameters, including p_folded_sim. My issue is that I am unclear how to use these variables. In particular, rather than simply having a gi_star value that we can categorise into different confidence bands for hot or cold spots, it seems from https://sfdep.josiahparry.com/articles/understanding-emerging-hotspots.html that we should treat the results of local_gstar_perm() slightly differently - in particular, that we should categorise based on p_folded_sim for confidence and on the sign of the gi_star value for whether it is a cluster of high or low values.

This then got me thinking about whether we should use the p-values in this way. In particular, local_gstar_perm() includes an argument for "alternative", and by default alternative = "two.sided". It seems to me that if the alternative is set to two.sided, then we would need to divide the p_folded_sim value by two in order to categorise the confidence that a positive (or negative) gi_star value is indicating a trend? Yet, from what I can see of the documentation for sfdep::, it seems that the p-value coming out of local_gstar_perm() is two-sided yet is interpreted as if it were the p-value for a one-sided test.

I've had a play with setting alternative = "greater" and am puzzled... sometimes the p-values from the two-sided test are larger than the p-values from the one-sided test and other times vice versa. I would have expected the p-values from a one-sided test to be half the p-values from a two-sided test?

Given that ultimately I'm looking to identify (i) whether the polygon is a hotspot/coldspot, and (ii) the confidence we have that this polygon is a hotspot or coldspot - please let me know whether you think we should set local_gstar_perm(alternative = "greater"). I'm also wondering whether we need to tweak the way we categorise the p_folded_sim and gi_star values so that if gi_star > 0 the p_folded_sim values are categorised as is, and if gi_star < 0 the p_folded_sim values are "flipped" so that we get the confidence from 1 - p_folded_sim?



JosiahParry commented 7 months ago

I can only recommend that you use emerging_hotspot_analysis() and use the approach there. Alternatively, if you have ArcGIS Pro available to you, you can use the original implementation there.

One last alternative approach you can take is to use the internal function spt_nb() to calculate space time neighbors and pass that to local_gstar_perm(). Below is an example of doing this.

To interpret the scores, you look at the sign of gi_star: positive is "hot" and negative is "cold". I recommend you use p_folded_sim since it does not assume normality. You should not divide the p-value by 2. What you have described is a two-tailed test: that it is more extreme than expected under complete spatial randomness in either direction (hot or cold). You should also use a more conservative cutoff than 0.05, such as 0.001, as is recommended by Luc Anselin for the conditional permutation approach to p-value calculation.

You should not be using the Gi* statistic to determine whether or not a trend is present. Lastly, it's worth noting that the Mann-Kendall stat is non-parametric and has no assumption of normality.

library(sfdep)
library(dplyr)

x <- purrr::map_dfr(1:10, ~guerry)  %>%  
  dplyr::select(code_dept, crime_pers) %>% 
  # create an indicator for time period
  mutate(
    time_period = sort(rep(1:10, 85)),
    crime_pers = crime_pers * runif(850, max = 2)
  )

# calculate neighbors in space
nb <- include_self(st_contiguity(guerry$geometry))

spt_nbs <- sfdep:::spt_nb(
  nb,
  n_times = 10,
  n_locs = length(nb),
  k = 1
) |> 
  # handles bug in the internal function that makes it numeric 
  # not integer
  lapply(as.integer)

# make it into a nb class object 
attributes(spt_nbs) <- attributes(nb)

# calculate row-standardized weights
wts <- st_weights(spt_nbs)

# use local_gstar_p
gis <- local_gstar_perm(x$crime_pers, spt_nbs, wts)

head(gis)
#>       gi_star        e_gi       var_gi    p_value        p_sim p_folded_sim
#> 1  2.81081999 0.001484892 1.041060e-07  2.1986013 0.0279062810        0.048
#> 2  0.07275412 0.001086474 8.546990e-08  0.3839037 0.7010498597        0.684
#> 3  3.07011902 0.001030353 8.024913e-08  3.8285633 0.0001288935        0.008
#> 4 -0.47830940 0.001237108 1.067520e-07 -0.7156852 0.4741857566        0.468
#> 5  0.07303781 0.001317220 1.300446e-07 -0.3082567 0.7578870420        0.808
#> 6 -1.86127253 0.001134215 6.955435e-08 -1.8565154 0.0633801274        0.044
#>   skewness  kurtosis
#> 1    0.024 0.4356863
#> 2    0.342 0.4500153
#> 3    0.004 0.3997048
#> 4    0.234 0.2333309
#> 5    0.404 0.3420084
#> 6    0.022 0.3065077
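
As a follow-on sketch (my addition, not part of the reprex above), you could then classify each observation by the sign of gi_star and the folded pseudo p-value. One caveat: the smallest attainable pseudo p-value is roughly 1/(nsim + 1), so the 0.001 cutoff mentioned above needs local_gstar_perm() to be run with nsim of at least 999.

library(dplyr)

gis |>
  mutate(
    hotspot = case_when(
      p_folded_sim <= 0.01 & gi_star > 0 ~ "hot spot",
      p_folded_sim <= 0.01 & gi_star < 0 ~ "cold spot",
      TRUE ~ "not significant"
    )
  ) |>
  count(hotspot)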