JosiahParry / sfdep

A tidy interface for spatial dependence
https://sfdep.josiahparry.com/
GNU General Public License v3.0

Dealing with NA's on spacetime objects #53

Open rafalopespx opened 6 months ago

rafalopespx commented 6 months ago

Hi Josiah, thank you for the amazing package! It helped me a lot!

I don't know if this is the right place for this kind of issue, but I think it's worth asking.

So, I'm using a lot of the functions here to work on a dataset in which some time periods have many NAs in the value variable, and I want to produce an emerging hot spot analysis like here.

I've been able to reproduce my error by running this piece of code with random NAs introduced in the value column of the bos spacetime cube:

```r
gi_stars <- bos_nb |> 
  group_by(time_period) |> 
  mutate(gi_star = local_gstar_perm(value, nb, wt)) |> 
  tidyr::unnest(gi_star)

gi_stars |> 
  select(.region_id, time_period, gi_star, p_folded_sim)
```

The error is as follows:

```
Error in `mutate()`:
ℹ In argument: `gi_star = local_gstar_perm(value, nb, wt)`.
ℹ In group 1: `time_period = 2010-01-01`.
Caused by error in `card()`:
! not a neighbours list
Run `rlang::last_trace()` to see where the error occurred.
```

I think the issue here is that, because I have NAs in the value column, some listed neighbours no longer correspond to neighbouring places, but maybe I'm wrong. I tried alternatives and workarounds like filtering the bos spacetime object before passing it to the Gi* pipe, and using queen = TRUE and/or allow_zero = TRUE in st_contiguity() and st_weights(), respectively, but none of these worked; the error returned is the same.
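[Editor's note: as a hedged illustration of why filtering the cube before reusing stored neighbours can fail, here is a toy snippet in plain R. It assumes only that neighbour lists (as in spdep/sfdep `nb` objects) store integer row positions, so dropping rows leaves old indices pointing at the wrong geometries; the region names are made up.]

```r
# Toy illustration (plain R, not sfdep-specific): neighbour lists store
# integer row positions, so filtering rows invalidates the stored indices.
regions <- c("A", "B", "C", "D")
nb <- list(2L, c(1L, 3L), c(2L, 4L), 3L)  # neighbours by row index

# Suppose region B has an NA value and gets filtered out:
regions_kept <- regions[c(1, 3, 4)]

# Region D's stored neighbour index is still 3, which used to mean "C"
# but now points at a different region in the filtered vector.
regions[nb[[4]]]        # original data: "C"
regions_kept[nb[[4]]]   # after filtering: "D"
```

This is why neighbours have to be rebuilt after any row filtering, rather than reused from the unfiltered cube.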

Thanks in advance for any help or insights, Josiah!

Here is my whole pipeline, adapted from the vignette:

```r
library(sfdep)
library(dplyr)

### Create objects to store file paths 
df_fp <- system.file("extdata", "bos-ecometric.csv", package = "sfdep")
geo_fp <- system.file("extdata", "bos-ecometric.geojson", package = "sfdep")

### read in data 
df <- readr::read_csv(df_fp, col_types = "ccidD")
df <- df |> 
  mutate(value = ifelse(rbinom(n(), 1, 0.1), NA, value))
geo <- sf::st_read(geo_fp)

bos_expanded <- expand.grid(time_period = unique(df$time_period), 
                            .region_id = unique(geo$.region_id))

bos <- bos_expanded |> 
  left_join(df)

### Create spacetime object called `bos`
bos <- spacetime(bos, geo, 
                 .loc_col = ".region_id",
                 .time_col = "time_period")

bos

bos_nb <- bos |> 
  activate("geometry") |> 
  mutate(
    nb = include_self(st_contiguity(geometry, queen = TRUE)),
    wt = st_weights(nb, allow_zero = TRUE, zero.policy = TRUE)
  ) |> 
  set_nbs("nb") |> 
  set_wts("wt")

head(bos_nb)

gi_stars <- bos_nb |> 
  filter(!is.na(value)) |> 
  group_by(time_period) |> 
  mutate(gi_star = local_gstar_perm(value, nb, wt)) |> 
  tidyr::unnest(gi_star)

gi_stars |> 
  select(.region_id, time_period, gi_star, p_folded_sim)

### conduct EHSA
ehsa <- emerging_hotspot_analysis(
  x = bos, 
  .var = "value", 
  k = 1, 
  nsim = 99
)

### preview some values 
head(ehsa)

### evaluate the classifications of our hotspots
count(ehsa, classification)
```
JosiahParry commented 6 months ago

Thanks, @rafalopespx! In the future, would you mind wrapping the R code in backticks like this? It would make it a lot easier to read!

```r
# R code here
```
First and foremost, you cannot have NA values when calculating the local Gi. If you want to use `emerging_hotspot_analysis()` you will need to impute those values. Or if you want to do something that is not _exactly_ emerging hotspot analysis but quite close (like the vignette) you will need to filter those locations out, then calculate the neighbors based on each time period like so: 
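[Editor's note: for the imputation route, here is a minimal hedged sketch. The fill rule (per-region mean, falling back to 0) is only an example and not part of the original answer; the choice of fill value changes the resulting Gi* statistics, so pick one that matches your data.]

```r
library(dplyr)

# Example data standing in for the `value` column of the cube:
df <- tibble(
  .region_id = c("a", "a", "a", "b", "b"),
  value      = c(1, NA, 3, NA, NA)
)

# Impute per-region means, then 0 for regions that are entirely NA.
df_imputed <- df |>
  group_by(.region_id) |>
  mutate(value = coalesce(value, mean(value, na.rm = TRUE))) |>
  mutate(value = coalesce(value, 0)) |>
  ungroup()

df_imputed$value  # 1 2 3 0 0
```

With `value` fully imputed, `emerging_hotspot_analysis()` can run on the complete cube.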

```r
library(sfdep)
library(dplyr)

df_fp <- system.file("extdata", "bos-ecometric.csv", package = "sfdep")
geo_fp <- system.file("extdata", "bos-ecometric.geojson", package = "sfdep")

df <- readr::read_csv(df_fp, col_types = "ccidD")
df <- df |>
  mutate(value = ifelse(rbinom(n(), 1, 0.1), NA, value))
geo <- sf::st_read(geo_fp)

bos_expanded <- expand.grid(time_period = unique(df$time_period),
                            .region_id = unique(geo$.region_id))

bos <- bos_expanded |>
  left_join(df)

tmp <- bos |> 
  filter(!is.na(value)) |> 
  left_join(geo) |> 
  group_by(time_period) |> 
  mutate(
    nb = list(st_contiguity(geometry)),
    wt = st_weights(nb[[1]], allow_zero = TRUE),
    g = local_g_perm(value, nb[[1]], wt)
  ) |> 
  arrange(time_period)

tmp |> 
  tidyr::unnest(g)
```

rafalopespx commented 6 months ago

Amazing, Josiah! It worked. Do you have any tips to speed up the process? My dataset is 7665 polygons by 628 time periods.

Edit: One more thing, which way do you think is better? Imputing values (I would impute 0 for them, since I need a sense that those values are not significant) or calculating the neighbors for each time period?

Edit 2: I might be realizing this only now, but I just can't produce an emerging_hotspot_analysis once I have NAs in some of the slices of the spacetime cube, right? Is there no workaround for this?