JeremyGelb / spNetwork

An R package to perform spatial analysis on networks.
GNU General Public License v2.0
35 stars 2 forks source link

"replacement has [x] rows, data has [y]" error encountered using nkde.mc function #15

Closed bright1993ff66 closed 1 year ago

bright1993ff66 commented 1 year ago

Thank the author for creating such a convenient package for spatial analysis!

I am currently working on analyzing the spatial distribution of accidents in the road network. I want to use the network KDE provided by the nkde.mc function in the spNetwork to complete this task. The codes I used are presented as follows:

# reference site:
# https://cran.r-project.org/web/packages/spNetwork/vignettes/NKDE.html

# Basic packages
library(Rcpp)
library(htmlTable)
library(Gmisc)  # cope with paths
library(ggplot2)
library(reshape2)
library(kableExtra)
library(RColorBrewer)
library(dplyr)

# For spatial analysis
library(spNetwork)
library(sf)
library(tmap)
library(sp)
library(rgdal)

# Set the working directory
work_dir = 'D:/Projects/hotspot_rank/ny_shapefiles'
setwd(work_dir)
save_dir_lixels = pathJoin(work_dir, 'nkde_accs_results')

# set the random seed
set.seed(777)

# setting the multisession plan
future::plan(future::multisession(workers=2))

# Load the border, grids, and accident records
ny_border = st_read('newyork_border_proj.shp')
ny_roads = st_read('StreetSegment_NY.shp')

# Compute the network kernel densities
project_shape <- function(shape, target_crs){
  if (st_crs(shape)$epsg == target_crs){
    return (shape)
  } else {
    shape_project = st_transform(shape, target_crs)
    return (shape_project)
  }
}

ny_roads_proj <- project_shape(shape = ny_roads, target_crs = 32118)
ny_roads_select <- select(ny_roads_proj, c('NYSStreetI', 'geometry'))

# Load the samples for kernel density computation
samples_on_roads <- st_read('lixel_samples.shp')
lixels <- st_read('lixels.shp')

# Check multiple settings of bandwidth
shapefiles <- list.files(path = pathJoin(work_dir, 'nkde_accs'), 
                         pattern = "*.shp$")

# for the kernel density estimates based on adaptive bandwith
# reference: https://cran.r-project.org/web/packages/spNetwork/vignettes/NKDE.html#an-example-with-adaptive-bandwidth
for (shapefile in shapefiles){
  sprintf("---Coping with the file: %s---", shapefile)
  accs_data = st_read(pathJoin(work_dir, 'nkde_accs', shapefile))
  file_count_str = gsub(pattern = "\\D", replacement = "", x = shapefile)
  accs_select <- select(accs_data, c('COLLISION_', 'geometry'))
  # accs_select <- type.convert(accs_select, as.is = TRUE)
  adapt_densities <- nkde.mc(
                       lines = ny_roads_select, 
                       events = accs_select,
                       w = rep(1, nrow(accs_select)),
                       samples = samples_on_roads,
                       kernel_name = "gaussian",
                       bw = 500, # set a reference bandwidth
                       adaptive = TRUE,
                       trim_bw = 1000,  # set the maximum local value of bandwidth
                       div= "bw", 
                       method = "discontinuous", 
                       digits = 2, 
                       tol = 1,
                       grid_shape = c(5,5),  # for fast computation
                       max_depth = 8,
                       agg = 100, #we aggregate events within a 100m radius (faster calculation)
                       sparse = TRUE,
                       verbose = TRUE, 
                       check = TRUE, 
                       diggle_correction = TRUE, 
                       study_area = ny_border)
  lixels$densities <- adapt_densities$k
  st_write(lixels, 
           pathJoin(save_dir_lixels,   # as.character(integer) -> string
                    paste('nkde_', file_count_str, '_', 'adaptive', ".shp", sep = "")), 
           append = FALSE)
  print('---Done---')
}

But I got the following error continuously in the console when verbose code is "calculating the local bandwidth...":

Error in `[[<-.data.frame`(`*tmp*`, i, value = 1:0) : 
Replacement has 2 rows, but data has 0

Can anyone give me some hints about this error message? It seems like it was due to a dimension dismatch problem, as indicated by some questions in stackoverflow. But I cannot figure it out....

JeremyGelb commented 1 year ago

Hello there ! First of all, thank you for your interest in spNetwork. I would be happy to help you with this bug. I think it could be some kind of edge case specific to your dataset. But you could check the following first :

  1. Check which shapefile is causing the error (it will help to reproduce the error)
  2. Have you checked that the events are not too far from the road network ?
  3. samples_on_roads are point geometry ?

If it is not helping, could you send me the data required to reproduce the bug (one of the shapefiles with the points, the network and the border) ?

bright1993ff66 commented 1 year ago

Hello there ! First of all, thank you for your interest in spNetwork. I would be happy to help you with this bug. I think it could be some kind of edge case specific to your dataset. But you could check the following first :

  1. Check which shapefile is causing the error (it will help to reproduce the error)
  2. Have you checked that the events are not too far from the road network ?
  3. samples_on_roads are point geometry ?

If it is not helping, could you send me the data required to reproduce the bug (one of the shapefiles with the points, the network and the border) ?

Dear @JeremyGelb ,

Thank you for your response! I double check the questions you mentioned:

  1. I double-checked all the involved shapefiles. Firstly, all of them are in the same project coordinate system (epsg = 32118). Secondly, they are in the proper shape (ny_roads_select: LINESTRING; accs_select: POINT; samples_on_roads: POINT; lixels: LINESTRING)
  2. The events I used (accs_select) are spatially adjacent to the roads I think. Here is a plot showing the distribution of New York city border, the road shapefile (in grey), and the accident records (in red) NY_acc_plot
  3. The samples_on_roads is only composed of points. The samples_on_roads and lixels shapefiles are generated by the function lixelize_lines.mc in the spNetwork package as follows:
    lixels <- lixelize_lines.mc(ny_roads_select, lx_length = 200, 
                               verbose=TRUE, mindist = 100)
    samples <- lines_center(lixels)
    st_write(samples, 'lixel_samples.shp', append=FALSE)
    st_write(lixels, 'lixels.shp', append=FALSE)

    Here, I uploaded the shapefiles to reproduce the error message. The files were named according to the codes. Thank you for your time and support!

shapefiles_nkde.zip

bright1993ff66 commented 1 year ago

@JeremyGelb Sorry for not uploading the border shapefile. Here is the shapefile for border of New York City, with project coordinate system epsg = 32118

ny_border.zip

JeremyGelb commented 1 year ago

Great! I will have a look this week.

bright1993ff66 commented 1 year ago

I am not sure if I have worked this out...

I think the problem is about the input. The many of road segments I used previously do not have any accident records nearby, which cause significant zero-inflation problem. Hence, I removed the road segments where no collision record was found in nearby 5-meter zone (done by ArcGIS select by location function). About 20% of road segments were selected for the following network kernel density calculation. Then I computed the network kernel density using the following code:

adapt_densities <- nkde.mc(
                       lines = ny_roads_select, 
                       events = accs_select,
                       w = rep(1, nrow(accs_select)),
                       samples = samples_on_roads,
                       kernel_name = "gaussian",
                       bw = 500, # set a reference bandwidth
                       adaptive = TRUE,
                       trim_bw = 1000,  # set the maximum local value of bandwidth
                       div= "bw", 
                       method = "discontinuous", 
                       digits = 2, 
                       tol = 1,
                       grid_shape = c(2,2),  # for fast computation
                       max_depth = 8,
                       agg = 100, #we aggregate events within a 100m radius (faster calculation)
                       sparse = TRUE,
                       verbose = TRUE, 
                       check = TRUE)

Here I attached the shapefiles I used to run the above codes.

update_shapefiles.zip

JeremyGelb commented 1 year ago

Hello ! It is good to know that the error is certainly caused by the data used. However, removing many roads can cause a problem. The densities of the events are spread along the roads. If one is missing, the local densities will be overstimated (before the missing link) and underestimated (after the missing link). I would suggest two things here :

  1. checking if all the roads have a valid geometry (st_is_valid, st_make_valid), removing the roads with a length equal to 0.
  2. Changing the parameter grid_shape with other values (like c(10,10) or c(9,9)). It is not impossible that the splitting operation has created an edge case.
JeremyGelb commented 1 year ago

I am working on it, but i have found some elements so far :

  1. some events are outside the limits of the study_area
  2. the study area has distinct islands
  3. some islands have internal borders

The second point could not be the cause of the problem, but it is the first time I see this configuration. I think that the best practice here would be to edit the polygons to have only one big polygon (by drawing bridges in Qgis) or to split the analysis for each island

The last point is the most concerning. It produces a case where the density of a point cross the border but remains in the study area. I suggest here to union the geometries in the same island :

border <- st_union(st_combine(border), by_feature = TRUE)

I am still working on it, but I am 98% sure that it is the cause of the problem

bright1993ff66 commented 1 year ago

Dear @JeremyGelb ,

Thank you for your suggestions! I will check the three points you mentioned so far and report soon~

I am working on it, but i have found some elements so far :

  1. some events are outside the limits of the study_area
  2. the study area has distinct islands
  3. some islands have internal borders

The second point could not be the cause of the problem, but it is the first time I see this configuration. I think that the best practice here would be to edit the polygons to have only one big polygon (by drawing bridges in Qgis) or to split the analysis for each island

The last point is the most concerning. It produces a case where the density of a point cross the border but remains in the study area. I suggest here to union the geometries in the same island :

border <- st_union(st_combine(border), by_feature = TRUE)

I am still working on it, but I am 98% sure that it is the cause of the problem

Dear @JeremyGelb ,

Thank you for your suggestions! I will check the three points you mentioned so far and report soon~

bright1993ff66 commented 1 year ago

Hello ! It is good to know that the error is certainly caused by the data used. However, removing many roads can cause a problem. The densities of the events are spread along the roads. If one is missing, the local densities will be overstimated (before the missing link) and underestimated (after the missing link). I would suggest two things here :

  1. checking if all the roads have a valid geometry (st_is_valid, st_make_valid), removing the roads with a length equal to 0.
  2. Changing the parameter grid_shape with other values (like c(10,10) or c(9,9)). It is not impossible that the splitting operation has created an edge case.

@JeremyGelb Thank you for your reply! But how to cope with this kind of zero-inflation problem in kernel density estimation? The collision records are sparsely distributed in space, and the majority of roads do not have collision records nearby. This can result in a density estimate that is biased towards the areas with accidents, and the areas with no accidents are not represented accurately.

Do you have any suggestions about dealing with the zero-inflation problem in this case? Thanks!

JeremyGelb commented 1 year ago

Hello !

I have been able to run the analysis with the following code on the data you send me. My guess about the border was right.

library(sf)
library(spNetwork)

roads <- st_read("C:/Users/Gelb/Desktop/TEMP/spNetwork_issue15/data/ny_roads_select.shp")
events <- st_read("C:/Users/Gelb/Desktop/TEMP/spNetwork_issue15/data/acc_select.shp")
border <- st_read("C:/Users/Gelb/Desktop/TEMP/spNetwork_issue15/data/ny_border.shp")

border <- st_buffer(border, 5)
border <- st_union(st_combine(border), by_feature = TRUE)

sample_lines <- st_read("C:/Users/Gelb/Desktop/TEMP/spNetwork_issue15/data/lixels.shp")

sample_points <- lines_center(sample_lines)

# let me remove points ouside the border

test <- lengths(st_intersects(events,border))>0
events <- subset(events, test)
events$w <- 1

# length check
sum(as.numeric(st_length(roads)) == 0)
# OK

# validity test
sum(!st_is_valid(roads))
# OK

# let me calculate the density first with an abritrary BW

results1 <- nkde(
  lines = roads,
  events = events,
  w = events$w ,
  samples = sample_points,
  kernel_name = "gaussian",
  bw = 500,
  trim_bw = 1200,
  adaptive = TRUE,
  method = "discontinuous",
  diggle_correction = TRUE,
  study_area = border,
  digits = 2,
  tol = 1,
  grid_shape = c(8,8),  # for fast computation
  max_depth = 8,
  agg = 50, #we aggregate events within a 100m radius (faster calculation)
  sparse = TRUE,
  verbose = TRUE,
  check = FALSE,
  div = "bw"
  )

sample_points$dens <- results1$k * 1000

library(tmap)

tm_shape(sample_points) + 
  tm_dots(col = "dens", style = "kmeans", n = 8)

Here is a little map of the results

image

About 0 inflation, I think that it is not a problem for the NKDE. Kernel density methods are only descriptive methods and not modeling methods. The presence or the absence of links without events is not impacting the calculus for the other links. However, if you plan to use the obtained densities in a regression model (to understand the impact of some predictors on the densities) then you might have a 0 inflation problem.

bright1993ff66 commented 1 year ago

@JeremyGelb Thank you for your codes, and also, much thanks for your feedback about the zero-inflation problem!

I will revise my codes later. Thanks again for your time and contribution to this open question. I will close this question then~

JeremyGelb commented 1 year ago

Great ! Thank you again for your interest, let me know if you encounter another problem later.