Hey @NateMietk - can you tell me how to reproduce the issue? It would help me help you!
Hey @mbjoseph sure - sorry about that!
Ok you can start a small EC2 instance (m5.2xlarge).
Pull down the GitHub repo: git clone https://github.com/NateMietk/human-ignitions-wui
Transfer these files:
- s3://earthlab-natem/human-ignitions-wui/fire/fpa-fod/fpa_wui_conus.gpkg
- s3://earthlab-natem/human-ignitions-wui/anthro/wui/urban_1990.gpkg
- s3://earthlab-natem/human-ignitions-wui/anthro/wui/urban_2000.gpkg
- s3://earthlab-natem/human-ignitions-wui/anthro/wui/urban_2010.gpkg
Start running scripts in this order (all found in src/R):
1. 1_create_folders.R
2. The fpa_wui object (line 204) from 3_clean_data.R
3. 5_create_distance.R is the meat. The first half imports the urban polygon layers, and the latter half preps the fpa_wui and runs the distance function, get_polygons(), which can be found in the src/functions/helper_functions.R script. This is automatically read in during the 1_create_folders script.

p.s. if you want to see how I implemented lapply and sfLapply, just take a peek at the history of 5_create_distance.R
Hey @NateMietk
Here's a way to compute KNN between all fires and urban point centroids, using the 2010 urban data as an example. It only takes a second to do the KNN search. Does this give you enough to work with?
library(sf)
library(nabor)
# load fire ignition data (takes a minute or two)
fires <- st_read('data/fire/fpa-fod/fpa_conus.gpkg')
# find centroids for all of the urban areas
urban <- st_read('data/anthro/wui/urban_2010.gpkg')
urban_polys <- st_cast(urban, 'POLYGON')
urban_centroids <- st_centroid(urban_polys)
# match spatial CRS
urban_centroids <- st_transform(urban_centroids, st_crs(fires))
# assert that fires and urban centroids have same CRS
stopifnot(st_crs(fires) == st_crs(urban_centroids))
# compute KNN between fires and urban point centroids
fire_point_coords <- st_coordinates(fires)
urban_point_coords <- st_coordinates(urban_centroids)
nearest_neighbors <- knn(urban_point_coords, fire_point_coords, k = 4)
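In case it helps, nabor::knn returns a list with nn.idx and nn.dists matrices (one row per query point, neighbors sorted nearest-first), so you can pull the candidates straight off of that. A rough sketch continuing from the code above (the column names I tack onto fires here are just placeholders, and it assumes one coordinate row per fire, i.e. POINT geometries):
# nn.idx:   n_fires x 4 matrix of row indices into urban_centroids
# nn.dists: n_fires x 4 matrix of Euclidean distances (in the units of the CRS)
str(nearest_neighbors)
# e.g., the single closest urban centroid for each fire
fires$nearest_urban_row <- nearest_neighbors$nn.idx[, 1]
fires$dist_to_urban_centroid <- nearest_neighbors$nn.dists[, 1]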
Hey @mbjoseph I had already done this, and it's still super slow. I implemented the full KNN instead of an ID-by-ID KNN, but either way it is still brutally slow. California hasn't even reached 1% complete and it's been 30 minutes. And the polygon output is still huge and unable to parallelize. Regardless of whether I do the KNN upfront or inline, it is just as slow. I updated the code if you want to see it. Any other suggestions would be appreciated.
N
@NateMietk maybe I'm not remembering the next step after this. After finding the knn for a fire, are you then trying to find the minimum distance to an urban polygon within the set of knn?
Other question: which part is slow? Have you tried profiling your code to see?
@mbjoseph yes, that is correct. I have to generate a distance from each point to the 4 KNN polygons and then take the minimum distance. It all works, and works well, when I try it on a slice of the data. But it doesn't play nicely with the whole data set.
Bogs down here (line 114):
for (h in 1:length(unique_ids)) {
  fpa_ids <- unique_ids[h]
  fpa_df <- subset(state_df, state_df$FPA_ID == fpa_ids)
  closest_centroids <-
    subset(nearest_neighbors,
           nearest_neighbors$FPA_ID == fpa_ids)
  distance_to_fire[[h]] <- fpa_df %>%
    dplyr::select(-FPA_ID) %>%
    mutate(
      distance_to_urban = min(
        st_distance(
          st_geometry(closest_centroids),
          st_geometry(.),
          by_element = TRUE
        )
      ),
      FPA_ID = data.frame(fpa_df)$FPA_ID
    )
  setTxtProgressBar(pb, h)
}
Cool - @NateMietk that makes sense. Which operations in that inner loop are taking the most time? You might try https://github.com/rstudio/profvis to find out quickly.
Also - can you use KNN w/ k=1 here instead of computing all pairwise distances between the fire and polygon vertices then taking the min?
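For what it's worth, profvis just takes an expression, so you can wrap the whole loop in profvis({ ... }) and get an interactive flame graph. A toy, self-contained sketch of the pattern (made-up data; in practice you would wrap the real loop from 5_create_distance.R):
library(profvis)

profvis({
  # stand-in for the real loop: repeated subsetting of a big data frame
  df <- data.frame(id = sample(1:1000, 1e5, replace = TRUE), x = rnorm(1e5))
  out <- vector("list", 1000)
  for (h in 1:1000) {
    sub <- subset(df, df$id == h)  # this is the kind of line that shows up as a hot spot
    out[[h]] <- min(sub$x)
  }
})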
@mbjoseph sweet package - that is so useful!
Looks like it is the subsets that are taking up the time. The actual distance calculation is very speedy. The trouble is that I cannot do this by row if I have more than 1 KNN, therefore I have to do it by FPA_ID splits.
Things crashed so I am going to re-run and send a screen shot
Slight performance increase using filter instead of subset.
Oh yeah, profvis is great.
I thought about this some more, and I think you might be able to just use KNN the whole way and avoid a good deal of shenanigans. Expanding on my last example for one of the urban layers, you can compute the nearest urban polygon vertex for every fire as follows, with the KNN step taking ~11 sec:
library(sf)
library(nabor)
# load fire ignition data (takes a minute or two)
fires <- st_read('data/fire/fpa-fod/fpa_conus.gpkg')
# find centroids for all of the urban areas
urban <- st_read('data/anthro/wui/urban_2010.gpkg')
urban_polys <- st_cast(urban, 'POLYGON') %>%
  st_transform(st_crs(fires))
# assert that fires and urban polys have same CRS
stopifnot(st_crs(fires) == st_crs(urban_polys))
# get coords
fire_point_coords <- st_coordinates(fires)
urban_coords <- st_coordinates(urban_polys)
# compute KNN between fires and urban poly vertices
system.time(nearest_neighbors <- nabor::knn(data = urban_coords[, c('X', 'Y')],
                                            query = fire_point_coords,
                                            k = 1))
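To make that concrete: nn.idx and nn.dists come back as one-column matrices when k = 1, so each fire immediately gets its nearest urban vertex and the distance to it. A rough sketch continuing from the code above (the new column names are just placeholders, and the L2 lookup assumes the POLYGON cast above, so that the L2 column of st_coordinates() indexes the parent polygon):
# with k = 1, each fire gets the row index of its nearest urban polygon vertex
# and the Euclidean distance to it (in the units of the projected CRS)
fires$nearest_vertex_row <- nearest_neighbors$nn.idx[, 1]
fires$distance_to_urban <- nearest_neighbors$nn.dists[, 1]

# map each vertex back to its parent polygon via the L2 column of st_coordinates()
fires$nearest_urban_poly <- urban_coords[fires$nearest_vertex_row, 'L2']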
Maybe simpler @NateMietk?
KNN isn't the time sink though - it is apparently how I am subsetting the data...
Right - but with the workflow above, you don't need to do any of the data subsetting. You immediately get the index for the urban polygon and the distance for each fire. So, you just have to loop over years, rather than years, states, and fire IDs.
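Roughly like this (just a sketch: the per-year column names and stashing the distances directly on fires are my own choices, and the file paths are the urban_1990/2000/2010 layers from above):
library(sf)
library(nabor)

fires <- st_read('data/fire/fpa-fod/fpa_conus.gpkg')
fire_coords <- st_coordinates(fires)  # assumes POINT geometries, one row per fire

for (yr in c(1990, 2000, 2010)) {
  urban <- st_read(paste0('data/anthro/wui/urban_', yr, '.gpkg'))
  urban_polys <- st_transform(st_cast(urban, 'POLYGON'), st_crs(fires))
  urban_coords <- st_coordinates(urban_polys)

  # one KNN search per year: every fire gets its nearest urban vertex and distance
  nn <- nabor::knn(data = urban_coords[, c('X', 'Y')], query = fire_coords, k = 1)
  fires[[paste0('distance_to_urban_', yr)]] <- nn$nn.dists[, 1]
}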
ohh ok just got that - hmmmm. Let me play around with that. Very clever idea... Thanks - be in touch
@NateMietk seems like this is fixed! Can we close this issue?
Yup! Closing
https://github.com/NateMietk/human-ignitions-wui/blob/8104e8bdbe0d857fb7605ec3e1eead7cf58f909b/src/R/analysis/5_create_distance.R#L77-L147
@mbjoseph I was hoping to get this sorted, but it doesn't seem to be working. When run on a small subset this code works great, but as soon as it is scaled out it would take something like 15000 hours to complete. Not going to happen. This iteration uses for loops, but I have had the same results when using lapply. This iteration is also single core, as when it is run in parallel the shapefiles are so large that it takes days to push everything to multicore. I chose to subset the runs based on state because the output list of all 1.8 million fire points was too large and would result in the session crashing. Any thoughts on how to improve the speed of this function? Happy to sit down and chat if you have time. Thanks!