NateMietk / human-ignitions-wui

These scripts comprise the full methodology for Mietkiewicz et al. (2018) to determine the costs and consequences of human wildfire ignition at the wildland-urban interface.
GNU General Public License v3.0

Slow when running single core and ancillary data too large for parallel.... #5

Closed NateMietk closed 6 years ago

NateMietk commented 6 years ago

https://github.com/NateMietk/human-ignitions-wui/blob/8104e8bdbe0d857fb7605ec3e1eead7cf58f909b/src/R/analysis/5_create_distance.R#L77-L147

@mbjoseph I was hoping to get this sorted, but it doesn't seem to be working. When run on a small subset this code works great, but as soon as it is scaled out it would take something like 15,000 hours to complete. Not going to happen. This iteration uses for loops, but I got the same results when using lapply. It is also single core, because when run in parallel the shapefiles are so large that it takes days to push everything out to the cores. I chose to subset the runs by state because the output list of all 1.8 million fire points was too large and would crash the session. Any thoughts on how to improve the speed of this function? Happy to sit down and chat if you have time.

Thanks!

mbjoseph commented 6 years ago

Hey @NateMietk - can you tell me how to reproduce the issue? It would help me help you!

NateMietk commented 6 years ago

Hey @mbjoseph sure - sorry about that!

Ok you can start a small EC2 instance (m5.2xlarge).

Pull down the GitHub repo: git clone https://github.com/NateMietk/human-ignitions-wui

Transfer these files:

s3://earthlab-natem/human-ignitions-wui/fire/fpa-fod/fpa_wui_conus.gpkg
s3://earthlab-natem/human-ignitions-wui/anthro/wui/urban_1990.gpkg
s3://earthlab-natem/human-ignitions-wui/anthro/wui/urban_2000.gpkg
s3://earthlab-natem/human-ignitions-wui/anthro/wui/urban_2010.gpkg

Start running the scripts in this order (all found in src/R):

  1. The entire 1_create_folders.R
  2. Just the section to import the fpa_wui object (line 204) from 3_clean_data.R
  3. Then 5_create_distance.R is the meat. The first half is importing the urban polygon layers and the latter half is prepping the fpa_wui and running the distance function.
  4. Note that all helper functions used (e.g., get_polygons) can be found in src/functions/helper_functions.R; this is automatically sourced by the 1_create_folders.R script. A rough sketch of this run order follows the list.
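Sketch of that run order (paths guessed from the repo layout, so double-check them):

# sketch of the run order; paths assumed from the repo layout
source('src/R/1_create_folders.R')            # step 1: also sources src/functions/helper_functions.R
# step 2: run just the fpa_wui import block (around line 204) of 3_clean_data.R by hand
source('src/R/analysis/5_create_distance.R')  # step 3: imports urban layers, preps fpa_wui, runs the distance function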

P.S. If you want to see how I implemented lapply and sfLapply, just take a peek at the history of 5_create_distance.R.

mbjoseph commented 6 years ago

Hey @NateMietk

Here's a way to compute KNN between all fires and urban point centroids, using the 2010 urban data as an example. It only takes a second to do the KNN search. Does this give you enough to work with?

library(sf)
library(nabor)

# load fire ignition data (takes a minute or two)
fires <- st_read('data/fire/fpa-fod/fpa_conus.gpkg')

# find centroids for all of the urban areas
urban <- st_read('data/anthro/wui/urban_2010.gpkg')
urban_polys <- st_cast(urban, 'POLYGON')
urban_centroids <- st_centroid(urban_polys)

# match spatial CRS
urban_centroids <- st_transform(urban_centroids, st_crs(fires))

# assert that fires and urban centroids have same CRS
stopifnot(st_crs(fires) == st_crs(urban_centroids))

# compute KNN between fires and urban point centroids
fire_point_coords <- st_coordinates(fires)
urban_point_coords <- st_coordinates(urban_centroids)
nearest_neighbors <- knn(urban_point_coords, fire_point_coords, k = 4)
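If it helps, the result should come back as a pair of matrices (nn.idx and nn.dists), one row per fire:

# nn.idx:   row indices into urban_point_coords, i.e. which 4 centroids are closest to each fire
# nn.dists: Euclidean distances to those centroids, in the units of the fires CRS
str(nearest_neighbors)
head(nearest_neighbors$nn.idx)
head(nearest_neighbors$nn.dists)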
NateMietk commented 6 years ago

Hey @mbjoseph I had already done this, and it's still super slow. I implemented the full KNN instead of an ID-by-ID KNN, but either way it is brutally slow. California hasn't even reached 1% complete and it's been 30 minutes. And the polygon output is still huge and can't be parallelized. Regardless of whether I do the KNN up front or inline, it is just as slow. I updated the code if you want to see it. Any other suggestions would be appreciated.

N

mbjoseph commented 6 years ago

@NateMietk maybe I'm not remembering the next step after this. After finding the KNN for a fire, are you then trying to find the minimum distance to an urban polygon within the set of KNN?

Other question: which part is slow? Have you tried profiling your code to see?

NateMietk commented 6 years ago

@mbjoseph yes, that is correct. I have to generate a distance from each point to the 4 KNN polygons and then take the minimum distance. It all works, and works well, when I try it on a slice of the data, but it doesn't play nicely with the whole data set.

Bogs down here (line 114):

      # loop over every fire ID within the current state subset
      for (h in 1:length(unique_ids)) {
        fpa_ids <- unique_ids[h]

        # pull out this fire's row and its pre-computed nearest urban centroids
        fpa_df <- subset(state_df, state_df$FPA_ID == fpa_ids)

        closest_centroids <-
          subset(nearest_neighbors,
                 nearest_neighbors$FPA_ID == fpa_ids)

        # distance from this fire to each of its k nearest centroids, keeping the minimum
        distance_to_fire[[h]] <- fpa_df %>%
          dplyr::select(-FPA_ID) %>%
          mutate(
            distance_to_urban = min(
              st_distance(
                st_geometry(closest_centroids),
                st_geometry(.),
                by_element = TRUE
              )
            ),
            FPA_ID = data.frame(fpa_df)$FPA_ID
          )
        setTxtProgressBar(pb, h)
      }
mbjoseph commented 6 years ago

Cool - @NateMietk that makes sense. Which operations in that inner loop are sucking up the most time? You might try https://github.com/rstudio/profvis to find out quickly.
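Something like this should do it - wrap whichever chunk you want to time:

library(profvis)

profvis({
  # paste the state/FPA_ID loop here to see which calls dominate;
  # the placeholder work below is only so this sketch produces a profile on its own
  for (i in 1:20) x <- sort(runif(1e6))
})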

mbjoseph commented 6 years ago

Also - can you use KNN w/ k=1 here instead of computing all pairwise distances between the fire and polygon vertices then taking the min?

NateMietk commented 6 years ago

@mbjoseph sweet package - that is so useful!

Looks like it is the subsets that are taking up the time. The actual distance calculation is very speedy. The trouble is that I cannot do this by row if I have more than 1 KNN, so I have to do it by FPA_ID splits.

Things crashed, so I am going to re-run and send a screenshot.

NateMietk commented 6 years ago
(screenshot of profiling output attached: 2018-05-14 18:44)

Slight performance increase using filter instead of subset.
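The swap was along these lines (sketch, same objects as in the loop above):

# dplyr::filter() in place of subset() inside the per-fire loop
fpa_df <- dplyr::filter(state_df, FPA_ID == fpa_ids)
closest_centroids <- dplyr::filter(nearest_neighbors, FPA_ID == fpa_ids)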

mbjoseph commented 6 years ago

Oh yeah profvis is great.

I thought about this some more, and I think you might be able to just use KNN the whole way and avoid a good deal of shenanigans. Expanding on my last example for one of the urban layers, you can compute the nearest urban polygon vertex for every fire as follows, with the KNN step taking ~11 sec:

library(sf)
library(nabor)

# load fire ignition data (takes a minute or two)
fires <- st_read('data/fire/fpa-fod/fpa_conus.gpkg')

# find centroids for all of the urban areas
urban <- st_read('data/anthro/wui/urban_2010.gpkg')
urban_polys <- st_cast(urban, 'POLYGON') %>%
  st_transform(st_crs(fires))

# assert that fires and urban polys have same CRS
stopifnot(st_crs(fires) == st_crs(urban_polys))

# get coords
fire_point_coords <- st_coordinates(fires)
urban_coords <- st_coordinates(urban_polys)

# compute KNN between fires and urban poly vertices
system.time(nearest_neighbors <- nabor::knn(data = urban_coords[, c('X', 'Y')], 
                                            fire_point_coords, 
                                            k = 1))

Maybe simpler @NateMietk?

NateMietk commented 6 years ago

KNN isn't the time sink though - it is apparently how I am subsetting the data...

mbjoseph commented 6 years ago

Right - but with the workflow above, you don't need to do any of the data subsetting. You immediately get the index for the urban polygon and the distance for each fire. So, you just have to loop over years, rather than years, states, and fire IDs.
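For instance, picking up from the k = 1 example above, the per-fire distance and the matched urban polygon fall straight out of the KNN result (sketch; column names assume POLYGON geometries):

# nn.dists[, 1] is each fire's distance to the nearest urban polygon vertex,
# in the units of the fires CRS (metres if projected)
fires$distance_to_urban <- nearest_neighbors$nn.dists[, 1]

# st_coordinates() on POLYGON geometries returns X, Y, L1 (ring), L2 (feature index),
# so L2 maps each matched vertex back to its row in urban_polys
fires$nearest_urban_poly <- urban_coords[nearest_neighbors$nn.idx[, 1], 'L2']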

NateMietk commented 6 years ago

ohh ok just got that - hmmmm. Let me play around with that. Very clever idea... Thanks - be in touch

mbjoseph commented 6 years ago

@NateMietk seems like this is fixed! Can we close this issue?

NateMietk commented 6 years ago

Yup! Closing