ailich / GLCMTextures

This R package calculates the most common gray-level co-occurrence matrix (GLCM) texture metrics used for spatial analysis on raster data.
https://ailich.github.io/GLCMTextures/
GNU General Public License v3.0
12 stars 4 forks source link

parallel processing? #25

Open hakimabdi opened 8 months ago

hakimabdi commented 8 months ago

Hi, and thank you for this wonderful package. Do you know how to parallelize or otherwise speedup both the quantization and the GLCM creation? I am working with large raster tiles and it takes a long time for the GLCMs to be created per tile.

ailich commented 8 months ago

@hakimabdi, I don't have parallel processing implement at the moment, but it is something I've been interested in adding for a bit. I'll do a bit of tests this week and see if it's something I can feasibily do.

ailich commented 8 months ago

@hakimabdi, as of now parallel processing isn't directly implemented for focal and focalCpp in the terra package which is what GLCMTextures uses. I tried to do a custom implementation by breaking the raster into chunks and sending out each chunk to a core, but that doesn't seem to work on terra objects. I have brought this up on the terra issue about parallelization (rspatial/terra#36) so maybe somebody there will have some guidance on how it could be implemented.

hakimabdi commented 8 months ago

Thank you for looking into this and for developing this very useful package. I'll just continue to do it the normal until a solution is developed within terra. Cheers!

On Tue, 14 Nov 2023 at 20:43, Alex @.***> wrote:

@hakimabdi https://github.com/hakimabdi, as of now parallel processing isn't directly implemented for focal and focalCpp in the terra package which is what GLCMTextures uses. I tried to do a custom implementation by breaking the raster into chunks and sending out each cunk to a core, but that doesn't seem to work on terra objects. I have brought this up on the terra issue about parallelization (rspatial/terra#36 https://github.com/rspatial/terra/issues/36) so maybe somebody there will have some guidance on how it could be implemented.

— Reply to this email directly, view it on GitHub https://github.com/ailich/GLCMTextures/issues/25#issuecomment-1811082330, or unsubscribe https://github.com/notifications/unsubscribe-auth/AASV4H4TTJ2UUKL4LMU6O7DYEPCVHAVCNFSM6AAAAAA7GPZWMGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMJRGA4DEMZTGA . You are receiving this because you were mentioned.Message ID: @.***>

ailich commented 8 months ago

@hakimabdi, I've actually now got it to work using the wrap and unwrap functions. The performance benefit seems to increase with larger raster objects. I may try to build it in at some point, but for now you could adapt the code below. In this code, I've parallelized the texture calculations and not the quantization since the quantization needs the data from the entire raster to determine quantiles or the overall max/min values in the dataset. That being said, the reclassify step of quantize_raster could be parallelized and the max/min of all the chunks could be calculated in parallel and then the max/min of those values calcuated to get the overall max/min.

With this method, you'll need to run the quantization first separately in serial and then calculate the textures on the quantized raster. Since the raster is already quantized be sure to set quantization to "none" in glcm_textures. To speed up the quantize_raster step, you could use the "equal range" quantization method, or use the maxcell parameter that I just added which can be used to specify a number of samples to use for "equal prob" to estimate the quantiles instead of reading in the entire dataset.

library(GLCMTextures)
#> Loading required package: terra
#> Warning: package 'terra' was built under R version 4.3.1
#> terra 1.7.47
library(future)
library(future.apply)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:terra':
#> 
#>     intersect, union
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(microbenchmark)
#> Warning: package 'microbenchmark' was built under R version 4.3.2

# Set up
dims<- 3500
n_grey<- 16
set.seed(5)
rq<-rast(matrix(data = sample(0:(n_grey-1), size = dims^2, replace = TRUE), nrow = dims, ncol=dims))
ncores<- parallelly::availableCores(omit=1)

w<- c(3,3)
plot(rq, col=grey.colors(100))


# Create breaks for chunking raster
buffer<- (w[1]-1)/2 # Buffer for focal window
breaks<- data.frame(start = ceiling(seq(1, nrow(rq), length.out = ncores+1)))
breaks<- breaks %>% mutate(end=lead(start, n=1)+buffer)
breaks$end[breaks$end > nrow(rq)]<- nrow(rq)
breaks<- breaks[-nrow(breaks),]

# Put raster chunks in list
r_list<- vector(mode = "list", length = ncores)
for (i in 1:ncores) {
  r_list[[i]]<- wrap(rq[breaks$start[i]:breaks$end[i], ,drop=FALSE])
}

glcmtextures_parallel<- function(x, ...){
  wrap(glcm_textures(unwrap(x),...))
}

# In serial
ts1<- Sys.time()
serial<- glcm_textures(rq, w=w, n_levels = n_grey, quantization = "none")
#> |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          
ts2<- Sys.time()
difftime(ts2, ts1, units = "secs")
#> Time difference of 1344.308 secs

#In Parallel
tp1<- Sys.time()
plan(strategy = "multisession", workers=ncores) #Set up parallel
res_list<- future_lapply(r_list, FUN = glcmtextures_parallel, w=w, n_levels = n_grey, quantization = "none")
#> |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          
plan(strategy = "sequential") #Go back to serial processing
res_list<- lapply(res_list, unwrap)
parallel<- do.call(terra::merge, res_list) #Merge back into single raster
tp2<- Sys.time()
difftime(tp2, tp1, units = "secs")
#> Time difference of 427.9501 secs

Created on 2023-11-15 with reprex v2.0.2

hakimabdi commented 8 months ago

Thank you very much, @ailich , this workaround is 5x faster on my data. The quantize_raster() function was always fast for me, but the glcm_textures() was the one that took hours to complete. I'll keep using your approach going forward until something better comes around.

ailich commented 8 months ago

@hakimabdi, glad this helped speed up your processing! For now, you can use this script, but at some point I hope to add this into the function itself so it's as simple as specifying the number of cores and then this is all done automatically. You can also try using plan="multicore" if you have Mac or Linux which implements "fork" parallelization. I only have Windows so I haven't tried it, but it's generally faster but there can be errors due to cross-contamination by shared memory among the cores, so you'd need to do a bit of testing to make sure the results come out correctly.

hakimabdi commented 8 months ago

@ailich thanks again. I am on a Windows 10 machine. I made a custom function based on your helpful code. I hope it is of use to others who are also looking for parallelize the calculation of GLCM textures.

library(future)
library(future.apply)
library(dplyr)
library(GLCMTextures)
library(terra)

glcmParallel = function(quantized, window.size, glevels, ncores){  # Begin function

# Set up parameters
  rq <- quantized
  n_grey <- glevels
  w <- window.size # c(3,3)
  ncores <- ncores

  buffer <- (w[1]-1)/2 # Buffer for focal window
  breaks<- data.frame(start = ceiling(seq(1, nrow(rq), length.out = ncores+1)))
  breaks<- breaks %>% mutate(end=lead(start, n=1)+buffer)
  breaks$end[breaks$end > nrow(rq)]<- nrow(rq)
  breaks<- breaks[-nrow(breaks),]

  # Put raster chunks in list
  r_list<- vector(mode = "list", length = ncores)
  for (i in 1:ncores) {
    r_list[[i]]<- wrap(rq[breaks$start[i]:breaks$end[i], ,drop=FALSE])
  }

  glcmtextures_parallel<- function(x, ...){
    wrap(glcm_textures(unwrap(x),...))
  }

  plan(strategy = "multisession", workers=ncores) #Set up parallel
  res_list<- future_lapply(r_list, FUN = glcmtextures_parallel, w=w, n_levels = n_grey, quantization = "none")
  plan(strategy = "sequential") #Go back to serial processing
  res_list<- lapply(res_list, unwrap)
  parallel<- do.call(terra::merge, res_list) #Merge back into single raster
  names(parallel) = sub(pattern = "glcm_", x = names(parallel), replacement = paste0(w[1],"x",w[1]))
  return(parallel)
  #terra::writeRaster(parallel, paste0("texture_",glevels,"_",paste0(w[1],"x",w[1]),".tif"))
  gc() # Clear memory
} # End function

# system.time({
# ncores<- parallelly::availableCores(omit=2)
# testParallel = glcmParallel(quantized = quantized, window.size = c(3,3), glevels = 16, ncores = ncores)
# })
ailich commented 7 months ago

@hakimabdi, I noticed there's a bit more work to be done here. I'm seeing issues stitching the chunks together when the window size is not 3x3 or when na.rm=TRUE. There's something wrong with the calculated buffer so I'll need to look into that.

ailich commented 7 months ago

@hakimabdi, I figured out the issue. The buffer needs to be applied to the starting point of later chunks so that the first few rows have access to the cells above them for focal calculations, and then the buffer area needs to be removed before merging so that the proper version is stored in areas of overlap between the chunks.

Old Code

library(GLCMTextures)
#> Loading required package: terra
#> terra 1.7.62
library(future)
library(future.apply)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:terra':
#> 
#>     intersect, union
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Set up
dims<- 100
n_grey<- 16
set.seed(5)
rq<-rast(matrix(data = sample(0:(n_grey-1), size = dims^2, replace = TRUE), nrow = dims, ncol=dims))
ncores<- parallelly::availableCores(omit=1)

w<- c(7,7)

# Create breaks for chunking raster
buffer<- (w[1]-1)/2 # Buffer for focal window
breaks<- data.frame(start = ceiling(seq(1, nrow(rq), length.out = ncores+1)))
breaks<- breaks %>% mutate(end=lead(start, n=1)+buffer)
breaks$end[breaks$end > nrow(rq)]<- nrow(rq)
breaks<- breaks[-nrow(breaks),]

# Put raster chunks in list
r_list<- vector(mode = "list", length = ncores)
for (i in 1:ncores) {
  r_list[[i]]<- wrap(rq[breaks$start[i]:breaks$end[i], ,drop=FALSE])
}

glcmtextures_parallel<- function(x, ...){
  wrap(glcm_textures(unwrap(x),...))
}

serial<- glcm_textures(rq, w=w, n_levels = n_grey, quantization = "none", na.rm=TRUE)

plan(strategy = "multisession", workers=ncores) #Set up parallel
res_list<- future_lapply(r_list, FUN = glcmtextures_parallel, w=w, n_levels = n_grey, quantization = "none", na.rm=TRUE)

plan(strategy = "sequential") #Go back to serial processing
res_list<- lapply(res_list, unwrap)
parallel<- do.call(terra::merge, res_list) #Merge back into single raster

plot(parallel==serial) #Errors at seams of chunks

Created on 2023-12-08 with reprex v2.0.2

New Code

library(GLCMTextures)
#> Loading required package: terra
#> terra 1.7.62
library(future)
library(future.apply)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:terra':
#> 
#>     intersect, union
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

combine_raster_chunks<- function(x, buffer){
  for (i in 1:length(x)) {
    if(i!=1){x[[i]]<- x[[i]][(buffer+1):nrow(x[[i]]), , , drop=FALSE]} #Trim top of chunk
    if(i!=length(x)){x[[i]]<- x[[i]][1:(nrow(x[[i]])-buffer), , , drop=FALSE]} #Trim bottom of chunk
  }
  out<- do.call(merge, x)
  return(out)
}

# Set up
dims<- 100
n_grey<- 16
set.seed(5)
rq<-rast(matrix(data = sample(0:(n_grey-1), size = dims^2, replace = TRUE), nrow = dims, ncol=dims))
ncores<- parallelly::availableCores(omit=1)

w<- c(7,7)

# Create breaks for chunking raster
buffer<- (w[1]-1)/2 # Buffer for focal window
breaks<- data.frame(write_start = ceiling(seq(1, nrow(rq)+1, length.out = ncores+1)))
breaks<- breaks %>% mutate(write_end=lead(write_start, n=1)-1)
breaks<- breaks %>% mutate(chunk_start=write_start - buffer)
breaks<- breaks %>% mutate(chunk_end = write_end + buffer)
breaks<- breaks[-nrow(breaks),]
breaks$chunk_end[breaks$chunk_end > nrow(rq)]<- nrow(rq)
breaks$chunk_start[breaks$chunk_start < 1]<- 1

# Put raster chunks in list
r_list<- vector(mode = "list", length = ncores)
for (i in 1:ncores) {
  r_list[[i]]<- wrap(rq[breaks$chunk_start[i]:breaks$chunk_end[i], ,drop=FALSE])
}

glcmtextures_parallel<- function(x, ...){
  wrap(glcm_textures(unwrap(x),...))
}

serial<- glcm_textures(rq, w=w, n_levels = n_grey, quantization = "none", na.rm=TRUE)

plan(strategy = "multisession", workers=ncores) #Set up parallel
res_list<- future_lapply(r_list, FUN = glcmtextures_parallel, w=w, n_levels = n_grey, quantization = "none", na.rm=TRUE)
plan(strategy = "sequential") #Go back to serial processing
res_list<- lapply(res_list, unwrap)

parallel<- combine_raster_chunks(res_list, buffer) #Merge back into single raster

plot(parallel==serial) #Values match exactly

Created on 2023-12-08 with reprex v2.0.2