Closed by Nowosad 2 years ago
Hi @micha-silver - I have thought about this issue a bit. My current thinking is:
@micha-silver, what do you think? Do you have other ideas? Btw - could you (privately) share your raster data with me? I would use it for the development and testing of the new code...
Hi Jakub, thanks for looking into this. I think the option for parallel processing over sub-regions would be very attractive and would outweigh any problems with border superpixels. (After classifying the superpixels, we intend to dissolve superpixels with the same category along the border.)
The full Geotiff of our study region is 4.6 GB (14 bands, with 0.5 meter resolution). I'll try to upload to Google Drive and send you the link. Regards, Micha
I made some progress on this issue (see the image below). I plan to find some more time for this in the next few days and will keep you updated with any changes.
Hi @micha-silver - now I have something to show you - you can find the new code in the `chunks` branch. Long story short - I added a new argument `chunks`: "Should the input (`x`) be split into chunks before deriving supercells? Either `FALSE` (default), `TRUE` (only large input objects are split), or a numeric value (representing the side length of the chunk in the number of cells)."
```r
remotes::install_github("nowosad/supercells@chunks")

library(supercells)
library(terra)
library(sf)
stk = rast("~/Science/stk_full.tif")
stk
## class       : SpatRaster
## dimensions  : 16346, 11290, 14  (nrow, ncol, nlyr)
## resolution  : 0.5, 0.5  (x, y)
## extent      : 320766, 326411, 7882586, 7890759  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 33S (EPSG:32733)
## source      : stk_full.tif
## names       : blue, green, red, nir, variance, contrast, ...
## min values  : -5.746252e+01, -5.687078e+01, -8.405495e+01, -8.310928e+01, 3.590739e+00, 0.000000e+00, ...
## max values  : 5.771508e+02, 7.438765e+02, 1.014005e+03, 1.151568e+03, 9.084221e+02, 1.386000e+02, ...
```
For example, when `chunks` is set to 5000, the input data is split into 5000-by-5000 non-overlapping squares, superpixels are created for each square, and the results for all squares are merged. Processing these squares one at a time limits the RAM usage.
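The chunking idea can be sketched roughly like this (an illustrative sketch only, not the package internals; the helper name `supercells_chunked` and the loop structure are my assumptions):

```r
library(terra)
library(supercells)

# Illustrative sketch of chunked processing: crop the raster into
# non-overlapping squares of at most chunk_size x chunk_size cells,
# segment each one, then bind the per-chunk sf results together.
supercells_chunked = function(x, chunk_size, ...) {
  rx = res(x)[1]; ry = res(x)[2]
  out = list()
  for (r0 in seq(1, nrow(x), by = chunk_size)) {
    for (c0 in seq(1, ncol(x), by = chunk_size)) {
      # extent of one chunk, computed from the raster origin and resolution
      e = ext(xmin(x) + (c0 - 1) * rx,
              min(xmin(x) + (c0 - 1 + chunk_size) * rx, xmax(x)),
              max(ymax(x) - (r0 - 1 + chunk_size) * ry, ymin(x)),
              ymax(x) - (r0 - 1) * ry)
      out[[length(out) + 1]] = supercells(crop(x, e), ...)
    }
  }
  do.call(rbind, out)   # merge per-chunk sf data frames
}
```

Because each `crop()` reads only one square into memory, peak RAM usage is bounded by the chunk size rather than the full raster.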
It took about 2 hours on my computer to get the results for the whole area.
```r
system.time({superpixels = supercells(x = stk,
                                      step = 20,
                                      compactness = 0.5,
                                      dist_fun = "jsd",
                                      minarea = 16,
                                      chunks = 5000)})
##     user   system  elapsed
## 8378.066   67.886 8465.853
```
The new updates also allow using the future package. If you specify your `plan()` properly, then superpixels for the "squares" will be calculated in parallel.
For this example, I used 12 cores, which returned results in about 15 minutes (~9 times faster).
```r
library(future)
plan(multisession, workers = 12)
system.time({superpixels = supercells(x = stk,
                                      step = 20,
                                      compactness = 0.5,
                                      dist_fun = "jsd",
                                      minarea = 16,
                                      chunks = 5000)})
##    user  system elapsed
##  67.417   4.127 916.180
```
@micha-silver please test this code and let me know what you think. Any feedback about computations or the API will be useful.
Hi Jakub: Many thanks for working on this! We will indeed test and report back.
Regards, Micha
Hi Jakub
The function with the chunk option runs to completion, with no memory problems. However, the output does not cover the whole region. It seems some of the chunks are missing somehow:
The full region
Zoom to intersection
Further details:
```
> str(superpixels)
Classes ‘sf’ and 'data.frame':	505191 obs. of 22 variables:
 $ supercells: num  1 2 3 4 5 6 7 8 9 10 ...
 $ x         : num  320772 320772 320772 320772 320772 ...
 $ y         : num  7890753 7890744 7890734 7890724 7890714 ...
 $ blue      : num  3.2e-06 3.2e-06 3.2e-06 3.2e-06 3.2e-06 ...
 .... (other variables)
 $ geometry  :sfc_MULTIPOLYGON of length 505191; first list element: List of 1
  ..$ :List of 1
  .. ..$ : num [1:5, 1:2] 0.000354 0.000354 0.00372 0.00372 0.000354 ...
  ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
 - attr(*, "sf_column")= chr "geometry"
 - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
  ..- attr(*, "names")= chr [1:21] "supercells" "x" "y" "blue" ...
```
Note the xmin, ymin values in the bbox:
```
> st_bbox(superpixels)
     xmin      ymin      xmax      ymax
      0.0       0.0  326409.5 7890758.0
```
When I zoom in to coords (0,0), I can see the "missing" chunks, all overlapped between (0,0) and (1,1).
??
I would add that the first time I ran this (using future and `plan(multicore)`), I got all the chunks overlapped between (0,0) and (1,1). Then I reran without future, and this is the result. Furthermore, I did not see any time gain when using future. In either case, `supercells()` completed on the full region in about 80 mins.
On my system:
```
> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 11 (bullseye)

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.13.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] fusionImage_0.0.1   supercells_0.3.0    remotes_2.4.1
 [4] exactextractr_0.7.1 glcm_1.6.5          RStoolbox_0.2.6
 [7] terra_1.4-11        raster_3.5-2        sp_1.4-5
[10] sf_1.0-3            forcats_0.5.1       stringr_1.4.0
[13] dplyr_1.0.7         purrr_0.3.4         readr_2.0.2
[16] tidyr_1.1.4         tibble_3.1.5        ggplot2_3.3.5
[19] tidyverse_1.3.1

loaded via a namespace (and not attached):
 [1] nlme_3.1-153         fs_1.5.0               lubridate_1.8.0
 [4] doParallel_1.0.16    httr_1.4.2             tools_4.1.1
 [7] backports_1.3.0      rgdal_1.5-27           utf8_1.2.2
[10] R6_2.5.1             rpart_4.1-15           KernSmooth_2.23-20
[13] rgeos_0.5-8          DBI_1.1.1              colorspace_2.0-2
[16] nnet_7.3-16          withr_2.4.2            tidyselect_1.1.1
[19] curl_4.3.2           compiler_4.1.1         cli_3.1.0
[22] rvest_1.0.2          xml2_1.3.2             scales_1.1.1
[25] classInt_0.4-3       proxy_0.4-26           digest_0.6.28
[28] pkgconfig_2.0.3      parallelly_1.28.1      dbplyr_2.1.1
[31] rlang_0.4.12         readxl_1.3.1           rstudioapi_0.13
[34] generics_0.1.1       jsonlite_1.7.2         ModelMetrics_1.2.2.2
[37] magrittr_2.0.1       geosphere_1.5-14       Matrix_1.3-4
[40] Rcpp_1.0.7           munsell_0.5.0          fansi_0.5.0
[43] lifecycle_1.0.1      stringi_1.7.5          pROC_1.18.0
[46] MASS_7.3-54          plyr_1.8.6             recipes_0.1.17
[49] grid_4.1.1           parallel_4.1.1         listenv_0.8.0
[52] crayon_1.4.2         lattice_0.20-45        haven_2.4.3
[55] splines_4.1.1        hms_1.1.1              pillar_1.6.4
[58] future.apply_1.8.1   reshape2_1.4.4         codetools_0.2-18
[61] stats4_4.1.1         reprex_2.0.1           XML_3.99-0.8
[64] glue_1.4.2           philentropy_0.6.0.9000 data.table_1.14.2
[67] modelr_0.1.8         vctrs_0.3.8            tzdb_0.2.0
[70] foreach_1.5.1        cellranger_1.1.0       gtable_0.3.0
[73] future_1.23.0        assertthat_0.2.1       gower_0.2.2
[76] prodlim_2019.11.13   broom_0.7.10           e1071_1.7-9
[79] class_7.3-19         survival_3.2-13        timeDate_3043.102
[82] iterators_1.0.13     units_0.7-2            lava_1.6.10
[85] globals_0.14.0       ellipsis_0.3.2         caret_6.0-90
[88] ipred_0.9-12
```
I want to add another comment, not related to the above. I see that regions where the raster value is NA get split into square superpixels:

Is that intended? Is that a result of my `minarea` parameter value? I would expect no superpixels where there is no data in the raster.
Thanks, Micha
Hi @micha-silver - thanks for the feedback. I will work on the first issue soon. Regarding the second issue: the black cells in your image above are not set as no data -- they actually contain some values. You need to set these values to nodata in each layer before creating superpixels.
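A quick way to do that (a sketch; the placeholder value 0 below is only an assumption -- use whatever your black cells actually contain):

```r
library(terra)

stk = rast("~/Science/stk_full.tif")

# Replace an assumed placeholder value (here: 0) with NA in every layer,
# so supercells() treats those cells as no data.
stk[stk == 0] = NA

# Alternatively, if the file uses one consistent placeholder value,
# declare it as the raster's nodata flag instead:
# NAflag(stk) = 0
```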
Hi @micha-silver - I tested the code again -- it worked fine with and without the parallelization.

Could you try to install the updates again (I added tiny changes) and run the (non-parallel) code in a fresh R session? You could maybe also try to install the most recent terra first, if possible...
```r
remotes::install_github("nowosad/supercells@chunks")
```
My session info:
```
─ Session info ────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.0 (2021-05-18)
 os       Red Hat Enterprise Linux
 system   x86_64, linux-gnu
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Warsaw
 date     2021-11-06

─ Packages ────────────────────────────────────────────────────────────────────
 ! package      * version    date       lib source
   cachem         1.0.6      2021-08-19 [1] CRAN (R 4.1.0)
   callr          3.7.0      2021-04-20 [1] CRAN (R 4.1.0)
   class          7.3-19     2021-05-03 [2] CRAN (R 4.1.0)
   classInt       0.4-3      2020-04-07 [1] CRAN (R 4.1.0)
   cli            3.1.0      2021-10-27 [1] CRAN (R 4.1.0)
   codetools      0.2-18     2020-11-04 [2] CRAN (R 4.1.0)
   crayon         1.4.2      2021-10-29 [1] CRAN (R 4.1.0)
   DBI            1.1.1      2021-01-15 [1] CRAN (R 4.1.0)
   desc           1.3.0      2021-03-05 [1] CRAN (R 4.1.0)
   devtools       2.4.2      2021-06-07 [1] CRAN (R 4.1.0)
   digest         0.6.28     2021-09-23 [1] CRAN (R 4.1.0)
   dplyr          1.0.7      2021-06-18 [1] CRAN (R 4.1.0)
   e1071          1.7-9      2021-09-16 [1] CRAN (R 4.1.0)
   ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.1.0)
   evaluate       0.14       2019-05-28 [1] CRAN (R 4.1.0)
   fansi          0.5.0      2021-05-25 [1] CRAN (R 4.1.0)
   fastmap        1.1.0      2021-01-25 [1] CRAN (R 4.1.0)
   fs             1.5.0      2020-07-31 [1] CRAN (R 4.1.0)
   future       * 1.22.1     2021-08-25 [1] CRAN (R 4.1.0)
   future.apply   1.8.1      2021-08-10 [1] CRAN (R 4.1.0)
   generics       0.1.1      2021-10-25 [1] CRAN (R 4.1.0)
   globals        0.14.0     2020-11-22 [1] CRAN (R 4.1.0)
   glue           1.4.2      2020-08-27 [1] CRAN (R 4.1.0)
   htmltools      0.5.2      2021-08-25 [1] CRAN (R 4.1.0)
   KernSmooth     2.23-20    2021-05-03 [2] CRAN (R 4.1.0)
   knitr          1.33       2021-04-24 [1] CRAN (R 4.1.0)
   lifecycle      1.0.1      2021-09-24 [1] CRAN (R 4.1.0)
   listenv        0.8.0      2019-12-05 [1] CRAN (R 4.1.0)
   magrittr       2.0.1      2020-11-17 [1] CRAN (R 4.1.0)
   memoise        2.0.0      2021-01-26 [1] CRAN (R 4.1.0)
   parallelly     1.28.1     2021-09-09 [1] CRAN (R 4.1.0)
   philentropy    0.6.0.9000 2021-10-30 [1] Github (drostlab/philentropy@543e115)
   pillar         1.6.4      2021-10-18 [1] CRAN (R 4.1.0)
   pkgbuild       1.2.0      2020-12-15 [1] CRAN (R 4.1.0)
   pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.1.0)
   pkgload        1.2.1      2021-04-06 [1] CRAN (R 4.1.0)
   prettyunits    1.1.1      2020-01-24 [1] CRAN (R 4.1.0)
   processx       3.5.2      2021-04-30 [1] CRAN (R 4.1.0)
   proxy          0.4-26     2021-06-07 [1] CRAN (R 4.1.0)
   ps             1.6.0      2021-02-28 [1] CRAN (R 4.1.0)
   purrr          0.3.4      2020-04-17 [1] CRAN (R 4.1.0)
   R6             2.5.1      2021-08-19 [1] CRAN (R 4.1.0)
   ragg           1.1.3      2021-06-09 [1] CRAN (R 4.1.0)
   Rcpp           1.0.7      2021-07-07 [1] CRAN (R 4.1.0)
   remotes        2.4.0      2021-06-02 [1] CRAN (R 4.1.0)
   rlang          0.4.12     2021-10-18 [1] CRAN (R 4.1.0)
   rmarkdown      2.10       2021-08-06 [1] CRAN (R 4.1.0)
   rprojroot      2.0.2      2020-11-15 [1] CRAN (R 4.1.0)
   rstudioapi     0.13       2020-11-12 [1] CRAN (R 4.1.0)
   sessioninfo    1.1.1      2018-11-05 [1] CRAN (R 4.1.0)
   sf           * 1.0-3      2021-10-07 [1] CRAN (R 4.1.0)
 P supercells   * 0.3.0      2021-11-06 [?] local
   systemfonts    1.0.2      2021-05-11 [1] CRAN (R 4.1.0)
   terra        * 1.4-9      2021-10-07 [1] Github (rspatial/terra@681d182)
   testthat     * 3.0.4      2021-07-01 [1] CRAN (R 4.1.0)
   textshaping    0.3.5      2021-06-09 [1] CRAN (R 4.1.0)
   tibble         3.1.5      2021-09-30 [1] CRAN (R 4.1.0)
   tidyselect     1.1.1      2021-04-30 [1] CRAN (R 4.1.0)
   units          0.7-2      2021-06-08 [1] CRAN (R 4.1.0)
   usethis        2.0.1      2021-02-10 [1] CRAN (R 4.1.0)
   utf8           1.2.2      2021-07-24 [1] CRAN (R 4.1.0)
   vctrs          0.3.8      2021-04-29 [1] CRAN (R 4.1.0)
   withr          2.4.2      2021-04-18 [1] CRAN (R 4.1.0)
   xfun           0.25       2021-08-06 [1] CRAN (R 4.1.0)
   yaml           2.2.1      2020-02-01 [1] CRAN (R 4.1.0)
```
Yes, sure. I'll do some more testing and let you know.
I reinstalled supercells from the `chunks` branch, and also terra from GitHub.
Here is what I'm getting now on our server (Debian 11).
```
> system.time(superpixels <- supercells(x = stk_full, step = 20, compactness = 2,
+                                       minarea = 16, chunks = 6000), gcFirst = TRUE)
Step: 20
Initialization: Completed
Iteration: 10/10
Cleaning connectivity: Completed
Error in ...future.FUN(...future.X_jj, ...) : object 'sf' not found
Timing stopped at: 190.5 3.419 193.9
```
I'll do some more testing on other machines...
@micha-silver - there was a tiny typo in the code -- `:` instead of `::`. Now it is fixed. Please reinstall it once more and try it.

Also -- if any problems still occur -- I could try to split the parallel and non-parallel options into two separate arguments for debugging...
@Nowosad
```
Step: 20
Initialization: Completed
Iteration: 10/10
Cleaning connectivity: Completed
Error in names(slic_sf) <- c("supercells", "x", "y", "geometry") :
  'names' attribute [4] must be the same length as the vector [3]
In addition: Warning message:
In if (!in_memory(x)) { :
  the condition has length > 1 and only the first element will be used
```
Line 143 ??
Great find! This error occurs when some chunks have only NA values. I am now working on a fix.
OK -- now the code should still work even if some chunks are empty.
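The fix described above amounts to a guard along these lines (an illustrative sketch, not the actual package code; the helper name, `chunk_list`, and the passed-through arguments are assumptions):

```r
library(terra)
library(supercells)

# Skip a chunk whose cells are all NA instead of handing it to the
# SLIC core, where it would previously trigger the names() error.
process_chunk = function(chunk, ...) {
  if (all(is.na(values(chunk)))) return(NULL)  # empty chunk: nothing to segment
  supercells(chunk, ...)
}

# Apply over a list of cropped chunks and drop the empty ones:
# results = Filter(Negate(is.null),
#                  lapply(chunk_list, process_chunk, step = 20, compactness = 0.5))
```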
@Nowosad
Here's what we see so far. We ran this code successfully on one Windows computer (using `plan(sequential)`) and on our Debian Bullseye server with both `sequential` and `multicore` plans. Both finished for the whole region. Strangely, both took about the same time.

So I'd say this issue can be closed.
However, on a third Debian machine (my home computer) it continues to produce superpixels with some of the chunks left at (0,0) :-(. On each run, a seemingly random set of chunks is left in the lower left corner, overlapping between (0,0) and (1,1).

My machine and the server are almost identical in software (I had R 4.1.x at home, so I removed everything and downgraded back to the Debian default installation of 4.0.5 to match the server version), but the problem persists. This is clearly something local. Unlike the server, my home machine has an NVIDIA GeForce card with some GPUs; I have an idea that's what is causing the trouble.
I ran a short test script to try `future_lapply()` by looping along a growing string of numbers and calculating the mean each time. This ran fine: the `sequential` plan took about 4 mins (1:1e6 numbers), while the parallel plans took about 15 and 20 secs each. So in general future is working.
In order to debug this further, can you tell me what I have to change in the `supercells()` function to revert back to no parallelization? I'd like to try without loading `future.apply` at all.

Again, thanks for your help.

Would it be enough to replace `future.apply::future_apply` on line 121 with a simple `apply`?
Thanks for the update @micha-silver.

For your tests, I have added a new argument `future` - https://github.com/Nowosad/supercells/commit/1dcf4c3580528e8243f45c4f951ad7e7475bff65. By default it is `FALSE`, so the `apply` function is used instead of `future_apply()`.
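Assuming the new argument behaves as described, a debugging run might look like this (a sketch based on the parameters used earlier in this thread, not tested here):

```r
library(supercells)
library(terra)
library(future)

stk = rast("~/Science/stk_full.tif")

# future = FALSE forces plain apply(), so future.apply is never
# involved, regardless of any plan() that may be set.
px_seq = supercells(x = stk, step = 20, compactness = 0.5,
                    dist_fun = "jsd", minarea = 16,
                    chunks = 5000, future = FALSE)

# future = TRUE hands the chunks to future_apply(), which honors
# the plan below.
plan(multisession, workers = 12)
px_par = supercells(x = stk, step = 20, compactness = 0.5,
                    dist_fun = "jsd", minarea = 16,
                    chunks = 5000, future = TRUE)
```

Comparing `px_seq` against `px_par` should show whether the misplaced chunks are tied to the parallel path.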
@micha-silver - FYI - I merged a lot of changes to the master branch today. Closing this issue for now.
Thanks. To continue testing, I decided to go back to the master branch and try that (no chunks, no future). And sure enough, I got the same erratic behavior on both Debian machines: sometimes properly placed, more often not. But since this seems to be Debian-specific, closing the issue is fine. Meanwhile, I opened a new issue #409 at Robert Hijmans' rspatial repo.
Attempting to run supercells on a large study area (190 million pixels) caused an out-of-memory error. This was on a server with 128 GB RAM. We split the region into two subregions of about 90 million pixels each, and the process completed successfully in each subregion separately. With even smaller subregions, of about 50 million pixels, the supercells function completed without error on a desktop computer with 16 GB RAM. Here's the function call with parameters that we used:

After doing classification on the superpixels, we'll check the border between the subregions, in case there is a need to merge any superpixels.