Closed: PabRod closed this issue 3 years ago.
As a first step, I've run makeDataCube in its current state for two different time windows:

- Short run: starttime <- c(2001,2,01) and endtime <- c(2001,5,28)
- Long run: starttime <- c(2000,11,01) and endtime <- c(2001,5,28)

Notice that the short run is included in the long run.
My purpose is to check which files are common for all runs, and which are dependent on the time window.
The command du data/ gives a detailed breakdown of the sizes of all folders in data/. Applied to the data generated by the long and the short run, respectively, we get:
16 ./temp
12 ./param
532 ./level2/mosaic
269244 ./level2/n-america/X0075_Y0056
1001936 ./level2/n-america/X0075_Y0055
91508 ./level2/n-america/X0076_Y0056
1027740 ./level2/n-america/X0076_Y0055
2390436 ./level2/n-america
1905520 ./level2/s-america/X0044_Y0020
23788 ./level2/s-america/X0044_Y0019
15292 ./level2/s-america/X0043_Y0019
298100 ./level2/s-america/X0043_Y0020
2242708 ./level2/s-america
4633688 ./level2
4 ./misc/lc
4 ./misc/fire
4 ./misc/S2
4 ./misc/tc
101324 ./misc/dem
8955308 ./misc/meta
3525528 ./misc/wvp
12582180 ./misc
4 ./level1/sentinel
4 ./level1/landsat
271528 ./level1/233059/LE07_L1TP_233059_20001212_20170208_01_T1
163960 ./level1/233059/LT05_L1TP_233059_20010326_20161211_01_T1
129224 ./level1/233059/LT05_L1TP_233059_20010222_20161212_01_T1
221180 ./level1/233059/LE07_L1TP_233059_20010214_20170206_01_T1
231540 ./level1/233059/LE07_L1TP_233059_20010129_20170207_01_T1
235696 ./level1/233059/LE07_L1TP_233059_20010318_20170206_01_T1
111124 ./level1/233059/LT05_L1TP_233059_20010310_20161212_01_T1
152296 ./level1/233059/LT05_L1TP_233059_20001118_20161213_01_T1
144140 ./level1/233059/LT05_L1TP_233059_20010121_20161212_01_T1
239268 ./level1/233059/LE07_L1TP_233059_20001126_20170209_01_T1
260580 ./level1/233059/LE07_L1TP_233059_20001228_20170208_01_T1
2160540 ./level1/233059
2160556 ./level1
48 ./log
19376504 .
28 ./temp
12 ./param
244 ./level2/mosaic
121668 ./level2/n-america/X0075_Y0056
437116 ./level2/n-america/X0075_Y0055
41456 ./level2/n-america/X0076_Y0056
460204 ./level2/n-america/X0076_Y0055
1060452 ./level2/n-america
850548 ./level2/s-america/X0044_Y0020
15448 ./level2/s-america/X0044_Y0019
9240 ./level2/s-america/X0043_Y0019
163160 ./level2/s-america/X0043_Y0020
1038404 ./level2/s-america
2099112 ./level2
4 ./misc/lc
4 ./misc/fire
4 ./misc/S2
4 ./misc/tc
101324 ./misc/dem
8961564 ./misc/meta
3525528 ./misc/wvp
12588436 ./misc
4 ./level1/sentinel
4 ./level1/landsat
163960 ./level1/233059/LT05_L1TP_233059_20010326_20161211_01_T1
129224 ./level1/233059/LT05_L1TP_233059_20010222_20161212_01_T1
221180 ./level1/233059/LE07_L1TP_233059_20010214_20170206_01_T1
235696 ./level1/233059/LE07_L1TP_233059_20010318_20170206_01_T1
111124 ./level1/233059/LT05_L1TP_233059_20010310_20161212_01_T1
861188 ./level1/233059
861204 ./level1
24 ./log
15548820 .
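This kind of comparison can be scripted. Below is a minimal R sketch (the entries are a few lines copied from the reports above; parse_du is a helper written for this example, not part of makeDataCube) that flags which folders changed size between runs:

```r
# Parse `du` output lines ("<size-in-KB> <path>") into a named numeric vector
parse_du <- function(lines) {
  parts <- strsplit(trimws(lines), "\\s+")
  sizes <- as.numeric(vapply(parts, `[[`, character(1), 1))
  names(sizes) <- vapply(parts, `[[`, character(1), 2)
  sizes
}

long_run  <- parse_du(c("16 ./temp", "101324 ./misc/dem", "2160556 ./level1"))
short_run <- parse_du(c("28 ./temp", "101324 ./misc/dem", "861204 ./level1"))

common    <- intersect(names(long_run), names(short_run))
unchanged <- common[long_run[common] == short_run[common]]  # window-independent
changed   <- common[long_run[common] != short_run[common]]  # window-dependent
```

Running the full reports through such a script gives the same classification as the manual comparison below.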
By comparing these two reports, we notice several things:

- misc/dem and misc/wvp are independent of the chosen time window.
- The level1 folders depend on the time window: the short run downloads only a subset of the scenes present in the long run.
- The level2 folders also depend on the time window.
- The temp, misc/meta, and log folders are different in size. Interestingly, temp and misc/meta are larger in the short run.

temp can be safely ignored. misc/meta, typically a couple of csv files with download information, is only used to get the level1 data. log has to be different: it logs things such as download timestamps and errors. We are considering creating a unified log.

Thanks, @wandadk, for the input here!
Good news: this snippet shows that parallel can handle parallelization of system calls:
library(parallel)

# Create a system task
inputs <- 1:6
fun <- function(x) system("sleep 1") # Asks the system to be idle for 1 second

# Run serial and time it
start_time <- Sys.time()
res <- lapply(inputs, fun)
end_time <- Sys.time()
print(end_time - start_time) # Time difference of 6.185004 secs

# Run in parallel and time it (mc.cores > 1 requires a Unix-like OS)
start_time <- Sys.time()
res <- mclapply(inputs, fun, mc.cores = 6)
end_time <- Sys.time()
print(end_time - start_time) # Time difference of 1.066828 secs
A direct approach to parallelization with parallel::mcmapply (as in https://github.com/RETURN-project/makeDataCube/commit/5eed97c52009f785ed8b279e08dbf129fb522d03) causes the following exception:
Downloading Landsat metadata catalogue...
crc32c signature computed for local file (b'E2ZbYw==') doesn't match cloud-supplied digest (b'a5d2hA=='). Local file (/home/pablo/code/makeDCpar/data/misc/meta/index.csv.gz) will be deleted.
CommandException: 1 file/object could not be transferred.
Error: /home/pablo/code/makeDCpar/data/misc/meta/metadata_landsat.csv: Metadata catalogue does not exist.
Use the -u option to download / update the metadata catalogue
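The crc32c mismatch suggests that several workers were downloading the same catalogue file concurrently. One hedged workaround (the names below are illustrative stand-ins, not makeDataCube's API) is to perform the shared, non-thread-safe step once, serially, before parallelizing the independent work:

```r
library(parallel)

# Stand-in for the one-off metadata download (illustrative only)
fetch_catalogue <- function(path) writeLines("metadata catalogue", path)

catalogue <- tempfile()
fetch_catalogue(catalogue) # serial: exactly one writer touches the file

# Workers only read the shared file, so there is no concurrent-download
# corruption of the kind reported above (mc.cores > 1 needs a Unix-like OS)
res <- mclapply(1:4, function(i) readLines(catalogue), mc.cores = 2)
```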
Working prototype merged into master (https://github.com/RETURN-project/makeDataCube/commit/58cc82320e3801676a7894aaf93170b0a8b6fb0a). Notice that it requires a minor tweak in FORCE (info here and in vignettes/make_Landsat_cube.Rmd). The tweak has been incorporated into FORCE since https://github.com/davidfrantz/force/commit/b5685c9b7258d91bcf3a096eee31b7a349f994e6 (at the time of writing, part of the develop branch).
cc @wandadk
Idea

Run several instances of makeDataCube corresponding to different time windows. This will allow further parallelization.

Comments

We could also run instances corresponding to different land patches, but this would be more complicated. The main reason is that patches are 2D surfaces, and thus, their boundaries have a shape. Avoiding overlapping and/or missing points will become tricky.
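A sketch of the window-splitting idea (split_windows is a hypothetical helper written for this example, not part of makeDataCube): divide the full date range into consecutive sub-windows, one per instance:

```r
# Hypothetical helper: split [start, end] into n consecutive sub-windows.
# Boundaries may land mid-day for uneven splits; round as needed in practice.
split_windows <- function(start, end, n) {
  breaks <- seq(as.Date(start), as.Date(end), length.out = n + 1)
  lapply(seq_len(n), function(i) list(start = breaks[i], end = breaks[i + 1]))
}

# E.g. the long run (2000-11-01 to 2001-05-28) split into 3 windows
windows <- split_windows("2000-11-01", "2001-05-28", 3)
```

Note that adjacent windows share a boundary date; whether that produces duplicate scenes depends on how makeDataCube interprets the endpoints.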