RETURN-project / makeDataCube

Parallelize by splitting in the time domain #36

Closed. PabRod closed this issue 3 years ago.

PabRod commented 3 years ago

Idea

Run several instances of makeDataCube corresponding to different time windows. This will allow further parallelization.

Comments

We could also run instances corresponding to different land patches, but this will be more complicated. The main reason is that patches are 2D surfaces, and thus, their boundaries have a shape. Avoiding overlapping and/or missing points will become tricky.
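
Time windows, in contrast, are one-dimensional intervals: splitting them amounts to cutting a date range, and the boundaries are single instants in time. As a purely illustrative sketch (the helper split_time_window and the dates are made up, not part of makeDataCube), the interval could be cut like this:

# Illustrative only: cut the interval [t0, t1] into n consecutive time windows.
# Treating each window as [start, end) avoids double-counting scenes that fall
# exactly on a boundary.
split_time_window <- function(t0, t1, n) {
  breaks <- seq(t0, t1, length.out = n + 1)
  data.frame(start = head(breaks, -1), end = tail(breaks, -1))
}

# Each row could then be handled by its own makeDataCube run
windows <- split_time_window(as.Date("2000-11-01"), as.Date("2001-04-01"), n = 3)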

PabRod commented 3 years ago

First exploration

As a first step, I've run makeDataCube in its current state for two different time windows: a long one and a short one.

Notice that the short time window is contained within the long one.

Purpose

My purpose is to check which files are common to all runs, and which depend on the chosen time window.

Results

The command du data/ gives a detailed breakdown of the sizes of all folders inside data/. When applied to the data generated by the long and the short run, respectively, we get the two reports below (a small R sketch for reproducing the comparison follows them):

Data generated by the long run

16  ./temp
12  ./param
532 ./level2/mosaic
269244  ./level2/n-america/X0075_Y0056
1001936 ./level2/n-america/X0075_Y0055
91508   ./level2/n-america/X0076_Y0056
1027740 ./level2/n-america/X0076_Y0055
2390436 ./level2/n-america
1905520 ./level2/s-america/X0044_Y0020
23788   ./level2/s-america/X0044_Y0019
15292   ./level2/s-america/X0043_Y0019
298100  ./level2/s-america/X0043_Y0020
2242708 ./level2/s-america
4633688 ./level2
4   ./misc/lc
4   ./misc/fire
4   ./misc/S2
4   ./misc/tc
101324  ./misc/dem
8955308 ./misc/meta
3525528 ./misc/wvp
12582180    ./misc
4   ./level1/sentinel
4   ./level1/landsat
271528  ./level1/233059/LE07_L1TP_233059_20001212_20170208_01_T1
163960  ./level1/233059/LT05_L1TP_233059_20010326_20161211_01_T1
129224  ./level1/233059/LT05_L1TP_233059_20010222_20161212_01_T1
221180  ./level1/233059/LE07_L1TP_233059_20010214_20170206_01_T1
231540  ./level1/233059/LE07_L1TP_233059_20010129_20170207_01_T1
235696  ./level1/233059/LE07_L1TP_233059_20010318_20170206_01_T1
111124  ./level1/233059/LT05_L1TP_233059_20010310_20161212_01_T1
152296  ./level1/233059/LT05_L1TP_233059_20001118_20161213_01_T1
144140  ./level1/233059/LT05_L1TP_233059_20010121_20161212_01_T1
239268  ./level1/233059/LE07_L1TP_233059_20001126_20170209_01_T1
260580  ./level1/233059/LE07_L1TP_233059_20001228_20170208_01_T1
2160540 ./level1/233059
2160556 ./level1
48  ./log
19376504    .

Data generated by the short run

28  ./temp
12  ./param
244 ./level2/mosaic
121668  ./level2/n-america/X0075_Y0056
437116  ./level2/n-america/X0075_Y0055
41456   ./level2/n-america/X0076_Y0056
460204  ./level2/n-america/X0076_Y0055
1060452 ./level2/n-america
850548  ./level2/s-america/X0044_Y0020
15448   ./level2/s-america/X0044_Y0019
9240    ./level2/s-america/X0043_Y0019
163160  ./level2/s-america/X0043_Y0020
1038404 ./level2/s-america
2099112 ./level2
4   ./misc/lc
4   ./misc/fire
4   ./misc/S2
4   ./misc/tc
101324  ./misc/dem
8961564 ./misc/meta
3525528 ./misc/wvp
12588436    ./misc
4   ./level1/sentinel
4   ./level1/landsat
163960  ./level1/233059/LT05_L1TP_233059_20010326_20161211_01_T1
129224  ./level1/233059/LT05_L1TP_233059_20010222_20161212_01_T1
221180  ./level1/233059/LE07_L1TP_233059_20010214_20170206_01_T1
235696  ./level1/233059/LE07_L1TP_233059_20010318_20170206_01_T1
111124  ./level1/233059/LT05_L1TP_233059_20010310_20161212_01_T1
861188  ./level1/233059
861204  ./level1
24  ./log
15548820    .
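
For reference, a rough R equivalent of this comparison is sketched below. It sums apparent file sizes, so the numbers will not match du's block counts exactly, and the run directories data_long and data_short are made-up names:

# Purely illustrative: cumulative folder sizes for one run, comparable across runs
folder_sizes <- function(root) {
  dirs <- list.dirs(root, full.names = FALSE, recursive = TRUE)
  sizes <- sapply(dirs, function(d) {
    files <- list.files(file.path(root, d), full.names = TRUE, recursive = TRUE)
    sum(file.size(files), na.rm = TRUE)
  })
  data.frame(dir = dirs, size_kb = round(sizes / 1024), row.names = NULL)
}

sizes_long  <- folder_sizes("data_long")   # data folder of the long run (assumed path)
sizes_short <- folder_sizes("data_short")  # data folder of the short run (assumed path)
merge(sizes_long, sizes_short, by = "dir", suffixes = c("_long", "_short"))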

Conclusions

By comparing these two reports, we notice several things (see the sketch after this list):

  1. misc/dem and misc/wvp are independent of the chosen time window.
  2. A longer time window translates into more, but not different, level1 folders.
  3. A longer time window translates into the same number of level2 folders, but larger ones.
    • The only difference is the number of files contained in those folders.
  4. Additionally, the temp, misc/meta, and log folders differ in size. Interestingly, temp and misc/meta are larger in the short run.
    • The content of temp can be safely ignored.
    • The content of misc/meta, typically a couple of CSV files with download information, is only used to fetch the level1 data.
    • The content of log is necessarily different, since it records things such as download timestamps and errors. We are considering creating a unified log.
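
These observations suggest that combining the output of several per-window runs is mostly a matter of pooling level1 scene folders and level2 files. The sketch below is purely illustrative (the run directories run_window1 and run_window2 are made-up names, and this is not the mechanism implemented in the package):

# Purely illustrative: merge level1/level2 output from per-window runs
runs   <- c("run_window1/data", "run_window2/data")
target <- "data_combined"

for (run in runs) {
  for (lvl in c("level1", "level2")) {
    # level1: each window contributes new scene folders
    # level2: each window contributes extra files inside the same tile folders
    files <- list.files(file.path(run, lvl), recursive = TRUE)
    for (f in files) {
      dest <- file.path(target, lvl, f)
      dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
      file.copy(file.path(run, lvl, f), dest, overwrite = FALSE)
    }
  }
}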

Thanks, @wandadk, for the input here!

PabRod commented 3 years ago

Good news: the snippet below shows that the parallel package can handle parallelization of system calls:

library(parallel)

# Create a system task
inputs <- 1:6
fun <- function(x) system("sleep 1") # Asks the system to be idle for 1 second

# Run serially and time it
start_time <- Sys.time()
res <- lapply(inputs, fun)
end_time <- Sys.time()
print(end_time - start_time) # Time difference of 6.185004 secs

# Run in parallel and time it
start_time <- Sys.time()
res <- mclapply(inputs, fun, mc.cores = 6)
end_time <- Sys.time()
print(end_time - start_time) # Time difference of 1.066828 secs

PabRod commented 3 years ago

A direct approach to parallelization with parallel::mcmapply (as in https://github.com/RETURN-project/makeDataCube/commit/5eed97c52009f785ed8b279e08dbf129fb522d03) causes the following exception:

Downloading Landsat metadata catalogue...
crc32c signature computed for local file (b'E2ZbYw==') doesn't match cloud-supplied digest (b'a5d2hA=='). Local file (/home/pablo/code/makeDCpar/data/misc/meta/index.csv.gz) will be deleted.
CommandException: 1 file/object could not be transferred.

Error: /home/pablo/code/makeDCpar/data/misc/meta/metadata_landsat.csv: Metadata catalogue does not exist.
Use the -u option to download / update the metadata catalogue
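
One mitigation consistent with the observations above (not necessarily the fix that ended up being merged; every function below is a placeholder stub, not part of makeDataCube) is to download the shared metadata catalogue once, serially, before launching the per-window jobs in parallel:

library(parallel)

# Placeholder stubs for illustration only; not functions of this package
update_metadata_catalogue <- function() {
  message("download misc/meta/index.csv.gz here, exactly once")
}
make_cube_for_window <- function(start, end) {
  message("run makeDataCube for the window ", start, " .. ", end)
}

# Example time windows (illustrative dates)
windows <- data.frame(start = as.Date(c("2000-11-01", "2001-01-15")),
                      end   = as.Date(c("2001-01-15", "2001-04-01")))

# Serial step: a single download, so no concurrent writes to misc/meta
update_metadata_catalogue()

# Parallel step: one job per time window
res <- mclapply(seq_len(nrow(windows)),
                function(i) make_cube_for_window(windows$start[i], windows$end[i]),
                mc.cores = nrow(windows))
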
PabRod commented 3 years ago

Working prototype merged to master (https://github.com/RETURN-project/makeDataCube/commit/58cc82320e3801676a7894aaf93170b0a8b6fb0a).

Notice that it requires a minor tweak in FORCE (details in vignettes/make_Landsat_cube.Rmd). The tweak has been incorporated into FORCE as of https://github.com/davidfrantz/force/commit/b5685c9b7258d91bcf3a096eee31b7a349f994e6 (at the moment of writing these lines, part of the develop branch).

cc @wandadk