processProbaVbatch2: Directory name is not prepended

GreatEmerald commented 8 years ago

When I try to run processProbaVbatch2 as per the tutorial, I get no output; with 1 core, I get this error for each of the files to process:

[1] "/20160601/PROBAV_S5_TOC_20160601_100M_V001/PROBAV_S5_TOC_X20Y01_20160601_100M_V001_SM.tifnot processed, no quality flag (SM) file found!"

That's clearly the wrong path, it should be prepended with the directory name (first argument).

Here's the full call I'm making: https://github.com/GreatEmerald/master-classification/blob/master/src/composite-probav.r#L89

Also a minor issue: checking for pattern == "NDVI.tif$" is not ideal. I was trying to use a pattern of glob2rx("*D*I*.tif") (both "RADIOMETRY" and "NDVI") and I'm not sure if that would get handled properly (didn't seem to).

JornDallinga commented 8 years ago

I still need to update the tutorial, hope I can do that soon! Thanks for the heads up!

GreatEmerald commented 7 years ago

I found the problem: https://github.com/JornDallinga/JornDallinga.github.io/blob/master/R/processProbaVbatch2.R#L32 This results in gg being empty. If I replace gg with x2 three lines lower, it seems to work. I'm not really sure what that line is supposed to do to begin with...

JornDallinga commented 7 years ago

The use of this line depends on the directory input. the location to where you read the files from can be either one folder (probably in your case) or can be multiple folders (have to be updated case). Since not al folders available on the ProbaV directory (GeoTiff) are allowed to be used or published by us. I simply wrote this section to ignore folders that dont have the YYYYMMDD structure. My analysis are still running, after its done, I hope to publish the new website, this should fix most of the issues you are encountering.

GreatEmerald commented 7 years ago

Oh, I see. My input is "/data/MTDA/TIFFDERIVED/PROBAV_L3_S5_TOC_100M/", it does not end in a date (I want to process multiple dates, of course). It also ends with a slash, which might throw off some of the checks.

But even with the fix that makes paths correct, I get this error:

evaluation # 3:
$i
[1] "/data/MTDA/TIFFDERIVED/PROBAV_L3_S5_TOC_100M//20160611/PROBAV_S5_TOC_20160611_100M_V001/PROBAV_S5_TOC_X20Y01_20160611_100M_V001_NDVI.tif"

$o
[1] "../../composite/PROBAV_S5_TOC_X20Y01_20160611_100M_V001_NDVI_sm.tif"

...out: ../../composite/PROBAV_S5_TOC_X20Y01_20160611_100M_V001_NDVI_sm.tifreading data...TRUEreading SM...start overlay...result of evaluating expression:
<simpleError in (function (x, fun, filename = "", recycle = TRUE, forcefun = FALSE,     ...) {    ln <- length(x)    if (ln < 1) {        stop("no Rasters")    }    if (ln > 1) {        compareRaster(x)    }    nl <- sapply(x, nlayers)    maxnl <- max(nl)    filename <- trim(filename)    testmat <- NULL    testlst <- vector(length = length(x), mode = "list")    w <- getOption("warn")    options(warn = -1)    for (i in 1:length(testlst)) {        v <- extract(x[[i]], 1:5)        testmat <- cbind(testmat, as.vector(v))        testlst[[i]] <- v    }    options(warn = w)    test1 <- try(apply(testmat, 1, fun), silent = TRUE)    if (class(test1) != "try-error" & (!forcefun)) {        doapply <- TRUE        if (!is.null(dim(test1))) {            test1 <- t(test1)        }        else {            test1 <- matrix(test1, ncol = maxnl)        }        nlout <- NCOL(test1)    }    else {        doapply <- FALSE        dovec <- FALSE        test2 <- try(do.call(fun, testlst), silent = TRUE)        nlout <- length(test2)/5        if (class(test2) == "try-error" | length(test2) < 5) {            dovec <- TRUE            testlst <- lapply(testlst, as.vector)            test3 <- try(do.call(fun, testlst), silent = TRUE)            nlout <- length(test3)/5            if (class(test3) == "try-error" | length(test3) <                 5) {                stop("cannot use this formula, probably because it is not vectorized")            }        }    }    if (nlout == 1) {        out <- raster(x[[1]])    }    else {        out <- brick(x[[1]], values = FALSE, nl = nlout)    }    if (canProcessInMemory(out, sum(nl) + maxnl)) {        pb <- pbCreate(3, label = "overlay", ...)        pbStep(pb, 1)        if (doapply) {            valmat <- matrix(nrow = ncell(out) * maxnl, ncol = length(x))            for (i in 1:length(x)) {                if (ncell(x[[i]]) < nrow(valmat)) {                  options(warn = -1)                  valmat[, i] <- as.vector(getValues(x[[i]])) *                     rep(1, nrow(valmat))                  options(warn = w)                }                else {                  valmat[, i] <- as.vector(getValues(x[[i]]))                }            }            pbStep(pb, 2)            vals <- apply(valmat, 1, fun)            if (!is.null(dim(vals))) {                vals <- t(vals)            }            vals <- matrix(vals, nrow = ncell(out))        }        else {            for (i in 1:length(x)) {                x[[i]] <- getValues(x[[i]])            }            if (dovec) {                x <- lapply(x, as.vector)            }            pbStep(pb, 2)            vals <- do.call(fun, x)            vals <- matrix(vals, nrow = ncell(out))        }        pbStep(pb, 3)        out <- setValues(out, vals)        if (filename != "") {            out <- writeRaster(out, filename = filename, ...)        }        pbClose(pb)        return(out)    }    else {        if (filename == "") {            filename <- rasterTmpFile()        }        out <- writeStart(out, filename = filename, ...)        tr <- blockSize(out, n = sum(nl) + maxnl)        pb <- pbCreate(tr$n, label = "overlay", ...)        if (doapply) {            valmat = matrix(nrow = tr$nrows[1] * ncol(out) *                 maxnl, ncol = length(x))            for (i in 1:tr$n) {                if (i == tr$n) {                  valmat = matrix(nrow = tr$nrows[i] * ncol(out) *                     maxnl, ncol = length(x))                }                for (j in 1:length(x)) {                  v <- as.vector(getValues(x[[j]], row = tr$row[i],                     nrows = tr$nrows[i]))                  if (length(v) < nrow(valmat)) {                    options(warn = -1)                    valmat[, j] <- v * rep(1, nrow(valmat))                    options(warn = w)                  }                  else {                    valmat[, j] <- v                  }                }                vv <- apply(valmat, 1, fun)                if (!is.null(dim(vv))) {                  vals <- t(vv)                }                vv <- matrix(vv, ncol = nlout)                out <- writeValues(out, vv, tr$row[i])                pbStep(pb, i)            }        }        else {            vallist <- list()            for (i in 1:tr$n) {                if (dovec) {                  for (j in 1:length(x)) {                    vallist[[j]] <- as.vector(getValues(x[[j]],                       row = tr$row[i], nrows = tr$nrows[i]))                  }                }                else {                  for (j in 1:length(x)) {                    vallist[[j]] <- getValues(x[[j]], row = tr$row[i],                       nrows = tr$nrows[i])                  }                }                vv <- do.call(fun, vallist)                vv <- matrix(vv, ncol = nlout)                out <- writeValues(out, vv, tr$row[i])                pbStep(pb, i)            }        }        pbClose(pb)        out <- writeStop(out)    }    return(out)})(datatype = "FLT4S", overwrite = FALSE, fun = function (x,     y) {    x[!y %in% QC_val] <- NA    if (!is.null(fill)) {        x[x %in% fill] <- NA    }    return(x)}, filename = "../../composite/PROBAV_S5_TOC_X20Y01_20160611_100M_V001_NDVI_sm.tif",     recycle = TRUE, forcefun = FALSE, x = list(<S4 object of class structure("RasterBrick", package = "raster")>,         <S4 object of class structure("RasterLayer", package = "raster")>)): cannot use this formula, probably because it is not vectorized>
got results for task 3
numValues: 3, numResults: 3, stopped: FALSE
returning status FALSE

That's rather confusing...

JornDallinga commented 7 years ago

hmm, odd, let me see if I can rerun the function with your date and tiles

JornDallinga commented 7 years ago

if I run the function with the following settings

directory <- /data/MTDA/TIFFDERIVED/PROBAV_L3_S5_TOC_100M/20160611" outdir <- "/userdata/sm2" patterns <- c('NDVI.tif$') tiles <- c("X20Y01") QC_val <- getProbaVQClist()$clear_all start_date = "2016-06-11", end_date = "2016-06-11" ncores = 5

I am receiving no errors. btw, if you try to run the RADIOMETRY pattern, you might receive errors about deleting a filename, or that it cannot be found. this has to be fixed with some forking. Because it seems that running this function parallel prevents it from deleting redundant .tif files after the function is done. But the files have been written to disk and can be worked with.

GreatEmerald commented 7 years ago

Hm, but that is a single date. I'm trying to clean all images for the summer. Hence why I used

directory = "/data/MTDA/TIFFDERIVED/PROBAV_L3_S5_TOC_100M/"
start_date = "2016-06-01"
end_date = "2016-08-31"

JornDallinga commented 7 years ago

Ah, you can have multiple folders as input. The new tutorial will show that. Basicly you coerce a slection of folders with the YYYYMMDD structure, and ignoring the folders that we are not allowed to process yet (e.g.2014). The function will then collect all the tif files in your folder selection, depending on your start/end date. Ill try to send that update soon as well. Op 4 nov. 2016 17:58 schreef "Dainius Masiliūnas" <notifications@github.com

:

Hm, but that is a single date. I'm trying to clean all images for the summer. Hence why I used

directory = "/data/MTDA/TIFFDERIVED/PROBAV_L3_S5_TOC_100M/"start_date = "2016-06-01"end_date = "2016-08-31"

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/JornDallinga/JornDallinga.github.io/issues/2#issuecomment-258488014, or mute the thread https://github.com/notifications/unsubscribe-auth/AJ88VfjO8d5E5SL0ujUTwJPSk6v4zyoCks5q62QvgaJpZM4Kok4n .

GreatEmerald commented 7 years ago

So I have to manually get a list of all the directories and pass it to the function? I thought it did that inside the function already, given that getProbaVInfo() call in it does give the correct results... But all right, I'll try using list.dirs().

JornDallinga commented 7 years ago

I will take a quick look at the function tonight. The list.dirs aspect and subsetting YYYYMMDD is a quick and dirty solution to avoid selecting unwanted subfolders (e.g. 2014-2015). But without a doubt, this could be scripted better.

2016-11-07 9:19 GMT+01:00 Dainius Masiliūnas notifications@github.com:

So I have to manually get a list of all the directories and pass it to the function? I thought it did that inside the function already, given that getProbaVInfo() call in it does give the correct results... But all right, I'll try using list.dirs().

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/JornDallinga/JornDallinga.github.io/issues/2#issuecomment-258772167, or mute the thread https://github.com/notifications/unsubscribe-auth/AJ88VZSWVS5zpy8dJNojVUXCa_-14fZiks5q7t72gaJpZM4Kok4n .

GreatEmerald commented 7 years ago

OK, I did get it to work that way. However, I get a memory allocation error when running on all cores. So I suppose some algorithm is needed to either estimate the memory needed from file size+memory already in use and restrict the number of threads, or to handle the out of memory error, waiting until there is more memory free and then retrying.

JornDallinga commented 7 years ago

Agree. Found some nice tips on http://gforge.se/2015/02/how-to-go-parallel-in-r-basics-tips/. Dont think Johannes had time yet to incorporate the memory issue in his package. Let me know if you find some solutions!

GreatEmerald commented 7 years ago

Yes, some changes to Johannes' cleanProbaV() would be useful for this as well. For one to not use the magic filename string _sm.tif but rather have it as an argument with that string as default. And for another to not write the output into five rasters for radiometry. That probably gives quite a bit of memory/disk overhead, and generally is simply not necessary, one file with all layers would work just fine. Then there is also this strange way of loading the raster itself: it's calling readGDAL(as.is=TRUE) which in fact causes it to store the file in memory. I don't really know what as.is does, but this really hurts memory use compared to using plain brick(), which uses no memory whatsoever and is pretty much instant.

With these changes, the function should use no memory at all until the overlay() part. And it turns out that overlay() is quite smart and it checks for enough memory itself (https://github.com/cran/raster/blob/master/R/overlay.R#L107). If it finds that there is not enough memory and that there is an output filename, it will only process a small part of it at once and store everything on the disk.

So in theory, just fixing the cleanProbaV() function should solve the memory issue. Though in practice, the first few overlay() threads will eat all memory, and the next ones will be memory-starved and thus slow. Thus it's still best to do some sort of memory check before running foreach. raster provides nice functions called canProcessInMemory() and blockSize() with which you can try to predict how many threads to actually use. Though the challenge here is that canProcessInMemory() only tells you if it can store n layers of a particular raster(stack/brick), not how many files can be processed (it's also quite simplistic: https://github.com/cran/raster/blob/master/R/canProcessInMemory.R). So I'd imagine that ideally before running foreach you would already have all the files loaded as bricks, calculate the maximum size of them, then compare that against available memory and limit the number of threads accordingly. (Which would mean that the only thing cleanProbaV() does is overlay(), so at that point you might as well call it directly and skip cleanProbaV() altogether.)

GreatEmerald commented 7 years ago

For some statistics: with the current function, I ran out of 32 GiB of RAM when trying to clean 18 radiometry files using 16 cores (it threw an error with 3 files processed). 8 cores failed as well, with 6 files processed. htop showed that each thread used 5 GiB on average, and also consistently showed 9-10 GiB of memory being used by a each thread at one point during processing. So then I tested 3 cores, and it was just enough to not run out of memory. So memory usage could probably be improved, but at least this shows that at the moment limiting the number of cores to RAM/10GiB is about right.

JornDallinga commented 7 years ago

nice testing! Good to know.

JornDallinga / JornDallinga.github.io

processProbaVbatch2: Directory name is not prepended #2