kyle-messier opened 7 months ago
@sigmafelix @eva0marques @mitchellmanware @dzilber @dawranadeep @larapclark
I've been exploring how we can implement pipeline updates when new data become available, while also being smart enough to only run the new data. For example, when calculating covariates with temporality, we only need to calculate the new 6 months and then append.
The targets package has a discussion here. As you can see, someone was trying to do this and the developer said it is out of scope for targets. As I was thinking about how we could construct lists of targets that use various targets or tarchetypes functions to handle the checking, running, and appending, I also came across the documentation of tar_meta here. In the subsection on storage access, you'll see that it is highly advised not to use functions like tar_read or tar_meta within targets, since they are designed to update or be updated.
So where does that leave us? I think we need to implement the check, run, and append steps in our own R functions in the beethoven code base. The approach I suggest, and I think we were already more or less including this information in portions of the pipeline data objects with a site_index, is to create a time_index and location_index variable or metadata object that accompanies created target objects. We could update the site_index to be formatted such that space and time information is discernible, or make new, separate ones. A target object could then easily check whether the requested times and locations are already covered. If additional times or locations are requested by the pipeline, the functions called by targets are designed to read the old data, run the calculation on the new data only, and append the old and new data.
I want to clarify what I said in today's discussion. The picture above depicts the overall structure of the pipeline I'm working on (sorry for the bad handwriting). In my understanding, the pipeline is supposed to represent a static procedure of data analytics; any dynamic components (i.e., periodic updates in raw data) should be declared externally. My configuration CSV file (aka punchcard.csv) serves that role by invalidating nodes when the configuration file is read with meta_run(). This practice can bypass potential issues with pointing to a fixed file path that would not update the pipeline (as discussed in ropensci/targets#136).
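For illustration, the file-tracking part of that idea could look roughly like the sketch below; the target names, punchcard path, and column names are placeholders, and the actual invalidation in beethoven goes through meta_run().

# in _targets.R; path and column names in punchcard.csv are assumed for illustration
library(targets)

list(
  # track the punchcard file itself so that editing it invalidates downstream targets
  tar_target(punchcard_file, "inst/targets/punchcard.csv", format = "file"),
  tar_target(run_config, read.csv(punchcard_file)),
  tar_target(date_range, c(run_config$date_start, run_config$date_end))
)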
One challenge is that we want to update the feature data and the models periodically, which is difficult to do with targets' static pipeline. Given that we have saved a file to another directory (represented as the "external directory" in the figure), the pipeline can be run for the new period simply by editing the configuration file's start and end dates. A new feature data object will be appended if anything exists in the directory we use to keep previous feature files. I think it can be done at the function level, for example:
#' Append previously saved feature data, if any exists
post_append <- function(present_features, path_present_features, dir_features) {
  # list previously saved feature files in the cache directory
  nfiles <- list.files(path = dir_features, full.names = TRUE)
  if (length(nfiles) > 0) {
    # read previous features (qs::qread) and append the current ones
    feats_prev <- lapply(nfiles, qs::qread)
    feats_prev <- append(feats_prev, list(present_features))
    feats_up <- data.table::rbindlist(feats_prev, fill = TRUE)
    return(feats_up)
  } else {
    # first run: save the current features and return them unchanged
    qs::qsave(present_features, path_present_features)
    return(present_features)
  }
}
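A possible way to wire this into a target (the target names and paths below are placeholders, not existing beethoven targets):

tar_target(
  dt_feat_appended,
  post_append(
    present_features = dt_feat_calc,   # hypothetical target with the current period's features
    path_present_features = "output/features/features_20220101_20220630.qs",
    dir_features = "output/features"
  )
)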
Then, how do we know the saved file is from a previous pipeline run? To deal with this issue, we could take the simple approach of naming the saved file accordingly (e.g., including the start and end dates in the file name). We could add a function to beethoven to set up the feature cache directory, validate the naming convention, etc.
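As a rough illustration of such a naming convention (these helpers are hypothetical, not part of beethoven):

# build a cache file name that encodes the covered period
feat_cache_name <- function(start_date, end_date) {
  sprintf(
    "features_%s_%s.qs",
    format(as.Date(start_date), "%Y%m%d"),
    format(as.Date(end_date), "%Y%m%d")
  )
}

# check that a file in the cache directory follows the convention
is_valid_feat_cache <- function(filename) {
  grepl("^features_[0-9]{8}_[0-9]{8}\\.qs$", basename(filename))
}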
The idea I brought up above is half-baked, and I believe similar solutions exist in the targets framework. I will look at the manual and the discussions regarding this issue.
Thanks @sigmafelix. BTW, I think it is a beautiful hand-drawn figure. I think the solution you present here could work. Although, I'd also like to continue to look at tar_files and tar_files_raw, as I think they may be dealing with this issue. To be continued...
Thank you for the visual, Insang, it is very helpful. I see the pipeline the same way, where external editing of the configuration file's dates/years triggers the rest of the pipeline to run.
I think it would be difficult to implement dynamic branching for each of the covariate download/processing/calculation steps, but using it at the dt_combined target may alleviate the need for an external directory at the appending step. The way I understand it, the dynamic branches can be triggered by a target meeting a certain "new features" condition, which we can set with a new function. If this "new features" condition is met, the dynamic branch runs the subsequent targets, which can include a tar_read() of the already existing features onto which the new features are appended. This would not violate the advice against tar_read() within functions, because the function would only trigger the branch to run.
Not sure if I have interpreted the documentation correctly, but I am still reading and working on an example.
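Something along these lines, perhaps; detect_new_periods() and append_features() are hypothetical helpers, and the target names and directory are placeholders.

# in _targets.R
library(targets)

list(
  tar_target(
    new_periods,
    detect_new_periods("output/features")    # labels of periods not yet processed
  ),
  tar_target(
    dt_combined,
    append_features(new_periods, dir_features = "output/features"),
    pattern = map(new_periods)               # one dynamic branch per new period
  )
)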
Thank you @insang. I would personally add an option in the configuration file to give the user the possibility to choose between the following (at the model step):
@sigmafelix Here is a code snippet for a 2-target combination that reads and saves a SpatRaster object in targets, based off some discussion here:
tar_target(
  # the raw OLM clay content files
  olm_clay_files,
  unlist(list.files(
    "/Volumes/set/Projects/PrestoGP_Pesticides/input/OpenLandMapData/Clay_Content/",
    pattern = "\\.tif$",
    full.names = TRUE
  ))
),
tar_target(
  # crop the raw OLM files; stored with a terra-aware custom format
  name = olm_clay_crop,
  command = olm_read_crop(olm_clay_files),
  format = tar_format(
    read = function(path) terra::rast(path),
    write = function(object, path) {
      terra::writeRaster(
        x = object, filename = path, filetype = "GTiff", overwrite = TRUE
      )
    },
    marshal = function(object) terra::wrap(object),
    unmarshal = function(object) terra::unwrap(object)
  )
)
If you try to use a regular format = "file" approach, then the C++ pointers get messed up and there are errors. I'm not entirely sold on the unlist(list.files(.)) approach I have implemented here. Obviously for beethoven you have the punchcard, so no hard-coded paths like I have here.
@kyle-messier Thank you for pointing me to the file-based workflow. geotargets supports an interface for terra objects, which we could use for projects using single files. For beethoven, we use multiple files per target in most cases; that said, I think keeping different recipes/punchcards/blueprints for different periods with format = "file" in tarchetypes::tar_files_input (similar to what we discussed two weeks ago) is a great option for our case.
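A minimal sketch of that idea; the punchcard directory and file naming below are assumptions for illustration.

# in _targets.R
library(targets)
library(tarchetypes)

list(
  # one file target per period-specific punchcard; adding or editing a
  # punchcard only invalidates the branches that depend on it
  tar_files_input(
    punchcards,
    list.files("inst/targets", pattern = "^punchcard_.*\\.csv$", full.names = TRUE),
    format = "file"
  ),
  tar_target(
    period_config,
    read.csv(punchcards),
    pattern = map(punchcards)
  )
)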
@sigmafelix geotargets appears to be made exactly for what I mentioned. It also got me wondering more about the functionality of a SpatRasterCollection. That could be useful.
I didn't follow the conversation, so I apologize if I am off-topic, but I did write functions to convert to and from SpatRasterDataset, which was better suited to adding a time dimension than SpatRasterCollection.
@eva0marques no worries - I'm not as familiar with the specifics of SpatRasterDataset vs. SpatRasterCollection. Perhaps we can discuss briefly at the group meeting today.
Sure :) This is what is explained on the terra website:
A SpatRasterDataset is a collection of sub-datasets, where each is a SpatRaster for the same area (extent) and coordinate reference system, but possibly with a different resolution. Sub-datasets are often used to capture variables (e.g. temperature and precipitation), or a fourth dimension (e.g. height, depth or time) if the sub-datasets already have three dimensions (multiple layers).
A SpatRasterCollection is a collection of SpatRasters with no restriction in the extent or other geometric parameters.
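A quick toy-data sketch of the difference (the rasters and names are arbitrary, not from the beethoven code):

library(terra)

# two rasters on the same (default) extent and CRS, e.g. two variables
r1 <- rast(nrows = 10, ncols = 10, vals = runif(100))
r2 <- rast(nrows = 10, ncols = 10, vals = runif(100))

# SpatRasterDataset: sub-datasets share extent and CRS
sd_obj <- sds(r1, r2)
names(sd_obj) <- c("temperature", "precipitation")

# SpatRasterCollection: no restriction on extent or other geometry
r3 <- rast(xmin = 100, xmax = 110, ymin = 0, ymax = 10,
           nrows = 10, ncols = 10, vals = runif(100))
sc_obj <- sprc(r1, r3)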
Related: future.batchtools functions; crew and crew.cluster; approaches and discussion on how we implement an updatable pipeline when new AQS data becomes available.