Learn the differences between different types of targets

github-learning-lab[bot] commented 4 years ago

remake is the R package that underlies many of scipiper's functions. Here we've borrowed some text from the remake GitHub repo (credit to richfitz, although we've lightly edited the original text) to explain the differences between targets.

Targets

"Targets" are the main things that remake interacts with. They represent things that are made (they're also the vertices of the dependency graph). If you want to make a plot called plot.pdf, then that's a target. If you depend on a dataset called data.csv, that's a target (even if it already exists).

There are several types of targets:

files: The name of a file target is the same as its path. Something is actually stored in the file, and it's possible for the file contents to be modified outside of remake (files are the main type of target that make deals with, since make is language agnostic). Within files, there are two sub-types:
- implicit: these are file targets that are depended on somewhere in your process, but for which no rule to build them exists (i.e., there is no command in a remakefile). You can't build these, of course. However, remake will create an implicit file target for them so it can internally monitor changes to that file.
- explicit: these are the file targets that are built by rules that were defined within your pipeline (i.e., a command-to-target recipe exists in a remakefile).
objects: These are R objects that represent intermediate objects in an analysis. However, these objects are transparently stored to disk so that they persist across R sessions. Unlike actual R objects, though, they won't appear in your workspace, and a little extra work is required to get at them.
fake: Fake targets are simply pointers to other targets (in make these are "phony" targets). The all target, which depends on all the "end points" of your analysis, is a fake target. Running scmake("all") will build all of your targets, or verify that they are up to date.

:keyboard: Activity: Assign yourself to this issue to get started.

I'll sit patiently until you've assigned yourself to this one.

github-learning-lab[bot] commented 4 years ago

More details on object targets

As stated above, object targets are R objects that represent intermediate objects in an analysis.

"R objects" are common in the example pipelines we have shown before. They are distinguished from file targets in the following ways:

The target name does not have a file extension (e.g., ".csv") and resembles an R variable name (because that is basically what the object target is).
The function that creates the target returns data rather than writing to or creating a file, a la write.file(target_file), where write.file stands in for any file-writing function (there are all kinds, including write.csv, cat, write_feather, nc_create, etc.). Data can be returned from a function either because R functions return the value of the last expression evaluated or because the function explicitly specifies what is returned, such as by using return(target_data).
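
To make the second distinction concrete, here is a minimal sketch in base R. The function names (summarize_values, write_summary) and the example data are hypothetical and are not taken from any course pipeline; the only point is the contrast between a function that returns data and one that writes a file.

```r
# Hypothetical function for an OBJECT target: it returns data.
summarize_values <- function(values) {
  # No explicit return() is needed; the value of the last
  # expression evaluated is what the function returns.
  data.frame(n = length(values), mean = mean(values))
}

# Hypothetical function for a FILE target: it writes to the path it is
# given, and the file on disk (not a return value) is the result.
write_summary <- function(target_file, values) {
  summary_data <- summarize_values(values)
  write.csv(summary_data, file = target_file, row.names = FALSE)
}

values <- c(2, 4, 6)

# Object-style: the result lives in an R variable.
summary_object <- summarize_values(values)

# File-style: the result lives in a file (a temporary one for this sketch).
out_file <- file.path(tempdir(), "summary.csv")
write_summary(out_file, values)
```

In a remakefile, the first style would typically pair with a target named like an R variable, and the second with a target named by its file path.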

These objects are often used because they offer a brevity advantage over files (e.g., water_quality_values vs 1_fetch/out/water_quality_values.csv) and preserve the classes and formatting of the data, which makes it a bit easier to keep dates, factors, and other special data types from changing when you write - and then later read in - a file (such as a .csv). Objects also give you the illusion that they aren't taking up space in your project directory and make workspaces look a bit tidier.
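
The point about preserving classes can be checked with a small round-trip experiment. This is only a sketch using base R and made-up data; it compares a .csv round trip with an .rds round trip (the format remake uses behind the scenes for objects).

```r
# Made-up data with a Date column.
original <- data.frame(
  date = as.Date(c("2020-01-01", "2020-01-02")),
  value = c(1.5, 2.5)
)

csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")

# Round trip through a .csv file: dates are written as plain text.
write.csv(original, csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file)

# Round trip through an .rds file: the R object is stored as-is.
saveRDS(original, rds_file)
from_rds <- readRDS(rds_file)

class(from_csv$date)  # no longer "Date" (character or factor, depending on R version)
class(from_rds$date)  # still "Date"
```

With the .csv you would need to convert the column back yourself (e.g., with as.Date) every time the file is read.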

The "illusion" :tophat::rabbit: of objects not taking up space is because behind the scenes, these objects are actually written to file (.rds files, to be specific). You can see what exists under the hood with dir('.remake/objects/data')

And I was able to take a look at that same object referenced in https://github.com/collnell/ds-pipelines-2/issues/2 by using:

readRDS('.remake/objects/data/0e8d236d17d49a764c3fe2aaef0d2491.rds')
$missing_data
[1] "grey90"

$plot_CRS
[1] "+init=epsg:2163"

$wfs
[1] "http://cida.usgs.gov/gdp/geoserver/wfs"

$feature
[1] "derivative:wbdhu8_alb_simp"

$countBins
 [1]    0    1    2    5   10   20   50  100  200  500 1000

(A lot funkier than accessing the data with scmake('map.config') instead, which is what we'd recommend).

:keyboard: Add a comment to this issue so we know you're ready to continue learning

I'll sit patiently until you've added a comment to this issue.

cnell-usgs commented 4 years ago

let's keep going!

github-learning-lab[bot] commented 4 years ago

More details on file targets

File targets are very flexible and, of course, are also easy to share or store elsewhere.

Additionally, many file targets are either language agnostic (e.g., csv, tsv, txt, nc files) or are meant to be shared across languages, such as how the feather format was designed for exchange between R and Python.

When specifying a file target in a remakefile recipe, the path to the file needs to be either absolute or relative to the working directory that the remake.yml file is in.

Most of the guidance you'd see on the remake package would steer you away from using files as targets, since the benefits of files are quite small compared to the advantages of using objects. In fact, one of the edits I made to the background on target types that was borrowed from remake was to remove the statement "With remake though, [file targets] should probably only be the beginning or end points of an analysis". That statement refers to the end products of a pipeline likely being figures, tables, markdown files, or documents (all files) and encourages all other targets to be objects. For reasons that will become clearer in the future, we instead recommend that files be used more liberally than objects, for two reasons: 1) the ability to store data remotely in file format, and 2) ease of collaboration. You'll hear more about this in the intermediate pipelines courses and when you see some more of the team's pipelines in practice.

:keyboard: Activity: Close this issue when you are ready to move on to the next assignment

I'll sit patiently until this issue is closed.

github-learning-lab[bot] commented 4 years ago

cnell-usgs / ds-pipelines-2