cnell-usgs / ds-pipelines-2

https://lab.github.com/USGS-R/scipiper-tips-and-tricks

Learn the differences between different types of targets #5

Closed github-learning-lab[bot] closed 4 years ago

github-learning-lab[bot] commented 4 years ago

remake is the R package that underlies many of scipiper's functions. Here we've borrowed some text from the remake GitHub repo (credit to richfitz, although we've lightly edited the original text) to explain the differences between types of targets.

Targets

"Targets" are the main things that remake interacts with. They represent things that are made (they're also the vertices of the dependency graph). If you want to make a plot called plot.pdf, then that's a target. If you depend on a dataset called data.csv, that's a target (even if it already exists).

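There are several types of targets. The two covered in this issue are object targets (R objects that represent intermediate objects in an analysis) and file targets.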


:keyboard: Activity: Assign yourself to this issue to get started.


I'll sit patiently until you've assigned yourself to this one.

github-learning-lab[bot] commented 4 years ago

More details on object targets

As stated above, object targets are R objects that represent intermediate objects in an analysis.

"R objects" are common in the example pipelines we have shown before. They differ from file targets in several ways.

These objects are often used because they offer a brevity advantage over files (e.g., water_quality_values vs 1_fetch/out/water_quality_values.csv) and preserve the classes and formatting of the data, which makes it a bit easier to keep dates, factors, and other special data types from changing when you write - and then later read in - a file (such as a .csv). Objects also give you the illusion that they aren't taking up space in your project directory and make workspaces look a bit tidier.
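To make the point about classes and formatting concrete, here is a minimal base R sketch that round-trips the same data frame through a .csv and through an .rds file. The data values and column names are made up for illustration (only the name water_quality_values comes from the text above), and this is not how remake itself stores objects, just a comparison of the two file formats.

```r
# Hypothetical example data: a Date column and a factor with a custom level order
water_quality_values <- data.frame(
  sample_date = as.Date(c("2020-01-15", "2020-02-15")),
  site_type = factor(c("stream", "lake"), levels = c("stream", "lake")),
  value = c(7.2, 6.8)
)

csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")

# Round trip through a .csv: column classes are not stored in the file
write.csv(water_quality_values, csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file)

# Round trip through an .rds: the R object is serialized as-is
saveRDS(water_quality_values, rds_file)
from_rds <- readRDS(rds_file)

class(from_csv$sample_date)  # no longer "Date" unless you convert it yourself
class(from_rds$sample_date)  # still "Date"
levels(from_rds$site_type)   # custom level order is kept
```

With the .csv you would need to re-apply as.Date() and factor() after every read, which is the kind of bookkeeping that object targets let you skip.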

The "illusion" :tophat::rabbit: of objects not taking up space is because behind the scenes, these objects are actually written to file (.rds files, to be specific). You can see what exists under the hood with dir('.remake/objects/data')

And I was able to take a look at that same object referenced in https://github.com/collnell/ds-pipelines-2/issues/2 by using

readRDS('.remake/objects/data/0e8d236d17d49a764c3fe2aaef0d2491.rds')
$missing_data
[1] "grey90"

$plot_CRS
[1] "+init=epsg:2163"

$wfs
[1] "http://cida.usgs.gov/gdp/geoserver/wfs"

$feature
[1] "derivative:wbdhu8_alb_simp"

$countBins
 [1]    0    1    2    5   10   20   50  100  200  500 1000

(A lot funkier than accessing the data with scmake('map.config') instead, which is what we'd recommend).
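If you are curious about everything stored under the hood rather than one hash at a time, a small base R helper can summarize the directory. This is only a sketch: summarize_rds_dir is a hypothetical name (it is not part of remake or scipiper), and it assumes every .rds file in the directory can be read with readRDS.

```r
# Hypothetical helper: list each .rds file in a directory with the class
# of the object stored in it and the file size on disk.
summarize_rds_dir <- function(path = ".remake/objects/data") {
  rds_files <- list.files(path, pattern = "\\.rds$", full.names = TRUE)
  data.frame(
    file = basename(rds_files),
    class = vapply(
      rds_files,
      function(f) class(readRDS(f))[1],  # first class only, to keep one value per file
      character(1),
      USE.NAMES = FALSE
    ),
    size_bytes = file.size(rds_files),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Only run against the project if the directory exists (i.e., something has been built)
if (dir.exists(".remake/objects/data")) {
  print(summarize_rds_dir())
}
```

This is handy for seeing how much disk space "invisible" objects really use, but for getting at the data itself the recommended route is still the one above.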


:keyboard: Add a comment to this issue so we know you're ready to continue learning


I'll sit patiently until you've added a comment to this issue.

cnell-usgs commented 4 years ago

let's keep going!

github-learning-lab[bot] commented 4 years ago

More details on file targets

File targets are very flexible and of course, are also easy to share or store elsewhere.

Additionally, many file formats are either language agnostic (e.g., csv, tsv, txt, nc files) or are meant to be shared across languages, such as the feather format, which was designed for exchange between R and Python.

When specifying a file target in a remakefile recipe, the path to the file needs to be either absolute or relative to the working directory that the remake.yml file is in.


Most of the guidance you'd see on the remake package would steer you away from using files as targets, since the benefits of files are quite small compared to the advantages of using objects. In fact, one of the edits I made to the background on target types that was borrowed from remake was to remove the statement "With remake though, [file targets] should probably only be the beginning or end points of an analysis". That statement refers to the end products of a pipeline likely being figures, tables, markdown files, or documents (all files), and it encourages all other targets to be objects. For reasons that will become clearer in the future, we instead recommend that files be used more liberally than objects, for two reasons: 1) the ability to store data remotely in file format, and 2) ease of collaboration. You'll hear more about this in the intermediate pipelines courses and when you see some more of the team's pipelines in practice.


:keyboard: Activity: Close this issue when you are ready to move on to the next assignment


I'll sit patiently until this issue is closed.

github-learning-lab[bot] commented 4 years ago


When you are done poking around, check out the next issue.