The function cr_build_targets() helps set up some boilerplate code to download targets metadata from the specified GCS bucket, run the pipeline and upload the artifacts back to the same bucket. Need some tests to check it respects the right targets skips etc.
cr_build_targets(path = tempfile())

# adding custom environment args and secrets to the build
cr_build_targets(
  task_image = "gcr.io/my-project/my-targets-pipeline",
  options = list(env = c("ENV1=1234",
                         "ENV_USER=Dave")),
  availableSecrets = cr_build_yaml_secrets("MY_PW", "my-pw"),
  task_args = list(secretEnv = "MY_PW")
)
Resulting in this build:
==cloudRunnerYaml==
steps:
- name: gcr.io/google.com/cloudsdktool/cloud-sdk:alpine
  entrypoint: bash
  args:
  - -c
  - gsutil -m cp -r ${_TARGET_BUCKET}/* /workspace/_targets || exit 0
  id: get previous _targets metadata
- name: ubuntu
  args:
  - bash
  - -c
  - ls -lR
  id: debug file list
- name: gcr.io/my-project/my-targets-pipeline
  args:
  - Rscript
  - -e
  - targets::tar_make()
  id: target pipeline
  secretEnv:
  - MY_PW
timeout: 3600s
options:
  env:
  - ENV1=1234
  - ENV_USER=Dave
substitutions:
  _TARGET_BUCKET: gs://mark-edmondson-public-files/googleCloudRunner/_targets
availableSecrets:
  secretManager:
  - versionName: projects/mark-edmondson-gde/secrets/my-pw/versions/latest
    env: MY_PW
artifacts:
  objects:
    location: gs://mark-edmondson-public-files/googleCloudRunner/_targets/meta
    paths:
    - /workspace/_targets/meta/**
Tests are working now which confirm a targets build can reuse a previous build's artifacts, and also rerun if the source is updated: https://github.com/MarkEdmondson1234/googleCloudRunner/pull/159/files
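A rough sketch of the shape of those tests (the real ones are in the PR above; this assumes cr_build_targets() returns a build object that cr_build() accepts):

library(testthat)
library(googleCloudRunner)

test_that("a targets build reuses previous artifacts", {
  skip_on_cran() # needs live GCP auth and a configured bucket

  bd <- cr_build_targets(path = tempfile())

  # first run populates ${_TARGET_BUCKET} with _targets/meta
  b1 <- cr_build_wait(cr_build(bd))
  expect_equal(b1$status, "SUCCESS")

  # second run downloads that meta first, so unchanged targets should
  # be skipped - the build logs show targets' "skip" messages
  b2 <- cr_build_wait(cr_build(bd))
  expect_equal(b2$status, "SUCCESS")
})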
Need two modes(?): one where all target files go through the upcoming GCS integration in targets, which will download artifacts as needed; and one where the data is loaded from other sources (files etc.) kept in a normal GCS bucket.
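For the first mode, the pipeline side might look like this (a sketch assuming the upcoming targets GCS API, tar_resources_gcp() and repository = "gcs"; names may change). The second mode is what cr_build_targets() above already does: sync a normal GCS bucket wholesale.

# _targets.R - targets stores each target's artifacts in GCS itself
library(targets)

tar_option_set(
  repository = "gcs", # assumed name for the upcoming GCS integration
  resources = tar_resources(
    gcp = tar_resources_gcp(
      bucket = "my-targets-bucket",         # hypothetical bucket
      prefix = "googleCloudRunner/_targets" # hypothetical prefix
    )
  )
)

list(
  tar_target(data_file, "data.csv", format = "file"),
  tar_target(data, read.csv(data_file)),
  tar_target(model, summary(data))
)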
Added cr_buildstep_targets() to prep for sending up individual build steps: cr_buildstep_targets_setup() downloads the meta folder, and cr_buildstep_targets_teardown() uploads the changed targets files to the bucket.
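A sketch of how those steps could compose into a build (the argument names here are guesses from the descriptions above, not confirmed signatures):

library(googleCloudRunner)

bucket <- "gs://my-bucket/_targets" # hypothetical bucket folder

bs <- c(
  cr_buildstep_targets_setup(bucket),   # download previous _targets/meta
  cr_buildstep_targets(                 # run targets::tar_make()
    task_image = "gcr.io/my-project/my-targets-pipeline"
  ),
  cr_buildstep_targets_teardown(bucket) # upload changed targets files
)

cr_build_yaml(steps = bs) |> cr_build()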
Getting some feedback here https://github.com/ropensci/targets/issues/720
GCP already available via:
But I think there is an opportunity to move this in a more serverless direction, as the cloud build steps seem to map seamlessly to tar_target() if a way of communicating between the steps can be found. As an example, an equivalent googleCloudRunner-to-targets minimal example is sketched below. Normally I would put all the R steps in one buildstep sourced from a file, but have added readRDS() %>% blah() %>% saveRDS() to illustrate the functionality that I think targets could take care of.
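Something like this (my reconstruction of the kind of example meant, not the original code, using cr_buildstep_r() to run inline R):

library(googleCloudRunner)

# /workspace persists between buildsteps in the same build, so RDS
# files act as the state handoff that targets could manage instead
get_data <- cr_buildstep_r(
  "d <- data.frame(x = rnorm(10)); saveRDS(d, '/workspace/raw.rds')",
  id = "get data"
)
summarise_data <- cr_buildstep_r(
  "readRDS('/workspace/raw.rds') |> summary() |> saveRDS('/workspace/summary.rds')",
  id = "summarise data"
)

the_build <- cr_build_yaml(steps = c(get_data, summarise_data))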
This makes a yaml object that I think maps closely to targets: (more build args here)
Do the build on GCP via the_build |> cr_build(). And/or each buildstep could be its own dedicated cr_build(), with the build's artefacts uploaded/downloaded after each run. This holds several advantages: I see it as a tool that is better than Airflow for visualising DAGs, taking care of state management on whether each node needs to be run, but with a lot of scale by building each step in a cloud environment.
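To illustrate the per-step variant (hypothetical orchestration; cr_build_yaml_artifact() is assumed here for uploading each node's outputs, check the current API):

library(googleCloudRunner)

# run one DAG node as its own build, uploading its RDS outputs as
# build artifacts for downstream nodes to fetch
run_node <- function(r_code, id, bucket) {
  node_build <- cr_build_yaml(
    steps = cr_buildstep_r(r_code, id = id),
    artifacts = cr_build_yaml_artifact("/workspace/*.rds", bucket = bucket)
  )
  cr_build_wait(cr_build(node_build))
}

run_node("saveRDS(rnorm(10), '/workspace/raw.rds')",
         id = "raw", bucket = "my-artifact-bucket")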