DOI-USGS / ds-pipelines-targets-3-course

Many-task pipelines using targets
Other
4 stars 6 forks source link

Splitters & branching #5

Closed lindsayplatt closed 3 years ago

lindsayplatt commented 3 years ago

Still trying to solve a problem where we have one target that contains data for all the states and then we need to use that to fetch data for each state in a branching set up. Stuck because adding states causes the pipeline to rebuild everything if it changes the alphabetic order.

A small, reproducible example. Here is what I currently have for _targets.R:

options(tidyverse.quiet = TRUE)
library(targets)
library(tarchetypes)
library(tidyverse)

list(
  # I tried using `sort()` but that didn't matter.
  tar_target(states, c('MI', 'MN')), # Step 1: Run with this target; works as expected
  # tar_target(states, c('MI', 'MN', 'PA')), # Step 2: Works as expected (MI/MN aren't rebuilt)
  # tar_target(states, c('MI', 'MN', 'PA', 'AL')), # Step 3: REBUILDS ALL OF THEM BC AL IS BUILT FIRST!

  tar_target(state_data, 
    as.tibble(state.x77) %>% 
      mutate(state_nm = rownames(state.x77), 
             state_abb = state.abb) %>% 
      relocate(state_nm, state_abb, .before = Population) %>% 
      filter(state.abb %in% states) %>% 
      # Set up to be a grouped target in the hopes that targets won't rebuild other groups
      group_by(state_abb) %>% 
      tar_group(),
    iteration = "group"
  ),

  tar_target(per_state_task, {
    message(sprintf("Fetching data for %s", unique(state_data$state_nm)))
    state_data$Area
  },
  pattern = map(state_data)
  )
)

Result when using tar_target(states, c('MI', 'MN')) ✔️

> tar_make()
* start target states
* built target states
* start target state_data
* built target state_data
* start branch per_state_task_681e3696
Fetching data for Michigan
* built branch per_state_task_681e3696
* start branch per_state_task_ea91aeb2
Fetching data for Minnesota
* built branch per_state_task_ea91aeb2
* built pattern per_state_task
* end pipeline

Result when using tar_target(states, c('MI', 'MN', 'PA')) ✔️

> tar_make()
* start target states
* built target states
* start target state_data
* built target state_data
v skip branch per_state_task_681e3696
v skip branch per_state_task_ea91aeb2
* start branch per_state_task_c1949ea9
Fetching data for Pennsylvania
* built branch per_state_task_c1949ea9
* built pattern per_state_task
* end pipeline

Result when using tar_target(states, c('MI', 'MN', 'PA', 'AL'))

> tar_make()
* start target states
* built target states
* start target state_data
* built target state_data
* start branch per_state_task_cf209554
Fetching data for Alabama
* built branch per_state_task_cf209554
* start branch per_state_task_94228602
Fetching data for Michigan
* built branch per_state_task_94228602
* start branch per_state_task_7e7176c2
Fetching data for Minnesota
* built branch per_state_task_7e7176c2
* start branch per_state_task_9912688e
Fetching data for Pennsylvania
* built branch per_state_task_9912688e
* built pattern per_state_task
* end pipeline
jzwart commented 3 years ago

I think it's because the tar_group changes when adding states that are earlier in the alphabet. When only building MI and MN, we get tar_groups of 1 and 2, respectively:

> tar_read(state_data)
# A tibble: 2 x 11
# Groups:   state_abb [2]
  state_nm  state_abb Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost  Area tar_group
  <chr>     <chr>          <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl> <dbl>     <int>
1 Michigan  MI              9111   4751        0.9       70.6   11.1      52.8   125 56817         1
2 Minnesota MN              3921   4675        0.6       73.0    2.3      57.6   160 79289         2

MI and MN tar_groups stay the same when adding PA:

> tar_read(state_data)
# A tibble: 3 x 11
# Groups:   state_abb [3]
  state_nm     state_abb Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost  Area tar_group
  <chr>        <chr>          <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl> <dbl>     <int>
1 Michigan     MI              9111   4751        0.9       70.6   11.1      52.8   125 56817         1
2 Minnesota    MN              3921   4675        0.6       73.0    2.3      57.6   160 79289         2
3 Pennsylvania PA             11860   4449        1         70.4    6.1      50.2   126 44966         3

But all are changed when adding AL:

> tar_read(state_data)
# A tibble: 4 x 11
# Groups:   state_abb [4]
  state_nm     state_abb Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost  Area tar_group
  <chr>        <chr>          <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl> <dbl>     <int>
1 Alabama      AL              3615   3624        2.1       69.0   15.1      41.3    20 50708         1
2 Michigan     MI              9111   4751        0.9       70.6   11.1      52.8   125 56817         2
3 Minnesota    MN              3921   4675        0.6       73.0    2.3      57.6   160 79289         3
4 Pennsylvania PA             11860   4449        1         70.4    6.1      50.2   126 44966         4

I think the following is a partial solution. We need to group by some something other than the state abbreviation and by something that preserves the order of the states.

options(tidyverse.quiet = TRUE)
library(targets)
library(tarchetypes)
library(tidyverse)

list(
  # I tried using `sort()` but that didn't matter.
  tar_target(states, c('MI', 'MN')), # Step 1: Run with this target; works as expected
  # tar_target(states, c('MI', 'MN', 'PA')), # Step 2: Works as expected (MI/MN aren't rebuilt)
  # tar_target(states, c('MI', 'MN', 'PA', 'AL')), # Step 3: REBUILDS ALL OF THEM BC AL IS BUILT FIRST!

  # if we don't want previous targets to be rebuilt, then we need to make sure they're order always stays the same;
  # creating this new target that will preserve the order, assuming that the new states are appended onto the end of
  #  the states target.
  tar_target(state_order,
             tibble(state_abb = states,
                    order = seq(1,length(states)))),

  tar_target(state_data,
             as_tibble(state.x77) %>%
               mutate(state_nm = rownames(state.x77),
                      state_abb = state.abb) %>%
               relocate(state_nm, state_abb, .before = Population) %>%
               filter(state_abb %in% states) %>%
               left_join(state_order, by = 'state_abb') %>%
               # Set up to be a grouped target in the hopes that targets won't rebuild other groups
               group_by(order) %>%
               tar_group(),
             iteration = "group"
  ),

  tar_target(per_state_task, {
    message(sprintf("Fetching data for %s", state_data$state_nm))
    state_data$Area
  },
  pattern = map(state_data), iteration = 'list'
  )
)

Now, when building MI and MN, they are tar_groups 1 and 2, as before:

> tar_read(state_data)
# A tibble: 2 x 12
# Groups:   order [2]
  state_nm  state_abb Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost  Area order tar_group
  <chr>     <chr>          <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl> <dbl> <int>     <int>
1 Michigan  MI              9111   4751        0.9       70.6   11.1      52.8   125 56817     1         1
2 Minnesota MN              3921   4675        0.6       73.0    2.3      57.6   160 79289     2         2

When adding PA, we get 1, 2, and 3 as before:

> tar_read(state_data)
# A tibble: 3 x 12
# Groups:   order [3]
  state_nm     state_abb Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost  Area order tar_group
  <chr>        <chr>          <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl> <dbl> <int>     <int>
1 Michigan     MI              9111   4751        0.9       70.6   11.1      52.8   125 56817     1         1
2 Minnesota    MN              3921   4675        0.6       73.0    2.3      57.6   160 79289     2         2
3 Pennsylvania PA             11860   4449        1         70.4    6.1      50.2   126 44966     3         3

Now, when adding AL, we don't rebuild the previous targets because the tar_group hasn't changed for the previous 3 states:

> tar_make(); tar_read(state_data)
* start target states
* built target states
* start target state_order
* built target state_order
* start target state_data
* built target state_data
v skip branch per_state_task_8605e841
v skip branch per_state_task_d19b9c50
v skip branch per_state_task_61c02d68
* start branch per_state_task_96b5da98
Fetching data for Alabama
* built branch per_state_task_96b5da98
* built pattern per_state_task
* end pipeline
# A tibble: 4 x 12
# Groups:   order [4]
  state_nm     state_abb Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost  Area order tar_group
  <chr>        <chr>          <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl> <dbl> <int>     <int>
1 Alabama      AL              3615   3624        2.1       69.0   15.1      41.3    20 50708     4         4
2 Michigan     MI              9111   4751        0.9       70.6   11.1      52.8   125 56817     1         1
3 Minnesota    MN              3921   4675        0.6       73.0    2.3      57.6   160 79289     2         2
4 Pennsylvania PA             11860   4449        1         70.4    6.1      50.2   126 44966     3         3
jzwart commented 3 years ago

The downside to this is that we have to remember to group_by something that preserves order and if we want to remove a state in the middle of the sequence, then the states following that state will be rebuilt. So this solution is only good for adding on new targets, not removing targets

lindsayplatt commented 3 years ago

Wow - thanks for digging deeper. I noticed this behavior a bit when looking at the hash added to the branches. Between running just MI & MN and then adding PA, the hashes next to MI & MN remain the same (I imagine because they are being assigned based on that tar_group column) BUT THEN when you add AL, the hash next to MI & MN are different, so they rebuild.

Mentioned this behavior to Jordan briefly earlier today. It makes me wonder if this is desired targets behavior or if this is worthy of an issue to the repo. To me, it seems like it is a problem and not behaving as one would expect.

jzwart commented 3 years ago

Yeah, I can't tell if it's an 'issue' - I think it might be behaving as intended, but it's not very useful for this situation? Like the targets are getting rebuilt when the hash or tar_group changes and skipped when they don't change. But it's really annoying that the data (e.g. state abbreviations) are not tracked and therefore get rebuilt unnecessarily. At the very least I think adding as a discussion would be good - https://github.com/ropensci/targets/discussions

jzwart commented 3 years ago

Actually, I think the following should work. states needs to be a list:

options(tidyverse.quiet = TRUE)
library(targets)
library(tarchetypes)
library(tidyverse)

list(
  # I tried using `sort()` but that didn't matter.
  tar_target(states, list('MI', 'MN')), # Step 1: Run with this target; works as expected
  # tar_target(states, c('MI', 'MN', 'PA')), # Step 2: Works as expected (MI/MN aren't rebuilt)
  # tar_target(states, c('MI', 'MN', 'PA', 'AL')), # Step 3: REBUILDS ALL OF THEM BC AL IS BUILT FIRST!

  tar_target(state_data,
             as_tibble(state.x77) %>%
               mutate(state_nm = rownames(state.x77),
                      state_abb = state.abb) %>%
               relocate(state_nm, state_abb, .before = Population) %>%
               filter(state_abb %in% states[[1]]),
             pattern = map(states),
             iteration = "list"
  ),

  tar_target(per_state_task, {
    message(sprintf("Fetching data for %s", state_data$state_nm))
    state_data$Area
  },
  pattern = map(state_data), iteration = 'list'
  )
)

seems to work for both adding new states and removing states in the list

jordansread commented 3 years ago

ahh, that's because order is retained in a list but not a vector? That seems like a bug if one data type is sorted internally and the other is not.

lindsayplatt commented 3 years ago

Huh wow, that did work but I removed the [[1]] next to filter(state_abb %in% states[[1]]).

Interesting though that there wouldn't be a way to use this approach with the group_by() + tar_group() approach since that seems to be one of the preferred methods for working with data.frames.

jzwart commented 3 years ago

that's because order is retained in a list but not a vector?

I think it's because the group_by iteration is pointing to the tar_group, whose data are changing based on the states supplied, while the list iteration is creating a new state_data target if it hasn't seen the state abbreviation before and these data do not change for a given state.

but now I'm second guessing myself...

lindsayplatt commented 3 years ago

Dug a bit more and it is actually not changing c() to list(). It appears to be more to do with using iteration='list' rather than iteration='group'. I did start a discussion based on these findings in targets here: https://github.com/ropensci/targets/discussions/507

lindsayplatt commented 3 years ago

BIG NEWS! Finally overcame the branching bug. It was first, truly a bug that was fixed in targets per my open discussion linked above (!!!) but then replacing my dummy example state_data target with code that used the function find_oldest_sites() to query NWIS was still causing rebuilds every time (not just due to alphabetic issues).

I finally realized that it is because that step is re-inventorying all the states when a new one is added to the states target (as coded). BUT because dataRetrieval returns attributes to the data that mark the query time, targets marked all inventories as different every time. When I add these the lines below to the function that fetches individual state queries, the branching works as expected!

# Remove attributes that include query times, otherwise `targets` will think it changed
  # when it rebuilds when a new state is added since the inventory builds all states again
  # using `find_oldest_sites()`
  attr(sites, "comment") <- NULL
  attr(sites, "queryTime") <- NULL
  attr(sites, "headerInfo") <- NULL

Can't believe it took me this long 🤯

aappling-usgs commented 3 years ago

select() or other column selection mechanisms drop attributes. the old scipiper training code has a call to select() that I bet was taking care of this issue for us:

> df <- data.frame(x=1:3, y=letters[1:3])
> attr(df, 'datetime') <- Sys.time()

> str(df)
'data.frame':    3 obs. of  2 variables:
  $ x: int  1 2 3
  $ y: chr  "a" "b" "c"
  - attr(*, "datetime")= POSIXct[1:1], format: "2021-06-25 09:08:39"

> df %>% select(x,y) %>% str
'data.frame':    3 obs. of  2 variables:
  $ x: int  1 2 3
  $ y: chr  "a" "b" "c"

> df[,c('x','y')] %>% str
'data.frame':    3 obs. of  2 variables:
  $ x: int  1 2 3
  $ y: chr  "a" "b" "c"
lindsayplatt commented 3 years ago

Okayyyyyyyy where do I start? At one point during my testing I added a harmless as_tibble() to the end of the whatNWISdata() call. THIS is what is actually causing the rebuilds. I happened to delete it at the same time I was testing the attr() theory (rookie mistake) and thus that that I had a break through. I don't yet have a perfectly reproducible example to show (working on that), but sending a self-contained _targets.R to illustrate the issue.

The _targets.R file

options(tidyverse.quiet = TRUE)

library(targets)
library(tidyverse)

tar_option_set(packages = c("tidyverse"))

find_oldest_site <- function(state) {
  message(sprintf('  Inventorying sites in %s', state))
  sites <- dataRetrieval::whatNWISdata(
    parameterCd='00060', stateCd=state, hasDataType='dv', agencyCd='USGS',
    siteType='ST', siteStatus='active',
    startDt=format(Sys.Date()-as.difftime(7, units='days'), '%Y-%m-%d'),
    endDt=format(Sys.Date()-as.difftime(1, units='days'), '%Y-%m-%d'))

  # UNCOMMENT ME TO SEE HOW REBUILDS KICK OFF FOR WI WHEN THEY SHOULDN'T
  # sites <- as_tibble(sites) 

  best_site <- sites %>%
    filter(stat_cd == '00003', data_type_cd == 'dv') %>%
    filter(count_nu == max(count_nu)) %>%
    mutate(state_cd = state, alt_acy_va = as.character(alt_acy_va)) %>%
    select(state_cd, site_no, station_nm, dec_lat_va, dec_long_va, 
           dec_coord_datum_cd, begin_date, end_date, count_nu) %>%
    slice(1) # OK has two sites tied for first; just take one
  return(best_site)
}

list(
  tar_target(states1, sort(c('WI'))),
  tar_target(states2, sort(c('WI', 'MN'))),

  tar_target(states, states1), # SWITCH THIS OUT BETWEEN `tar_make()`

  # target that gets all state data
  tar_target(all_state_data,
             purrr::map_df(states, find_oldest_site) %>%
               group_by(state_cd) %>%
               tar_group(),
             iteration = "group"),

  # target that uses each state's data to do a long internet fetch
  tar_target(nwis_data,
             {
               message(sprintf("Fetching data for %s", unique(all_state_data$state_cd)))
               return(all_state_data$state_cd)
             },
             pattern = map(all_state_data))

)

Results when running

tar_make() with states1 and no as_tibble()

WI is inventoried AND fetched as expected

> tar_make() # With `states1` and no `as_tibble()`
* start target states1
* built target states1
* start target states2
* built target states2
* start target states
* built target states
* start target all_state_data
  Inventorying sites in WI
* built target all_state_data
* start branch nwis_data_977127fc
Fetching data for WI
* built branch nwis_data_977127fc
* built pattern nwis_data
* end pipeline

tar_make() with states2 and no as_tibble()

WI is inventoried BUT NOT fetched as expected MN is inventoried AND fetched as expected

> tar_make() # With `states2` and no `as_tibble()`
v skip target states2
v skip target states1
* start target states
* built target states
* start target all_state_data
  Inventorying sites in MN
  Inventorying sites in WI
* built target all_state_data
* start branch nwis_data_d28331e8
Fetching data for MN
* built branch nwis_data_d28331e8
v skip branch nwis_data_977127fc
* built pattern nwis_data
* end pipeline

tar_make() with states1 and includes as_tibble() (run tar_destroy() to reset the build)

WI is inventoried AND fetched as expected

> tar_make() # With `states1` and includes `as_tibble()`
* start target states1
* built target states1
* start target states2
* built target states2
* start target states
* built target states
* start target all_state_data
  Inventorying sites in WI
* built target all_state_data
* start branch nwis_data_c75112fe
Fetching data for WI
* built branch nwis_data_c75112fe
* built pattern nwis_data
* end pipeline

tar_make() with states2 and includes as_tibble()

WI is inventoried AND fetched NOT as expected (WRONG!) MN is inventoried AND fetched as expected

> tar_make() # With `states2` and includes `as_tibble()`
v skip target states2
v skip target states1
* start target states
* built target states
* start target all_state_data
  Inventorying sites in MN
  Inventorying sites in WI
* built target all_state_data
* start branch nwis_data_4024695c
Fetching data for MN
* built branch nwis_data_4024695c
* start branch nwis_data_c3e7d1df
Fetching data for WI
* built branch nwis_data_c3e7d1df
* built pattern nwis_data
* end pipeline
aappling-usgs commented 3 years ago

I think it's still an attributes issue. Attributes appear to intersect with as_tibble(), where calling as_tibble() preserves attributes through the filter, mutate, select, and slice operations that follow in your example (where I'll still tentatively pin the blame on select for now based on my earlier exploration, though multiple operations might be responsible).

Without as_tibble(), attributes get discarded.

> tar_make() # states1, no as_tibble()
● start target states1
● built target states1
● start target states2
● built target states2
● start target states
● built target states
● start target all_state_data
  Inventorying sites in WI
● built target all_state_data
● start branch nwis_data_f21055a5
Fetching data for WI
● built branch nwis_data_f21055a5
● built pattern nwis_data
● end pipeline

> str(tar_read(all_state_data))
tibble [1 × 10] (S3: tbl_df/tbl/data.frame)
 $ state_cd          : chr "WI"
 $ site_no           : chr "04073500"
 $ station_nm        : chr "FOX RIVER AT BERLIN, WI"
 $ dec_lat_va        : num 44
 $ dec_long_va       : num -89
 $ dec_coord_datum_cd: chr "NAD83"
 $ begin_date        : Date[1:1], format: "1898-01-01"
 $ end_date          : Date[1:1], format: "2021-06-24"
 $ count_nu          : num 45096
 $ tar_group         : int 1

With as_tibble(), attributes persist.

> tar_make() # states1, yes as_tibble()
✓ skip target states1
✓ skip target states2
✓ skip target states
● start target all_state_data
  Inventorying sites in WI
● built target all_state_data
● start branch nwis_data_f9909b60
Fetching data for WI
● built branch nwis_data_f9909b60
● built pattern nwis_data
● end pipeline

> str(tar_read(all_state_data))
tibble [1 × 10] (S3: tbl_df/tbl/data.frame)
 $ state_cd          : chr "WI"
 $ site_no           : chr "04073500"
 $ station_nm        : chr "FOX RIVER AT BERLIN, WI"
 $ dec_lat_va        : num 44
 $ dec_long_va       : num -89
 $ dec_coord_datum_cd: chr "NAD83"
 $ begin_date        : Date[1:1], format: "1898-01-01"
 $ end_date          : Date[1:1], format: "2021-06-24"
 $ count_nu          : num 45096
 $ tar_group         : int 1
 - attr(*, "comment")= chr [1:42] "#" "#" "# US Geological Survey" "# retrieved: 2021-06-25 13:12:47 -04:00\t(caas01)" ...
 - attr(*, "queryTime")= POSIXct[1:1], format: "2021-06-25 13:12:48"
 - attr(*, "url")= chr "https://waterservices.usgs.gov/nwis/site/?seriesCatalogOutput=true&parameterCd=00060&stateCd=WI&hasDataType=dv&"| __truncated__
 - attr(*, "headerInfo")=List of 12
  ..$ date                       : chr "Fri, 25 Jun 2021 17:12:47 GMT"
  ..$ server                     : chr "Apache-Coyote/1.1"
  ..$ strict-transport-security  : chr "max-age=31536000"
  ..$ vary                       : chr "Accept-Encoding"
  ..$ content-encoding           : chr "gzip"
  ..$ content-type               : chr "text/plain;charset=UTF-8"
  ..$ cache-control              : chr "max-age=900"
  ..$ expires                    : chr "Fri, 25 Jun 2021 17:27:47 GMT"
  ..$ x-ua-compatible            : chr "IE=edge,chrome=1"
  ..$ access-control-allow-origin: chr "*"
  ..$ x-frame-options            : chr "deny"
  ..$ transfer-encoding          : chr "chunked"
  ..- attr(*, "class")= chr [1:2] "insensitive" "list"
lindsayplatt commented 3 years ago

Oh yes, here is a really simple example of that:

df <- data.frame(x = 1:6, y = letters[1:6], stringsAsFactors = FALSE)
attr(df, "testattr") <- "Hello, I am an attribute"

df %>% 
  as_tibble() %>%
  select(x) %>% 
  str()
tibble [6 x 1] (S3: tbl_df/tbl/data.frame)
 $ x: int [1:6] 1 2 3 4 5 6
 - attr(*, "testattr")= chr "Hello, I am an attribute"

df %>% 
  select(x) %>% 
  str()
'data.frame':   6 obs. of  1 variable:
 $ x: int  1 2 3 4 5 6
lindsayplatt commented 3 years ago

So because find_oldest_site() had a select() call later in the function, doing as_tibble() before meant that select didn't cause the attributes to drop. It is weird to me that some dplyr verbs drop attributes and some don't. filter() doesn't have this affect:

df %>% 
  filter(x != 3) %>% 
  str()

'data.frame':   5 obs. of  2 variables:
 $ x: int  1 2 4 5 6
 $ y: chr  "a" "b" "d" "e" ...
 - attr(*, "testattr")= chr "Hello, I am an attribute"
aappling-usgs commented 3 years ago

Yes, I was also surprised to find that only select does it (of the verbs I've tried so far, anyway). I think explicitly dropping attributes may be a good idea anyway because then we'll have a visible reminder that dataRetrieval includes attributes. Or at least a comment about how you're achieving the dropping of attributes.

lindsayplatt commented 3 years ago

I guess I just use select() so often that I never noticed this impact. I might think about adding to the "tips and tricks" (Pipelines II) training somewhere.

lindsayplatt commented 3 years ago

Made this obvious in the template repo via https://github.com/USGS-R/ds-pipelines-targets-3-template/commit/bde19d7fc9070925a71589ea04ab26934adb07b5

lindsayplatt commented 3 years ago

Calling this done-zo since the actual issues have been identified and fixed/coded around.