Closed lindsayplatt closed 3 years ago
I think it's because the tar_group
changes when adding states that are earlier in the alphabet. When only building MI
and MN
, we get tar_groups
of 1
and 2
, respectively:
> tar_read(state_data)
# A tibble: 2 x 11
# Groups: state_abb [2]
state_nm state_abb Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area tar_group
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Michigan MI 9111 4751 0.9 70.6 11.1 52.8 125 56817 1
2 Minnesota MN 3921 4675 0.6 73.0 2.3 57.6 160 79289 2
MI
and MN
tar_groups
stay the same when adding PA
:
> tar_read(state_data)
# A tibble: 3 x 11
# Groups: state_abb [3]
state_nm state_abb Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area tar_group
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Michigan MI 9111 4751 0.9 70.6 11.1 52.8 125 56817 1
2 Minnesota MN 3921 4675 0.6 73.0 2.3 57.6 160 79289 2
3 Pennsylvania PA 11860 4449 1 70.4 6.1 50.2 126 44966 3
But all are changed when adding AL
:
> tar_read(state_data)
# A tibble: 4 x 11
# Groups: state_abb [4]
state_nm state_abb Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area tar_group
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Alabama AL 3615 3624 2.1 69.0 15.1 41.3 20 50708 1
2 Michigan MI 9111 4751 0.9 70.6 11.1 52.8 125 56817 2
3 Minnesota MN 3921 4675 0.6 73.0 2.3 57.6 160 79289 3
4 Pennsylvania PA 11860 4449 1 70.4 6.1 50.2 126 44966 4
I think the following is a partial solution. We need to group by some something other than the state abbreviation and by something that preserves the order of the states.
options(tidyverse.quiet = TRUE)
library(targets)
library(tarchetypes)
library(tidyverse)
list(
# I tried using `sort()` but that didn't matter.
tar_target(states, c('MI', 'MN')), # Step 1: Run with this target; works as expected
# tar_target(states, c('MI', 'MN', 'PA')), # Step 2: Works as expected (MI/MN aren't rebuilt)
# tar_target(states, c('MI', 'MN', 'PA', 'AL')), # Step 3: REBUILDS ALL OF THEM BC AL IS BUILT FIRST!
# if we don't want previous targets to be rebuilt, then we need to make sure they're order always stays the same;
# creating this new target that will preserve the order, assuming that the new states are appended onto the end of
# the states target.
tar_target(state_order,
tibble(state_abb = states,
order = seq(1,length(states)))),
tar_target(state_data,
as_tibble(state.x77) %>%
mutate(state_nm = rownames(state.x77),
state_abb = state.abb) %>%
relocate(state_nm, state_abb, .before = Population) %>%
filter(state_abb %in% states) %>%
left_join(state_order, by = 'state_abb') %>%
# Set up to be a grouped target in the hopes that targets won't rebuild other groups
group_by(order) %>%
tar_group(),
iteration = "group"
),
tar_target(per_state_task, {
message(sprintf("Fetching data for %s", state_data$state_nm))
state_data$Area
},
pattern = map(state_data), iteration = 'list'
)
)
Now, when building MI
and MN
, they are tar_groups
1
and 2
, as before:
> tar_read(state_data)
# A tibble: 2 x 12
# Groups: order [2]
state_nm state_abb Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area order tar_group
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 Michigan MI 9111 4751 0.9 70.6 11.1 52.8 125 56817 1 1
2 Minnesota MN 3921 4675 0.6 73.0 2.3 57.6 160 79289 2 2
When adding PA
, we get 1
, 2
, and 3
as before:
> tar_read(state_data)
# A tibble: 3 x 12
# Groups: order [3]
state_nm state_abb Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area order tar_group
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 Michigan MI 9111 4751 0.9 70.6 11.1 52.8 125 56817 1 1
2 Minnesota MN 3921 4675 0.6 73.0 2.3 57.6 160 79289 2 2
3 Pennsylvania PA 11860 4449 1 70.4 6.1 50.2 126 44966 3 3
Now, when adding AL
, we don't rebuild the previous targets because the tar_group
hasn't changed for the previous 3 states:
> tar_make(); tar_read(state_data)
* start target states
* built target states
* start target state_order
* built target state_order
* start target state_data
* built target state_data
v skip branch per_state_task_8605e841
v skip branch per_state_task_d19b9c50
v skip branch per_state_task_61c02d68
* start branch per_state_task_96b5da98
Fetching data for Alabama
* built branch per_state_task_96b5da98
* built pattern per_state_task
* end pipeline
# A tibble: 4 x 12
# Groups: order [4]
state_nm state_abb Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area order tar_group
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 Alabama AL 3615 3624 2.1 69.0 15.1 41.3 20 50708 4 4
2 Michigan MI 9111 4751 0.9 70.6 11.1 52.8 125 56817 1 1
3 Minnesota MN 3921 4675 0.6 73.0 2.3 57.6 160 79289 2 2
4 Pennsylvania PA 11860 4449 1 70.4 6.1 50.2 126 44966 3 3
The downside to this is that we have to remember to group_by something that preserves order and if we want to remove a state in the middle of the sequence, then the states following that state will be rebuilt. So this solution is only good for adding on new targets, not removing targets
Wow - thanks for digging deeper. I noticed this behavior a bit when looking at the hash added to the branches. Between running just MI & MN and then adding PA, the hashes next to MI & MN remain the same (I imagine because they are being assigned based on that tar_group
column) BUT THEN when you add AL, the hash next to MI & MN are different, so they rebuild.
Mentioned this behavior to Jordan briefly earlier today. It makes me wonder if this is desired targets
behavior or if this is worthy of an issue to the repo. To me, it seems like it is a problem and not behaving as one would expect.
Yeah, I can't tell if it's an 'issue' - I think it might be behaving as intended, but it's not very useful for this situation? Like the targets are getting rebuilt when the hash or tar_group changes and skipped when they don't change. But it's really annoying that the data (e.g. state abbreviations) are not tracked and therefore get rebuilt unnecessarily. At the very least I think adding as a discussion would be good - https://github.com/ropensci/targets/discussions
Actually, I think the following should work. states
needs to be a list:
options(tidyverse.quiet = TRUE)
library(targets)
library(tarchetypes)
library(tidyverse)
list(
# I tried using `sort()` but that didn't matter.
tar_target(states, list('MI', 'MN')), # Step 1: Run with this target; works as expected
# tar_target(states, c('MI', 'MN', 'PA')), # Step 2: Works as expected (MI/MN aren't rebuilt)
# tar_target(states, c('MI', 'MN', 'PA', 'AL')), # Step 3: REBUILDS ALL OF THEM BC AL IS BUILT FIRST!
tar_target(state_data,
as_tibble(state.x77) %>%
mutate(state_nm = rownames(state.x77),
state_abb = state.abb) %>%
relocate(state_nm, state_abb, .before = Population) %>%
filter(state_abb %in% states[[1]]),
pattern = map(states),
iteration = "list"
),
tar_target(per_state_task, {
message(sprintf("Fetching data for %s", state_data$state_nm))
state_data$Area
},
pattern = map(state_data), iteration = 'list'
)
)
seems to work for both adding new states and removing states in the list
ahh, that's because order is retained in a list but not a vector? That seems like a bug if one data type is sorted internally and the other is not.
Huh wow, that did work but I removed the [[1]]
next to filter(state_abb %in% states[[1]])
.
Interesting though that there wouldn't be a way to use this approach with the group_by() + tar_group() approach since that seems to be one of the preferred methods for working with data.frames.
that's because order is retained in a list but not a vector?
I think it's because the group_by iteration is pointing to the tar_group
, whose data are changing based on the states supplied, while the list iteration is creating a new state_data target if it hasn't seen the state abbreviation before and these data do not change for a given state.
but now I'm second guessing myself...
Dug a bit more and it is actually not changing c()
to list()
. It appears to be more to do with using iteration='list'
rather than iteration='group'
. I did start a discussion based on these findings in targets
here: https://github.com/ropensci/targets/discussions/507
BIG NEWS! Finally overcame the branching bug. It was first, truly a bug that was fixed in targets per my open discussion linked above (!!!) but then replacing my dummy example state_data
target with code that used the function find_oldest_sites()
to query NWIS was still causing rebuilds every time (not just due to alphabetic issues).
I finally realized that it is because that step is re-inventorying all the states when a new one is added to the states
target (as coded). BUT because dataRetrieval
returns attributes to the data that mark the query time, targets
marked all inventories as different every time. When I add these the lines below to the function that fetches individual state queries, the branching works as expected!
# Remove attributes that include query times, otherwise `targets` will think it changed
# when it rebuilds when a new state is added since the inventory builds all states again
# using `find_oldest_sites()`
attr(sites, "comment") <- NULL
attr(sites, "queryTime") <- NULL
attr(sites, "headerInfo") <- NULL
Can't believe it took me this long 🤯
select()
or other column selection mechanisms drop attributes. the old scipiper training code has a call to select()
that I bet was taking care of this issue for us:
> df <- data.frame(x=1:3, y=letters[1:3])
> attr(df, 'datetime') <- Sys.time()
> str(df)
'data.frame': 3 obs. of 2 variables:
$ x: int 1 2 3
$ y: chr "a" "b" "c"
- attr(*, "datetime")= POSIXct[1:1], format: "2021-06-25 09:08:39"
> df %>% select(x,y) %>% str
'data.frame': 3 obs. of 2 variables:
$ x: int 1 2 3
$ y: chr "a" "b" "c"
> df[,c('x','y')] %>% str
'data.frame': 3 obs. of 2 variables:
$ x: int 1 2 3
$ y: chr "a" "b" "c"
Okayyyyyyyy where do I start? At one point during my testing I added a harmless as_tibble()
to the end of the whatNWISdata()
call. THIS is what is actually causing the rebuilds. I happened to delete it at the same time I was testing the attr()
theory (rookie mistake) and thus that that I had a break through. I don't yet have a perfectly reproducible example to show (working on that), but sending a self-contained _targets.R
to illustrate the issue.
_targets.R
fileoptions(tidyverse.quiet = TRUE)
library(targets)
library(tidyverse)
tar_option_set(packages = c("tidyverse"))
find_oldest_site <- function(state) {
message(sprintf(' Inventorying sites in %s', state))
sites <- dataRetrieval::whatNWISdata(
parameterCd='00060', stateCd=state, hasDataType='dv', agencyCd='USGS',
siteType='ST', siteStatus='active',
startDt=format(Sys.Date()-as.difftime(7, units='days'), '%Y-%m-%d'),
endDt=format(Sys.Date()-as.difftime(1, units='days'), '%Y-%m-%d'))
# UNCOMMENT ME TO SEE HOW REBUILDS KICK OFF FOR WI WHEN THEY SHOULDN'T
# sites <- as_tibble(sites)
best_site <- sites %>%
filter(stat_cd == '00003', data_type_cd == 'dv') %>%
filter(count_nu == max(count_nu)) %>%
mutate(state_cd = state, alt_acy_va = as.character(alt_acy_va)) %>%
select(state_cd, site_no, station_nm, dec_lat_va, dec_long_va,
dec_coord_datum_cd, begin_date, end_date, count_nu) %>%
slice(1) # OK has two sites tied for first; just take one
return(best_site)
}
list(
tar_target(states1, sort(c('WI'))),
tar_target(states2, sort(c('WI', 'MN'))),
tar_target(states, states1), # SWITCH THIS OUT BETWEEN `tar_make()`
# target that gets all state data
tar_target(all_state_data,
purrr::map_df(states, find_oldest_site) %>%
group_by(state_cd) %>%
tar_group(),
iteration = "group"),
# target that uses each state's data to do a long internet fetch
tar_target(nwis_data,
{
message(sprintf("Fetching data for %s", unique(all_state_data$state_cd)))
return(all_state_data$state_cd)
},
pattern = map(all_state_data))
)
states1
and no as_tibble()
WI is inventoried AND fetched as expected
> tar_make() # With `states1` and no `as_tibble()`
* start target states1
* built target states1
* start target states2
* built target states2
* start target states
* built target states
* start target all_state_data
Inventorying sites in WI
* built target all_state_data
* start branch nwis_data_977127fc
Fetching data for WI
* built branch nwis_data_977127fc
* built pattern nwis_data
* end pipeline
states2
and no as_tibble()
WI is inventoried BUT NOT fetched as expected MN is inventoried AND fetched as expected
> tar_make() # With `states2` and no `as_tibble()`
v skip target states2
v skip target states1
* start target states
* built target states
* start target all_state_data
Inventorying sites in MN
Inventorying sites in WI
* built target all_state_data
* start branch nwis_data_d28331e8
Fetching data for MN
* built branch nwis_data_d28331e8
v skip branch nwis_data_977127fc
* built pattern nwis_data
* end pipeline
states1
and includes as_tibble()
(run tar_destroy()
to reset the build)WI is inventoried AND fetched as expected
> tar_make() # With `states1` and includes `as_tibble()`
* start target states1
* built target states1
* start target states2
* built target states2
* start target states
* built target states
* start target all_state_data
Inventorying sites in WI
* built target all_state_data
* start branch nwis_data_c75112fe
Fetching data for WI
* built branch nwis_data_c75112fe
* built pattern nwis_data
* end pipeline
states2
and includes as_tibble()
WI is inventoried AND fetched NOT as expected (WRONG!) MN is inventoried AND fetched as expected
> tar_make() # With `states2` and includes `as_tibble()`
v skip target states2
v skip target states1
* start target states
* built target states
* start target all_state_data
Inventorying sites in MN
Inventorying sites in WI
* built target all_state_data
* start branch nwis_data_4024695c
Fetching data for MN
* built branch nwis_data_4024695c
* start branch nwis_data_c3e7d1df
Fetching data for WI
* built branch nwis_data_c3e7d1df
* built pattern nwis_data
* end pipeline
I think it's still an attributes issue. Attributes appear to intersect with as_tibble()
, where calling as_tibble()
preserves attributes through the filter
, mutate
, select
, and slice
operations that follow in your example (where I'll still tentatively pin the blame on select
for now based on my earlier exploration, though multiple operations might be responsible).
Without as_tibble()
, attributes get discarded.
> tar_make() # states1, no as_tibble()
● start target states1
● built target states1
● start target states2
● built target states2
● start target states
● built target states
● start target all_state_data
Inventorying sites in WI
● built target all_state_data
● start branch nwis_data_f21055a5
Fetching data for WI
● built branch nwis_data_f21055a5
● built pattern nwis_data
● end pipeline
> str(tar_read(all_state_data))
tibble [1 × 10] (S3: tbl_df/tbl/data.frame)
$ state_cd : chr "WI"
$ site_no : chr "04073500"
$ station_nm : chr "FOX RIVER AT BERLIN, WI"
$ dec_lat_va : num 44
$ dec_long_va : num -89
$ dec_coord_datum_cd: chr "NAD83"
$ begin_date : Date[1:1], format: "1898-01-01"
$ end_date : Date[1:1], format: "2021-06-24"
$ count_nu : num 45096
$ tar_group : int 1
With as_tibble()
, attributes persist.
> tar_make() # states1, yes as_tibble()
✓ skip target states1
✓ skip target states2
✓ skip target states
● start target all_state_data
Inventorying sites in WI
● built target all_state_data
● start branch nwis_data_f9909b60
Fetching data for WI
● built branch nwis_data_f9909b60
● built pattern nwis_data
● end pipeline
> str(tar_read(all_state_data))
tibble [1 × 10] (S3: tbl_df/tbl/data.frame)
$ state_cd : chr "WI"
$ site_no : chr "04073500"
$ station_nm : chr "FOX RIVER AT BERLIN, WI"
$ dec_lat_va : num 44
$ dec_long_va : num -89
$ dec_coord_datum_cd: chr "NAD83"
$ begin_date : Date[1:1], format: "1898-01-01"
$ end_date : Date[1:1], format: "2021-06-24"
$ count_nu : num 45096
$ tar_group : int 1
- attr(*, "comment")= chr [1:42] "#" "#" "# US Geological Survey" "# retrieved: 2021-06-25 13:12:47 -04:00\t(caas01)" ...
- attr(*, "queryTime")= POSIXct[1:1], format: "2021-06-25 13:12:48"
- attr(*, "url")= chr "https://waterservices.usgs.gov/nwis/site/?seriesCatalogOutput=true¶meterCd=00060&stateCd=WI&hasDataType=dv&"| __truncated__
- attr(*, "headerInfo")=List of 12
..$ date : chr "Fri, 25 Jun 2021 17:12:47 GMT"
..$ server : chr "Apache-Coyote/1.1"
..$ strict-transport-security : chr "max-age=31536000"
..$ vary : chr "Accept-Encoding"
..$ content-encoding : chr "gzip"
..$ content-type : chr "text/plain;charset=UTF-8"
..$ cache-control : chr "max-age=900"
..$ expires : chr "Fri, 25 Jun 2021 17:27:47 GMT"
..$ x-ua-compatible : chr "IE=edge,chrome=1"
..$ access-control-allow-origin: chr "*"
..$ x-frame-options : chr "deny"
..$ transfer-encoding : chr "chunked"
..- attr(*, "class")= chr [1:2] "insensitive" "list"
Oh yes, here is a really simple example of that:
df <- data.frame(x = 1:6, y = letters[1:6], stringsAsFactors = FALSE)
attr(df, "testattr") <- "Hello, I am an attribute"
df %>%
as_tibble() %>%
select(x) %>%
str()
tibble [6 x 1] (S3: tbl_df/tbl/data.frame)
$ x: int [1:6] 1 2 3 4 5 6
- attr(*, "testattr")= chr "Hello, I am an attribute"
df %>%
select(x) %>%
str()
'data.frame': 6 obs. of 1 variable:
$ x: int 1 2 3 4 5 6
So because find_oldest_site()
had a select()
call later in the function, doing as_tibble() before meant that select didn't cause the attributes to drop. It is weird to me that some dplyr verbs drop attributes and some don't. filter()
doesn't have this affect:
df %>%
filter(x != 3) %>%
str()
'data.frame': 5 obs. of 2 variables:
$ x: int 1 2 4 5 6
$ y: chr "a" "b" "d" "e" ...
- attr(*, "testattr")= chr "Hello, I am an attribute"
Yes, I was also surprised to find that only select
does it (of the verbs I've tried so far, anyway). I think explicitly dropping attributes may be a good idea anyway because then we'll have a visible reminder that dataRetrieval includes attributes. Or at least a comment about how you're achieving the dropping of attributes.
I guess I just use select()
so often that I never noticed this impact. I might think about adding to the "tips and tricks" (Pipelines II) training somewhere.
Made this obvious in the template repo via https://github.com/USGS-R/ds-pipelines-targets-3-template/commit/bde19d7fc9070925a71589ea04ab26934adb07b5
Calling this done-zo since the actual issues have been identified and fixed/coded around.
Still trying to solve a problem where we have one target that contains data for all the states and then we need to use that to fetch data for each state in a branching set up. Stuck because adding states causes the pipeline to rebuild everything if it changes the alphabetic order.
A small, reproducible example. Here is what I currently have for
_targets.R
:Result when using
tar_target(states, c('MI', 'MN'))
✔️Result when using
tar_target(states, c('MI', 'MN', 'PA'))
✔️Result when using
tar_target(states, c('MI', 'MN', 'PA', 'AL'))
❌