Closed lindsayplatt closed 3 years ago
Has this already been done? Looks like the push to S3 doesn't happen until all the data is already downloaded in each task for the data partitions, and would be written to disk. The S3 push happens in the finalize function:https://github.com/USGS-R/national-flow-observations/blob/master/10_nwis_pull/src/nwis_pull.R#L59 https://github.com/USGS-R/national-flow-observations/blob/master/10_nwis_pull/src/nwis_combine_functions.R#L22
Oh wow, have to refresh my memory on this pipeline. I don't think this has happened yet since this issue was made on June 25 and the last commit was made June 24.
I think this is saying that the current two-target approach (particularly in the download step), should be turned into a three-target approach, where s3_put is called for a separate target between these ones https://github.com/USGS-R/national-flow-observations/blob/7981cfe1deb4030c821911b93a96902089fc6ea3/10_nwis_pull.yml#L44-L50 So this line would become its own target rather than part of this function. https://github.com/USGS-R/national-flow-observations/blob/7981cfe1deb4030c821911b93a96902089fc6ea3/10_nwis_pull/src/inventory_nwis.R#L22
If I can remember correctly, this was because sometimes during this lengthy process, your S3 session would timeout and you would get an error right at the end so we wanted to break those steps apart.
WOAH THAT PREVIEW IS CRAZY COOL!
Wow, that's awesome and will save lots of clicking!
An S3 timeout wouldn't require downloading everything again though, right? Just re-checking all the download tasks, which I suppose could be lengthy?
That is the behavior we want but because the s3_put
is called within the function that does the downloading, a failure would result in the download targets needing to be rebuilt. So we want to separate into a download target and then an upload target, rather than both together which is how they are written now.
This essentially applies to all of the targets in this pipeline, but we should pull the
s3_put
outside of the initial build fxn so that if thes3_put
fails (for instance, when disconnected from VPN) the whole build doesn't need to happen again. Please see the section that starts with3 target method
in the Guidelines page here: https://usgs-r.github.io/scipiper/shared_cache.html.