DOI-USGS / national-flow-observations

This repository pulls national flow data from NWIS
Other
4 stars 8 forks source link

Change to 3 target system for long-running targets #14

Closed lindsayplatt closed 3 years ago

lindsayplatt commented 4 years ago

This essentially applies to all of the targets in this pipeline, but we should pull the s3_put outside of the initial build fxn so that if the s3_put fails (for instance, when disconnected from VPN) the whole build doesn't need to happen again. Please see the section that starts with 3 target method in the Guidelines page here: https://usgs-r.github.io/scipiper/shared_cache.html.

wdwatkins commented 3 years ago

Has this already been done? Looks like the push to S3 doesn't happen until all the data is already downloaded in each task for the data partitions, and would be written to disk. The S3 push happens in the finalize function:https://github.com/USGS-R/national-flow-observations/blob/master/10_nwis_pull/src/nwis_pull.R#L59 https://github.com/USGS-R/national-flow-observations/blob/master/10_nwis_pull/src/nwis_combine_functions.R#L22

lindsayplatt commented 3 years ago

Oh wow, have to refresh my memory on this pipeline. I don't think this has happened yet since this issue was made on June 25 and the last commit was made June 24.

I think this is saying that the current two-target approach (particularly in the download step), should be turned into a three-target approach, where s3_put is called for a separate target between these ones https://github.com/USGS-R/national-flow-observations/blob/7981cfe1deb4030c821911b93a96902089fc6ea3/10_nwis_pull.yml#L44-L50 So this line would become its own target rather than part of this function. https://github.com/USGS-R/national-flow-observations/blob/7981cfe1deb4030c821911b93a96902089fc6ea3/10_nwis_pull/src/inventory_nwis.R#L22

If I can remember correctly, this was because sometimes during this lengthy process, your S3 session would timeout and you would get an error right at the end so we wanted to break those steps apart.

lindsayplatt commented 3 years ago

WOAH THAT PREVIEW IS CRAZY COOL!

limnoliver commented 3 years ago

Wow, that's awesome and will save lots of clicking!

wdwatkins commented 3 years ago

An S3 timeout wouldn't require downloading everything again though, right? Just re-checking all the download tasks, which I suppose could be lengthy?

lindsayplatt commented 3 years ago

That is the behavior we want but because the s3_put is called within the function that does the downloading, a failure would result in the download targets needing to be rebuilt. So we want to separate into a download target and then an upload target, rather than both together which is how they are written now.