water quality portal: round 1 data pull code

The full data pull is now running on my computer with the code in this pull request. Raw LA and WI files are already pulled at https://drive.google.com/drive/folders/1FugFAWJ1BaCm0yUt6BjeRQ3PF3gsH-mh (there is no LA cdom).

WQP data are known to be inconsistent, so these raw data files have limited utility - in particular, observations are likely to have varying units. A future pull request (PR) will include munging functions like those at https://github.com/USGS-R/necsc-lake-modeling/blob/master/scripts/download_munge_wqp.R.

I can also add code (later) to combine data by state, constituent, or both. @matthewross07, let me know which format would be best for you.

This code uses task remake files and Google Drive helpers as implemented in the cutting-edge version of the scipiper package at https://github.com/USGS-R/scipiper (v0.0.2). Those updates were pushed today, so you will need to update the version installed on your computer.

Google Drive:

gd_put and gd_get transfer files to and from Google Drive, respectively. They use local indicator files (ending in .ind) that sit at exactly the same relative path as the data file, both on Google Drive and on the local computer. For example, I'll be committing 1_wqdata/out/wqp/Wisconsin_secchi_001.feather.ind soon. This file is a placeholder for 1_wqdata/out/wqp/Wisconsin_secchi_001.feather on your computer (in the watersat project) and for 1_wqdata/out/wqp/Wisconsin_secchi_001.feather on Google Drive (in the watersat Drive folder, which is known to the Git project by the Google ID labeled project_folder in lib/cfg/gd_config.yml).
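As a minimal sketch of the naming convention described above, the indicator path is just the data path plus an .ind suffix. The helpers to_ind_path and to_data_path are hypothetical names written in base R for illustration; they are not scipiper functions.

```r
# Hypothetical helpers (not part of scipiper): convert between a data file
# path and its indicator file path by adding or stripping the ".ind" suffix.
to_ind_path <- function(data_path) paste0(data_path, ".ind")
to_data_path <- function(ind_path) sub("\\.ind$", "", ind_path)

data_file <- "1_wqdata/out/wqp/Wisconsin_secchi_001.feather"
ind_file <- to_ind_path(data_file)

# The same relative path applies locally (under the project root) and on
# Google Drive (under the project folder), so one .ind file stands for both.
print(ind_file)
print(to_data_path(ind_file))
```

Only the small .ind file would be committed to Git; the data file itself stays on Drive and on whichever computers have fetched it.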

By combining indicator files with the shared-cache functionality offered by scmake(), I'm hopeful that only one of us (me) will need to do the WQP pull. The rest of us will simply pull from Drive when we run scmake on targets that require these files, even though the WQP pull is fully documented in remake.yml and tasks_1_wqp.yml. This shared-cache functionality is pretty new and will need some testing. What you'll see for now is a bunch of very small files with cryptic names in the build/status folder, which I'll be committing later. These files are what persuade scmake that you can go to Drive instead of to WQP to get the data.

Templated remake file (task makefile):

The calls to plan_wqp_pull and create_wqp_pull_makefile create a very long remake file named tasks_1_wqp.yml which has targets such as these:

  # --- Wisconsin_cdom_001 --- #

  partition_Wisconsin_cdom_001:
    command: filter_partitions(wqp_pull_partitions, I('Wisconsin_cdom_001'))

  1_wqdata/tmp/wqp/Wisconsin_cdom_001.feather.ind:
    command: get_wqp_data(
      ind_file=target_name,
      partition=partition_Wisconsin_cdom_001,
      wq_dates=wq_dates)

  1_wqdata/out/wqp/Wisconsin_cdom_001.feather.ind:
    command: gd_put(
      remote_ind=target_name,
      local_source='1_wqdata/tmp/wqp/Wisconsin_cdom_001.feather.ind',
      mock_get=I('move'),
      on_exists=I('replace'))

  1_wqdata/out/wqp/Wisconsin_cdom_001.feather:
    command: gd_get(
      ind_file='1_wqdata/out/wqp/Wisconsin_cdom_001.feather.ind')

  # --- Wisconsin_chlorophyll_001 --- #

  partition_Wisconsin_chlorophyll_001:
    command: filter_partitions(wqp_pull_partitions, I('Wisconsin_chlorophyll_001'))

  1_wqdata/tmp/wqp/Wisconsin_chlorophyll_001.feather.ind:
    command: get_wqp_data(
      ind_file=target_name,
      partition=partition_Wisconsin_chlorophyll_001,
      wq_dates=wq_dates)

  1_wqdata/out/wqp/Wisconsin_chlorophyll_001.feather.ind:
    command: gd_put(
      remote_ind=target_name,
      local_source='1_wqdata/tmp/wqp/Wisconsin_chlorophyll_001.feather.ind',
      mock_get=I('move'),
      on_exists=I('replace'))

  1_wqdata/out/wqp/Wisconsin_chlorophyll_001.feather:
    command: gd_get(
      ind_file='1_wqdata/out/wqp/Wisconsin_chlorophyll_001.feather.ind')

for every state/constituent combination (and sometimes partitioned further as indicated by 001, 002, etc.).
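To make the template pattern concrete, here is a rough base R sketch of how one task name maps to the four targets shown above. The helper task_targets and its element names are hypothetical and only illustrate the naming; the actual makefile is written by create_wqp_pull_makefile.

```r
# Hypothetical helper: list the four target names generated for one task.
task_targets <- function(task_name) {
  c(
    partition = sprintf("partition_%s", task_name),
    pull_ind = sprintf("1_wqdata/tmp/wqp/%s.feather.ind", task_name),
    drive_ind = sprintf("1_wqdata/out/wqp/%s.feather.ind", task_name),
    local_data = sprintf("1_wqdata/out/wqp/%s.feather", task_name)
  )
}

# Task names follow the pattern <state>_<constituent>_<partition number>.
task_name <- sprintf("%s_%s_%03d", "Wisconsin", "cdom", 1L)
targets <- task_targets(task_name)
print(targets)
```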

The recipe to build all of the state/constituent/partition targets is in remake.yml:

  1_wqdata/log/tasks_1_wqp.ind:
    command: loop_tasks(
      task_plan=wqp_pull_plan, task_makefile='tasks_1_wqp.yml',
      task_names=I(NULL),
      num_tries=I(30), sleep_on_error=I(20))

which uses my freshly spruced-up scipiper::loop_tasks to loop over the tasks and steps in tasks_1_wqp.yml. You can subset to specific tasks or steps by specifying non-null values of task_names or step_names, which I did while testing and to get LA and WI up there first, but I've now graduated to running all of them in one big loop with retries.
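The general idea of looping with retries can be sketched in base R as follows. This is not the scipiper implementation: loop_with_retries, flaky_task, and the simulated failure are hypothetical, and the real loop_tasks may handle sleeping and bookkeeping differently.

```r
# Hypothetical stand-in for a retry loop: try each remaining task, then
# retry only the failures in later passes, up to num_tries passes.
loop_with_retries <- function(task_names, run_task, num_tries = 30, sleep_on_error = 0) {
  remaining <- task_names
  for (attempt in seq_len(num_tries)) {
    if (length(remaining) == 0) break
    message(sprintf("### Starting loop attempt %d of %d with %d tasks remaining",
                    attempt, num_tries, length(remaining)))
    failed <- character(0)
    for (task in remaining) {
      ok <- tryCatch({
        run_task(task)
        TRUE
      }, error = function(e) {
        message(sprintf("Task %s failed: %s", task, conditionMessage(e)))
        Sys.sleep(sleep_on_error)  # pause before moving on after an error
        FALSE
      })
      if (!ok) failed <- c(failed, task)
    }
    remaining <- failed
  }
  remaining  # tasks that never succeeded
}

# Toy task that fails on its first call for one task name (simulated only).
calls <- new.env()
flaky_task <- function(task) {
  n <- if (exists(task, envir = calls, inherits = FALSE)) get(task, envir = calls) else 0
  assign(task, n + 1, envir = calls)
  if (task == "Alaska_tss_001" && n == 0) stop("simulated WQP timeout")
  invisible(TRUE)
}

unfinished <- loop_with_retries(
  c("Alabama_tss_001", "Alaska_tss_001"), flaky_task,
  num_tries = 3, sleep_on_error = 0)
```

In this toy run the failed task is retried on the second pass, while the task that already succeeded is not run again.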

Efficiency notes:

The WQP pull is chunked into sets of sites and parameter codes so that the expected number of observations is probably less than 250,000 per call to readWQPdata. A previous project used a chunk size of 500,000 successfully, but I wanted the smaller chunk size to fully test out the task table functions - if we'd stuck to the larger chunks, just about every state would have a single pull per constituent, but this way there are a few states with several pulls per constituent.
I kept states and constituents separate for several reasons: it let me test a few states at first (this worked! see cc0bb33 for the code transition from pulling just LA and WI to pulling all states); it would allow @wdwatkins's project to progressively add states without having to repull the old ones; it uses geospatial info that's available for all sites (HUCs aren't reported for all sites =P, but states are); and it would minimize the number of files that need to be re-pulled if we adjust the lists of constituents or the parameter codes within a constituent.
The guidance I remember from WQP-expert colleagues was: (1) Run these pulls one request at a time; multiple simultaneous requests could crash WQP. (2) Subset spatially rather than temporally: loop over states or HUCs rather than years. (3) Run whole states at a time when possible. (4) It's probably faster to pull all constituents for some sites than to pull some constituents for more sites, given equal numbers of observations in each case. I went with the latter anyway to preserve flexibility in adding new constituents or revising the parameter code lists later. (5) Minimize the overall number of constraints that WQP needs to use to pick out the requested sites and observations.
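The chunking described at the start of these notes can be illustrated with a simple greedy scheme. This is only a sketch under stated assumptions: the inventory values are invented, assign_partitions is a hypothetical helper, and the actual partitioning code in this PR may group sites and parameter codes differently.

```r
# Hypothetical inventory: expected observation counts per site for one
# state/constituent combination (values invented for illustration).
inventory <- data.frame(
  site = paste0("site_", 1:6),
  n_obs = c(120000, 90000, 60000, 200000, 30000, 10000),
  stringsAsFactors = FALSE
)

# Greedy partitioning: walk through the sites in order and start a new
# partition whenever adding the next site would exceed the target size.
# A single site larger than max_obs still gets a partition of its own.
assign_partitions <- function(n_obs, max_obs = 250000) {
  partition <- integer(length(n_obs))
  current <- 1L
  running <- 0
  for (i in seq_along(n_obs)) {
    if (running > 0 && running + n_obs[i] > max_obs) {
      current <- current + 1L
      running <- 0
    }
    running <- running + n_obs[i]
    partition[i] <- current
  }
  partition
}

# Label partitions 001, 002, ... to match the task naming convention.
inventory$partition <- sprintf("%03d", assign_partitions(inventory$n_obs))
partition_totals <- aggregate(n_obs ~ partition, data = inventory, FUN = sum)
print(partition_totals)
```

With a larger max_obs such as 500,000, the same toy inventory would fall into fewer partitions, which matches the tradeoff noted above between chunk size and the number of pulls per constituent.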

Example output from loop_tasks:

> library(scipiper)
USGS Support Package: https://owi.usgs.gov/R/packages.html#support
> scmake('1_wqdata/log/tasks_1_wqp.ind')
Starting build at 2017-11-19 13:59:50
<  MAKE > 1_wqdata/log/tasks_1_wqp.ind
[    OK ] wqp_state_codes
[    OK ] 1_wqdata/tmp/wqp
[    OK ] 1_wqdata/out/wqp
[    OK ] 1_wqdata/log
[    OK ] wqp_states
[    OK ] wqp_codes
[    OK ] wqp_pull
[    OK ] wqp_pull_folders
[    OK ] 1_wqdata/out/wqp_inventory.feather.ind
[    OK ] wqp_pull_partitions
[ BUILD ] wqp_pull_plan                          |  wqp_pull_plan <- plan_wqp_pull(partitions = wqp_pull_partitions, folders = wqp_pull_folders)
[  READ ]                                        |  # loading packages
[ BUILD ] tasks_1_wqp.yml                        |  create_wqp_pull_makefile(makefile = "tasks_1_wqp.yml", task_plan = wqp_pull_plan)
run all tasks with
1_wqdata/log/tasks_1_wqp.ind:
  command: make(I('tasks_1_wqp'), remake_file='tasks_1_wqp.yml')
[ BUILD ] 1_wqdata/log/tasks_1_wqp.ind           |  loop_tasks(task_plan = wqp_pull_plan, task_makefile = "tasks_1_wqp.yml", task_names = NULL, num...

### Starting loop attempt 1 of 30 with 186 tasks remaining:
Building task 1 of 186 in loop: 1_wqdata/out/wqp/Alabama_chlorophyll_001.feather.ind (#1 of 186 total)
WQP pull for Alabama_chlorophyll_001 took 80.22 seconds and returned 27574 rows
Auto-refreshing stale OAuth token.
Building task 2 of 186 in loop: 1_wqdata/out/wqp/Alabama_secchi_001.feather.ind (#2 of 186 total)
WQP pull for Alabama_secchi_001 took 85.02 seconds and returned 12865 rows
Building task 3 of 186 in loop: 1_wqdata/out/wqp/Alabama_tss_001.feather.ind (#3 of 186 total)
WQP pull for Alabama_tss_001 took 110.23 seconds and returned 42784 rows
Building task 4 of 186 in loop: 1_wqdata/out/wqp/Alaska_chlorophyll_001.feather.ind (#4 of 186 total)
WQP pull for Alaska_chlorophyll_001 took 63.5799999999999 seconds and returned 2086 rows
Building task 5 of 186 in loop: 1_wqdata/out/wqp/Alaska_secchi_001.feather.ind (#5 of 186 total)
WQP pull for Alaska_secchi_001 took 54.05 seconds and returned 881 rows
Building task 6 of 186 in loop: 1_wqdata/out/wqp/Alaska_tss_001.feather.ind (#6 of 186 total)
WQP pull for Alaska_tss_001 took 51.26 seconds and returned 1172 rows
Building task 7 of 186 in loop: 1_wqdata/out/wqp/American_Samoa_chlorophyll_001.feather.ind (#7 of 186 total)
WQP pull for American_Samoa_chlorophyll_001 took 19.61 seconds and returned 132 rows
Building task 8 of 186 in loop: 1_wqdata/out/wqp/Arizona_chlorophyll_001.feather.ind (#8 of 186 total)
WQP pull for Arizona_chlorophyll_001 took 65.6900000000001 seconds and returned 4260 rows
Building task 9 of 186 in loop: 1_wqdata/out/wqp/Arizona_secchi_001.feather.ind (#9 of 186 total)
WQP pull for Arizona_secchi_001 took 58.1700000000001 seconds and returned 3160 rows
