GlobalHydrologyLab / AquaSat

Monitoring water quality from space!
MIT License

water quality portal: round 1 data pull code #3

Closed aappling-usgs closed 6 years ago

aappling-usgs commented 6 years ago

The full data pull is now running on my computer with the code in this pull request. Raw files for Louisiana (LA) and Wisconsin (WI) are already available at https://drive.google.com/drive/folders/1FugFAWJ1BaCm0yUt6BjeRQ3PF3gsH-mh (there is no cdom data for LA).

WQP data are known to be inconsistent, so these raw data files have limited utility; in particular, observations are likely to have varying units. A future pull request (PR) will include munging functions like those at https://github.com/USGS-R/necsc-lake-modeling/blob/master/scripts/download_munge_wqp.R.

I can also add code (later) to combine data by state, constituent, or both. @matthewross07 , let me know which format would be best for you.
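As a rough illustration of what combining by state, constituent, or both could look like, the sketch below groups task names of the form used in this pull (for example Wisconsin_cdom_001). It is not code from this PR: parse_task_name is a hypothetical helper, and the sketch assumes every task name ends in a constituent and a numeric partition, with everything before that being the state.

```r
# Hypothetical helper (not part of this PR): split a task name such as
# "Wisconsin_secchi_001" into state, constituent, and partition number.
parse_task_name <- function(task_names) {
  # The constituent and partition cannot contain underscores, so the first
  # group keeps multi-word states ("American_Samoa") intact.
  pattern <- "^(.+)_([^_]+)_([0-9]+)$"
  stopifnot(all(grepl(pattern, task_names)))
  data.frame(
    task = task_names,
    state = sub(pattern, "\\1", task_names),
    constituent = sub(pattern, "\\2", task_names),
    partition = as.integer(sub(pattern, "\\3", task_names)),
    stringsAsFactors = FALSE
  )
}

tasks <- parse_task_name(c(
  "Wisconsin_cdom_001", "Wisconsin_chlorophyll_001",
  "Alaska_tss_001", "American_Samoa_chlorophyll_001"
))

# Group task names by state, by constituent, or by both
by_state <- split(tasks$task, tasks$state)
by_constituent <- split(tasks$task, tasks$constituent)
by_both <- split(tasks$task, list(tasks$state, tasks$constituent), drop = TRUE)
```

Each list element holds the task names whose files would be combined into one output, so the same grouping could drive whichever format turns out to be most useful.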

This code uses task remake files and Google Drive helpers as implemented in the latest version of the scipiper package at https://github.com/USGS-R/scipiper (v0.0.2). Those updates were pushed today, so you will need to update your installed copy.

Google Drive:

gd_put and gd_get transfer files to and from Google Drive, respectively. Both rely on local indicator files (ending in .ind) that sit at the same relative path as the data file, both on Google Drive and on the local computer. For example, I'll soon be committing 1_wqdata/out/wqp/Wisconsin_secchi_001.feather.ind. This file is a placeholder for 1_wqdata/out/wqp/Wisconsin_secchi_001.feather on your computer (in the watersat project) and for 1_wqdata/out/wqp/Wisconsin_secchi_001.feather on Google Drive (in the watersat Drive folder, which the Git project identifies by the Google ID labeled project_folder in lib/cfg/gd_config.yml).
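To make the placeholder idea concrete, here is a minimal base-R sketch of the relationship between a data file and its indicator file. The helper names (ind_path, data_path, write_indicator) are hypothetical and are not scipiper functions. Storing an MD5 hash in the indicator is an assumption made for illustration; the actual contents written by gd_put may differ.

```r
# Assumption: an indicator file path is the data file path plus ".ind".
ind_path <- function(data_file) paste0(data_file, ".ind")
data_path <- function(ind_file) sub("\\.ind$", "", ind_file)

# Hypothetical stand-in for writing an indicator: record a hash of the data
# file so a change in the data can be detected from the tiny .ind file alone.
write_indicator <- function(data_file) {
  ind_file <- ind_path(data_file)
  hash <- unname(tools::md5sum(data_file))
  writeLines(paste("hash:", hash), ind_file)
  invisible(ind_file)
}

# Demonstrate with a throwaway file in the session's temporary directory
data_file <- file.path(tempdir(), "Wisconsin_secchi_001.feather")
writeLines("placeholder contents", data_file)
ind_file <- write_indicator(data_file)
```

Because the indicator is small and text-based, it can be committed to Git even when the data file it stands for is too large or too volatile to commit.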

By combining indicator files (ending in .ind) with the shared-cache functionality offered by scmake(), I'm hopeful that only one of us (me) will need to do the WQP pull. Everyone else will simply pull from Drive when running scmake on targets that require these files, even though the WQP pull is fully documented in remake.yml and tasks_1_wqp.yml. The shared-cache functionality is pretty new and will need some testing. For now, you'll see a bunch of very small files with cryptic names in the build/status folder, which I'll be committing later. Those files are the key to persuading scmake that you can go to Drive instead of to WQP to get the data.

Templated remake file (task makefile):

The calls to plan_wqp_pull and create_wqp_pull_makefile create a very long remake file named tasks_1_wqp.yml which has targets such as these:

  # --- Wisconsin_cdom_001 --- #

  partition_Wisconsin_cdom_001:
    command: filter_partitions(wqp_pull_partitions, I('Wisconsin_cdom_001'))

  1_wqdata/tmp/wqp/Wisconsin_cdom_001.feather.ind:
    command: get_wqp_data(
      ind_file=target_name,
      partition=partition_Wisconsin_cdom_001,
      wq_dates=wq_dates)

  1_wqdata/out/wqp/Wisconsin_cdom_001.feather.ind:
    command: gd_put(
      remote_ind=target_name,
      local_source='1_wqdata/tmp/wqp/Wisconsin_cdom_001.feather.ind',
      mock_get=I('move'),
      on_exists=I('replace'))

  1_wqdata/out/wqp/Wisconsin_cdom_001.feather:
    command: gd_get(
      ind_file='1_wqdata/out/wqp/Wisconsin_cdom_001.feather.ind')

  # --- Wisconsin_chlorophyll_001 --- #

  partition_Wisconsin_chlorophyll_001:
    command: filter_partitions(wqp_pull_partitions, I('Wisconsin_chlorophyll_001'))

  1_wqdata/tmp/wqp/Wisconsin_chlorophyll_001.feather.ind:
    command: get_wqp_data(
      ind_file=target_name,
      partition=partition_Wisconsin_chlorophyll_001,
      wq_dates=wq_dates)

  1_wqdata/out/wqp/Wisconsin_chlorophyll_001.feather.ind:
    command: gd_put(
      remote_ind=target_name,
      local_source='1_wqdata/tmp/wqp/Wisconsin_chlorophyll_001.feather.ind',
      mock_get=I('move'),
      on_exists=I('replace'))

  1_wqdata/out/wqp/Wisconsin_chlorophyll_001.feather:
    command: gd_get(
      ind_file='1_wqdata/out/wqp/Wisconsin_chlorophyll_001.feather.ind')

for every state/constituent combination (and sometimes partitioned further as indicated by 001, 002, etc.).
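The repetition in those targets comes from filling one template per task. The sketch below shows the general idea in base R for the first two steps of each task. It is not the implementation behind create_wqp_pull_makefile, whose internals are not shown in this PR; render_task, task_template, and the {{task}} placeholder syntax are all choices made for this illustration.

```r
# Hypothetical per-task template; {{task}} marks where the task name goes.
task_template <- c(
  "  # --- {{task}} --- #",
  "",
  "  partition_{{task}}:",
  "    command: filter_partitions(wqp_pull_partitions, I('{{task}}'))",
  "",
  "  1_wqdata/tmp/wqp/{{task}}.feather.ind:",
  "    command: get_wqp_data(",
  "      ind_file=target_name,",
  "      partition=partition_{{task}},",
  "      wq_dates=wq_dates)"
)

# Substitute one task name into every line of the template
render_task <- function(task_name, template = task_template) {
  gsub("{{task}}", task_name, template, fixed = TRUE)
}

task_names <- c("Wisconsin_cdom_001", "Wisconsin_chlorophyll_001")
makefile_lines <- unlist(lapply(task_names, render_task))
cat(makefile_lines, sep = "\n")
```

Generating the file this way keeps every task's targets structurally identical, so a fix to the template propagates to all state/constituent/partition combinations at once.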

The recipe to build all of the state/constituent/partition targets is in remake.yml:

  1_wqdata/log/tasks_1_wqp.ind:
    command: loop_tasks(
      task_plan=wqp_pull_plan, task_makefile='tasks_1_wqp.yml',
      task_names=I(NULL),
      num_tries=I(30), sleep_on_error=I(20))

This recipe uses my freshly spruced-up scipiper::loop_tasks to loop over the tasks and steps in tasks_1_wqp.yml. You can subset to specific tasks or steps by specifying non-null values of task_names or step_names. I did that while testing and to get LA and WI onto Drive first, but I've now graduated to running all of the tasks in one big loop with retries.
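For readers unfamiliar with the retry pattern, the following is a simplified base-R sketch of looping over tasks with retries. It is not the scipiper::loop_tasks implementation: loop_with_retries and flaky_task are hypothetical, the argument names num_tries and sleep_on_error are borrowed from the recipe above, and the choice to finish the current pass before retrying failed tasks is an assumption.

```r
# Hypothetical retry loop: each pass attempts all remaining tasks, and tasks
# that raise an error are carried over to the next pass.
loop_with_retries <- function(task_names, run_task,
                              num_tries = 30, sleep_on_error = 20) {
  remaining <- task_names
  for (attempt in seq_len(num_tries)) {
    if (length(remaining) == 0) break
    message(sprintf("### Starting loop attempt %d of %d with %d tasks remaining:",
                    attempt, num_tries, length(remaining)))
    failed <- character(0)
    for (task in remaining) {
      ok <- tryCatch({
        run_task(task)
        TRUE
      }, error = function(e) {
        message("Task ", task, " failed: ", conditionMessage(e))
        FALSE
      })
      if (!ok) {
        failed <- c(failed, task)
        Sys.sleep(sleep_on_error)  # pause before moving on after an error
      }
    }
    remaining <- failed
  }
  remaining  # tasks that never succeeded (empty if everything completed)
}

# Simulated task that fails the first time it is run for one task name
attempts <- new.env()
flaky_task <- function(task) {
  n <- if (is.null(attempts[[task]])) 0 else attempts[[task]]
  attempts[[task]] <- n + 1
  if (task == "Alaska_tss_001" && n == 0) stop("simulated timeout")
  invisible(TRUE)
}

unfinished <- loop_with_retries(
  c("Alabama_tss_001", "Alaska_tss_001"), flaky_task,
  num_tries = 3, sleep_on_error = 0
)
```

In this sketch, tasks that already succeeded are not repeated on later attempts, which is what makes a long pull tolerant of intermittent web service errors.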

Efficiency notes:

aappling-usgs commented 6 years ago

Example output from loop_tasks:

> library(scipiper)
USGS Support Package: https://owi.usgs.gov/R/packages.html#support
> scmake('1_wqdata/log/tasks_1_wqp.ind')
Starting build at 2017-11-19 13:59:50
<  MAKE > 1_wqdata/log/tasks_1_wqp.ind
[    OK ] wqp_state_codes
[    OK ] 1_wqdata/tmp/wqp
[    OK ] 1_wqdata/out/wqp
[    OK ] 1_wqdata/log
[    OK ] wqp_states
[    OK ] wqp_codes
[    OK ] wqp_pull
[    OK ] wqp_pull_folders
[    OK ] 1_wqdata/out/wqp_inventory.feather.ind
[    OK ] wqp_pull_partitions
[ BUILD ] wqp_pull_plan                          |  wqp_pull_plan <- plan_wqp_pull(partitions = wqp_pull_partitions, folders = wqp_pull_folders)
[  READ ]                                        |  # loading packages
[ BUILD ] tasks_1_wqp.yml                        |  create_wqp_pull_makefile(makefile = "tasks_1_wqp.yml", task_plan = wqp_pull_plan)
run all tasks with
1_wqdata/log/tasks_1_wqp.ind:
  command: make(I('tasks_1_wqp'), remake_file='tasks_1_wqp.yml')
[ BUILD ] 1_wqdata/log/tasks_1_wqp.ind           |  loop_tasks(task_plan = wqp_pull_plan, task_makefile = "tasks_1_wqp.yml", task_names = NULL, num...

### Starting loop attempt 1 of 30 with 186 tasks remaining:
Building task 1 of 186 in loop: 1_wqdata/out/wqp/Alabama_chlorophyll_001.feather.ind (#1 of 186 total)
WQP pull for Alabama_chlorophyll_001 took 80.22 seconds and returned 27574 rows
Auto-refreshing stale OAuth token.
Building task 2 of 186 in loop: 1_wqdata/out/wqp/Alabama_secchi_001.feather.ind (#2 of 186 total)
WQP pull for Alabama_secchi_001 took 85.02 seconds and returned 12865 rows
Building task 3 of 186 in loop: 1_wqdata/out/wqp/Alabama_tss_001.feather.ind (#3 of 186 total)
WQP pull for Alabama_tss_001 took 110.23 seconds and returned 42784 rows
Building task 4 of 186 in loop: 1_wqdata/out/wqp/Alaska_chlorophyll_001.feather.ind (#4 of 186 total)
WQP pull for Alaska_chlorophyll_001 took 63.5799999999999 seconds and returned 2086 rows
Building task 5 of 186 in loop: 1_wqdata/out/wqp/Alaska_secchi_001.feather.ind (#5 of 186 total)
WQP pull for Alaska_secchi_001 took 54.05 seconds and returned 881 rows
Building task 6 of 186 in loop: 1_wqdata/out/wqp/Alaska_tss_001.feather.ind (#6 of 186 total)
WQP pull for Alaska_tss_001 took 51.26 seconds and returned 1172 rows
Building task 7 of 186 in loop: 1_wqdata/out/wqp/American_Samoa_chlorophyll_001.feather.ind (#7 of 186 total)
WQP pull for American_Samoa_chlorophyll_001 took 19.61 seconds and returned 132 rows
Building task 8 of 186 in loop: 1_wqdata/out/wqp/Arizona_chlorophyll_001.feather.ind (#8 of 186 total)
WQP pull for Arizona_chlorophyll_001 took 65.6900000000001 seconds and returned 4260 rows
Building task 9 of 186 in loop: 1_wqdata/out/wqp/Arizona_secchi_001.feather.ind (#9 of 186 total)
WQP pull for Arizona_secchi_001 took 58.1700000000001 seconds and returned 3160 rows