
Active development has moved to https://code.usgs.gov/wma/wp/drb-inland-salinity-ml

drb-inland-salinity-ml

This repository contains a pipeline for data gathering, data processing, modeling, and visualization of results for machine learning models that predict salinity in inland reaches of the Delaware River Basin (DRB). The pipeline is built with targets version 0.11.0 and may not work with earlier versions.
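A minimal sketch of kicking off a build, assuming the targets package (>= 0.11.0) is installed and your working directory is the repository root:

```r
# Build the pipeline from the repository root.
library(targets)

tar_make()        # builds (or skips) each target in dependency order
tar_visnetwork()  # optional: view the dependency graph (needs the visNetwork package)
```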

Note: several targets are currently set to never update: the NWIS specific conductivity data targets (described further in the "Building locally" section of this README) and the Boruta attribute screening. The attribute screening is not being updated because it determines which features are used in the models, so freezing it keeps the model feature set stable.
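In targets, pinning a target so it never updates is done with a cue. A hedged sketch (the target name and fetch function below are illustrative, not the repo's actual code):

```r
library(targets)

# Hypothetical NWIS specific conductivity target, pinned so it is built
# once and then never rebuilt, even if upstream dependencies change.
tar_target(
  p1_nwis_sc_data,               # hypothetical target name
  fetch_nwis_sc(p1_site_ids),    # hypothetical download function
  cue = tar_cue(mode = "never")
)
```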

New Users and Reviewers

Follow the steps below to build the pipeline with S3 from within the Singularity container on Tallgrass. For reviews, the code developer should provide the directory of the code to review, the name of the container used, and the Slurm script used to run the targets or to inspect the generated data.

Building the pipeline with S3 (default)

The default behavior is to build the pipeline with S3 as a shared data repository. To build the pipeline with S3 storage, you will need to provide AWS credentials; we have been doing this with the saml2aws tool. See also @amsnyder's post on this.
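Once saml2aws has written a credentials file, the R session running the build just needs to be able to find it. A rough sketch (the profile name is an assumption; saml2aws typically writes a profile called "saml"):

```r
# Confirm the credentials written by saml2aws are visible to this R session.
file.exists("~/.aws/credentials")   # should be TRUE after saml2aws login
Sys.setenv(AWS_PROFILE = "saml")    # assumed profile name; adjust if yours differs

targets::tar_make()  # targets stored with repository = 'aws' are fetched from S3
```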

Note: If an error is thrown when fetching a target from S3 (e.g., a timeout while downloading), the targets meta file will record an error for that target, and the target will be rebuilt the next time tar_make() is run. If you do not want the target to rebuild and instead want to try fetching it again, you can git reset _targets/meta/meta and then run tar_make().
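A sketch of that recovery step, run from the repository root (shown with git restore, which discards the unstaged change to the meta file; use whichever git command fits your workflow):

```r
# Throw away the errored entry in the targets metadata, then resume the build.
# Assumes git (>= 2.23 for `git restore`) is available on the PATH.
system("git restore _targets/meta/meta")
targets::tar_make()  # retries the S3 fetch instead of rebuilding the target
```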

Access to credentials when using a container

Singularity on Tallgrass will generate and access the credentials without any extra steps because it shares the host file system. Docker, however, does not share the host file system, so the credentials file needs to be mounted from the host into the Docker container. To do this, you can generate credentials manually with saml2aws login and paste the .aws/credentials file into the mounted repository directory. Note - when you run saml2aws login, make sure you choose the gs-chs-wma-dev option because that is where the S3 bucket lives.

Building locally

Most of the time you should not need to build the pipeline with locally stored targets - you will usually build it with S3 as the repository (see above). If local builds are needed for a particular target, add repository = 'local' to that tar_target() call. To build the full pipeline locally, change the repository option in tar_option_set() in the _targets.R file and set NWIS_repository to 'local' in 1_fetch.R. WARNING - a global repository = 'local' will trigger a rebuild of all targets. Alternatively, the NWIS download targets can be kept out of a local rebuild: leave NWIS_repository = 'aws' in 1_fetch.R so that those targets are pulled from S3 instead of rebuilt. Note that if these targets are pulled from S3, you will still need to log in with AWS credentials to build the pipeline. Committing the targets meta file after switching to 'local' is not recommended, because switching back to repository = 'aws' will trigger a rebuild of the targets that were built locally.
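A hedged sketch of those two switches (the target name and functions are illustrative, not the repo's actual code):

```r
library(targets)

# Per-target: store just this one target locally instead of on S3.
tar_target(
  p2_example_attrs,                 # hypothetical target name
  compute_attrs(p1_example_data),   # hypothetical command
  repository = "local"
)

# Global: in _targets.R, make 'local' the default for every target.
# WARNING (per the note above): this triggers a rebuild of all targets.
tar_option_set(repository = "local")
```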

References

The study area segment shapefiles are from:

Oliver, S.K., Appling, A.A., Atshan, R., Watkins, W.D., Sadler, J., Corson-Dosch, H., Zwart, J.A., and Read, J.S., 2021, Data release: Predicting water temperature in the Delaware River Basin: U.S. Geological Survey data release, https://doi.org/10.5066/P9GD8I7A.