SimonTopp / drb_generalizability_data_release

This is the model archive and data release associated with Topp et al. 2022
Creative Commons Zero v1.0 Universal

Process Step #4

Closed amsnyder closed 1 year ago

amsnyder commented 1 year ago

I think your process step in the metadata should give an overview of what is happening in your model code. You can describe this in multiple steps. I will need to figure out how to include multiple steps using our pipeline/template, but here's an idea of the type of content you want to include, taken from this model archive.

  1. Download and process NWIS streamflow data. A set of scripts (https://code.usgs.gov/map/maprandomforest) was used to download daily streamflow data from the National Water Information System (NWIS), clean up the dataset, calculate baseflow separation by means of the PART algorithm, and aggregate the data by month and year (see the first sketch after this list).

  2. Watershed characteristics associated with the streamflow points of interest were calculated. Processing starts with a shapefile that captures the upstream drainage area associated with each streamflow point of interest. For training the random forest model, this was a shapefile of the basins associated with each stream gage. A shapefile of ungaged basins was used to generate watershed characteristics for points of interest at which no streamflow gaging exists.

    Zonal statistics were calculated for each basin and included: 1) the predominant surficial geology classification; 2) elevation statistics (min, max, mean, std. deviation); 3) weather statistics (tmin, tmax, precipitation, and several lagged derivatives); and 4) estimated Hargreaves-Samani reference ET0, calculated from the weather data (see the second sketch after this list).

  3. Data assembly. Processed NWIS data (step 1) were joined to the processed watershed characteristics (step 2) to produce a file that could be used to train the random forest model (testFlowDat.csv). The files 'testGagesDat.csv' and 'testUngagedDat.csv' were processed similarly, but contain no baseflow or total flow data items; these represent inputs that the random forest model can use to produce flow estimates at gaged sites with missing records, or at ungaged sites (see the third sketch after this list).

  4. Random forest training. The random forest model was trained on the inputs contained in 'testFlowDat.csv'. Model testing revealed that neither the elevation data nor the surficial geology classifications appreciably improved the model outputs; consequently, although the inputs contain columns with the Reed and Bush surficial geology and the elevation statistics, these were not used in the final random forest model.

    The random forest training process produced two random forest 'objects': one was trained on total surface flow (original_trained_rf_model_object__total_flow.RData), the other on stream baseflows (original_trained_rf_model_object__baseflow.RData). These random forest objects may be fed new input data in order to estimate baseflow and total streamflow at other ungaged sites (training is sketched after this list).

  5. Apply random forest model. The files 'testGagedDat.csv' and 'testUngagedDat.csv' were used as input to the random forest model to fill in missing slices of data for gaged and ungaged basins, respectively. The output of this step was captured in the file 'XXXXX'. A second set of estimates was generated for a second set of points (XXXX.csv). (Application of the trained objects is sketched after this list.)
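
The sketches below are not the archived scripts; they are rough illustrations of each step. For step 1, assuming the dataRetrieval and dplyr R packages, with placeholder site numbers, dates, and column names (the PART baseflow separation is only noted as a comment), the download and monthly aggregation might look like:

```r
library(dataRetrieval)
library(dplyr)

# Placeholder gage list; the real site numbers come from the basins of interest.
sites <- c("01434000", "01435000")

# Pull daily streamflow (parameter 00060, discharge in cfs) from NWIS.
daily_q <- readNWISdv(siteNumbers = sites,
                      parameterCd = "00060",
                      startDate   = "1980-01-01",
                      endDate     = "2020-12-31") %>%
  renameNWISColumns()   # exposes a tidy 'Flow' column

# Baseflow separation with the PART algorithm would be applied to the daily
# record at this point; it is omitted from this sketch.

# Aggregate the cleaned daily record by month and year.
monthly_q <- daily_q %>%
  mutate(year = format(Date, "%Y"), month = format(Date, "%m")) %>%
  group_by(site_no, year, month) %>%
  summarise(mean_flow_cfs = mean(Flow, na.rm = TRUE), .groups = "drop")
```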
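
For step 2, the archive does not name the zonal-statistics tooling; one possible version uses the sf, terra, and exactextractr packages, with placeholder file and column names, plus the standard Hargreaves-Samani ET0 formula:

```r
library(sf)
library(terra)
library(exactextractr)

# Placeholder inputs: basin polygons plus gridded elevation and geology.
basins <- st_read("gaged_basins.shp")
elev   <- rast("elevation.tif")
geol   <- rast("surficial_geology.tif")

# Elevation statistics (min, max, mean, std. deviation) per basin.
elev_stats <- exact_extract(elev, basins, c("min", "max", "mean", "stdev"))

# Predominant (majority) surficial geology class per basin.
geol_class <- exact_extract(geol, basins, "majority")

basin_chars <- cbind(st_drop_geometry(basins), elev_stats, geology = geol_class)

# Hargreaves-Samani reference ET0 (mm/day) from daily temperature extremes,
# where `ra` is extraterrestrial radiation expressed as equivalent evaporation.
hargreaves_et0 <- function(tmin, tmax, ra) {
  tmean <- (tmin + tmax) / 2
  0.0023 * ra * (tmean + 17.8) * sqrt(tmax - tmin)
}
```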
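
For step 3, assuming readr/dplyr, a shared 'site_no' key, and placeholder names for the intermediate files from steps 1 and 2, the join is essentially:

```r
library(dplyr)
library(readr)

# Placeholder file names for the products of steps 1 and 2.
nwis_monthly <- read_csv("processed_nwis_monthly.csv")
basin_chars  <- read_csv("watershed_characteristics.csv")

# Join flow records to watershed characteristics on the gage identifier to
# build the random forest training table.
train_dat <- inner_join(nwis_monthly, basin_chars, by = "site_no")
write_csv(train_dat, "testFlowDat.csv")
```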
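
For step 4, training with the randomForest package; the predictor and response column names are placeholders, and elevation and surficial geology are deliberately left out of the formula, matching the description above:

```r
library(randomForest)
library(readr)

train_dat <- read_csv("testFlowDat.csv")

# Illustrative predictor set: weather statistics and reference ET0 only.
# Elevation and surficial geology columns exist in the inputs but are not
# included in the formula.
rf_total <- randomForest(total_flow ~ tmin + tmax + precip + et0,
                         data = train_dat, ntree = 500, importance = TRUE)
rf_base  <- randomForest(baseflow ~ tmin + tmax + precip + et0,
                         data = train_dat, ntree = 500, importance = TRUE)

save(rf_total, file = "original_trained_rf_model_object__total_flow.RData")
save(rf_base,  file = "original_trained_rf_model_object__baseflow.RData")
```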
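
For step 5, the saved objects are loaded and given the gaged and ungaged input tables; the estimate column names are placeholders, and the output files (the 'XXXXX' placeholders above) are not named here:

```r
library(randomForest)
library(readr)

# load() restores the fitted objects under the names they were saved with;
# `rf_total` and `rf_base` follow the training sketch above.
load("original_trained_rf_model_object__total_flow.RData")
load("original_trained_rf_model_object__baseflow.RData")

gaged_new   <- read_csv("testGagedDat.csv")
ungaged_new <- read_csv("testUngagedDat.csv")

# Estimate total flow and baseflow where observed records are missing.
gaged_new$total_flow_est   <- predict(rf_total, newdata = gaged_new)
gaged_new$baseflow_est     <- predict(rf_base,  newdata = gaged_new)
ungaged_new$total_flow_est <- predict(rf_total, newdata = ungaged_new)
ungaged_new$baseflow_est   <- predict(rf_base,  newdata = ungaged_new)
```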

SimonTopp commented 1 year ago

Updated the process description to be more detailed, akin to the example.