HARPgroup / HARParchive

This repo houses HARP code development items, resources, and intermediate work products.
1 stars 0 forks source link

Week 09/16 - 09/23 #1355

Open COBrogan opened 2 months ago

COBrogan commented 2 months ago
ilonah22 commented 2 months ago

Here is the path to the file I was working on today which finds max and min r-squared ranges on the data-analysis branch:

HARP-2024-2025/DrainageAreaAnalysis.R

COBrogan commented 2 months ago

Thanks @ilonah22 ! @nathanielf22 here is a useful GIS in R reference I've used in the past: https://r-spatial.github.io/sf/articles/sf1.html

Leafing through it, these tutorials seem helpful. The most relevant R code will be those that use the sf library as many other R spatial libraries are slotted for deprecation (like rgdal which used to be very popular) https://tmieno2.github.io/R-as-GIS-for-Economists/vector-basics.html https://bookdown.org/michael_bcalles/gis-crash-course-in-r/

There are several chapters in there to go over using sf to do basic GIS stuff in R. Otherwise, here's some extra code to help get some spatial plots going. It's pretty long and it's assuming you have the ratings files downloaded (easiest way is sftp, which Ilona has used before or you can reach out to me and I can get you going). From there, you need the ratings in their own folders all in a director called ratings. Let me know if you have any questions!

R Code for SF Plotting ``` r library(tidyverse) #Assumes all ratings files are in a folder ratings/daymet/ or ratings/prism/ for(j in c("daymet_stormvol","prism_stormvol","nldas_stormvol","daymet_simplelm","prism_simplelm","nldas_simplelm")){ print(j) #Set path to read in ratings based on data source of outer loop pathToread <- paste0("ratings/",j,"/") for( i in list.files(pathToread) ){ #For each file in the directory, read in the file i as a csv filei <- read.csv(paste0(pathToread,i)) #Store the gage number filei$gage <- gsub(".*_(\\d+)-.*","\\1",i) #Store the analysis type filei$workflow <- gsub(".*_","",j) #Keep only the necessary columns, depending on the workflow: if(filei$workflow[1] == "simplelm"){ filei <- filei[,c("mo","rating","gage","workflow")] }else{ filei <- filei[,c("mo","r_squared","gage","workflow")] } names(filei) <- c("mo","rating","gage","workflow") #Combine the file into a single data frame if(!exists("combineFile")){ combineFile <- filei }else{ combineFile <- rbind(combineFile,filei) } } #Assign to a specific variable and delete the generic combineFile assign(paste0('combineFile',j),combineFile) rm(combineFile) } #Join the daymet and prism data together joinData <- combineFileprism_stormvol %>% #Rename the rating for clarity select(prismRatingstorm = rating,gage,mo,workflow) %>% #Join in the dayment data, but first rename the rating column for clarity. #Join on gage, month, and workflow full_join(combineFiledaymet_stormvol %>% select(daymetRatingstorm = rating,gage,mo,workflow), by = c("gage","mo","workflow")) %>% full_join(combineFiledaymet_simplelm %>% select(daymetRatinglm = rating,gage,mo,workflow), by = c("gage","mo","workflow")) %>% #Add remaining prism data: full_join(combineFileprism_simplelm %>% select(prismRatinglm = rating,gage,mo,workflow), by = c("gage","mo","workflow")) %>% #Join in the nldas data, but first rename the rating column for clarity. #Join on gage, month, and workflow full_join(combineFilenldas_stormvol %>% select(nldasRatingstorm = rating,gage,mo,workflow), by = c("gage","mo","workflow")) %>% full_join(combineFilenldas_simplelm %>% select(nldasRatinglm = rating,gage,mo,workflow), by = c("gage","mo","workflow")) %>% #For easier viewing, combine lm and storm columns such that there is only one #column for prism, daymet, nldas classified by the workflow column mutate(prismRating = coalesce(prismRatingstorm,prismRatinglm), daymetRating = coalesce(daymetRatingstorm,daymetRatinglm), nldasRating = coalesce(nldasRatingstorm,nldasRatinglm) ) %>% #Remove now superflous columns: select(-prismRatingstorm,-prismRatinglm, -daymetRatingstorm,-daymetRatinglm, -nldasRatingstorm,-nldasRatinglm) %>% #Pivot it longer to have a column with the data source and one for the #ratings, for plotting ease pivot_longer(c(prismRating,daymetRating,nldasRating), names_to = 'dataSource', values_to = 'rating') %>% arrange(gage,mo,workflow) #At each gage, does the best performing data source change between workflows? #The below code is for ESTIMATES ONLY. The left_join assumes that the ratings #are unique between datasources for each gage, workflow, month. This is brittle #and could result in incorrect answers! gageCompare <- joinData %>% dplyr::ungroup() %>% #Group by gage, workflow, and month and find the max rating: dplyr::group_by(gage,workflow,mo) %>% dplyr::summarise(maxRating = max(rating,na.rm = TRUE)) %>% #Join the joinData df back in matching by gage, workflow, mo, and rating. This #could be problematic with duplicate ratings as in a case where all ratings #across the data sources are the same value left_join(joinData,by = c('gage','workflow','mo','maxRating' = 'rating')) %>% #Now pivot wider such that we can see ratings and best data sources side-by-side pivot_wider(names_from = workflow,values_from = c(maxRating,dataSource)) %>% #Now filter to only find instances where the best data sources are different: filter(dataSource_simplelm != dataSource_stormvol) %>% #Create a column showing the difference in rating and arrange by the difference mutate(differenceInRating = maxRating_simplelm - maxRating_stormvol) %>% arrange(differenceInRating) #Isolate one month gageCompareMonth <- gageCompare[gageCompare$mo == 10,] library(sf) #Get the watershed coverage from the server watershedGeo <- read.csv("http://deq1.bse.vt.edu:81/met/usgsGageWatershedGeofield.csv") #Get the gage numbers as their own field and store a copy of the data gageWatershedSF <- watershedGeo gageWatershedSF$gage <- gsub(".*_(\\d+)","\\1",gageWatershedSF$hydrocode) #Let's begin by plotting October daymet ratings for stormVol forPlot <- joinData[joinData$dataSource == 'daymetRating' & joinData$workflow == 'stormvol' & joinData$mo == 10,] #Join the geometry data onto out plot data joinDataSF <- forPlot %>% left_join(gageWatershedSF %>% select(-hydrocode), by = 'gage') %>% filter(!is.na(wkt)) #Create an SF object. Specify the coordinate system and the field name in the #data frame that contains the well-known text. In this case, WKT is the name of #the field with the polygon geometries joinDataSF <- st_as_sf(joinDataSF,wkt = 'wkt',crs = 4326) #Repair broken geometries joinDataSF <- st_make_valid(joinDataSF) #Add shape area in coordinate system units (likely meaningless in crs 4326) joinDataSF$area <- st_area(joinDataSF) #Order the data by largest to smallest area to make plotting more effective joinDataSF <- joinDataSF[order(joinDataSF$area,decreasing = TRUE),] #Plot SF polygons ggplot(joinDataSF) + geom_sf(aes(fill = rating)) + scale_fill_viridis_c(option = 'magma') + theme_classic() ```

The above code should generate the following image, which is showing October adjusted R squared values for daymet from the Storm Volume method: image

nathanielf22 commented 2 months ago

Thank you @COBrogan! I have been attempting to use the code, but having some issues with the creation of gageCompare, where it has all the dplyr steps. The ratings appear to be repeating, and it leads to the creation of list-cols that make the data frame unusable. Now, I'm trying to sftp the ratings again to be sure I have the right data, but I'm struggling to find the correct folder on the server to sftp from. Could you help point me in the right direction to get the data that works for this R script?

rburghol commented 2 months ago

@nathanielf22 -- no need for sftp, as all data should be available via web link with pattern: http://deq1.bse.vt.edu:81/met/[scenario]/

For example you can find the precip data and flow analyses for things that were run for scenario stormVol_nldas2 (storm volume analysis based on nkdas2 data) at - http://deq1.bse.vt.edu:81/met/stormVol_nldas2/

nathanielf22 commented 2 months ago

@rburghol, that makes sense. What about the simple lm data? The code includes those on the third line, but there isn't a folder labelled simplelm.

COBrogan commented 2 months ago

The simplelm data is in the folders labeled NLDAS, PRISM, and daymet. Those directories should be renamed eventually, but we haven’t done so yet. So, in ‘PRISM/stats/‘ should be the simplelm results. @rburghol i think ‘sftp’ is easier, unless there’s a way to read all those csv urls at once into R? So far, I’ve been telling everyone to just use sftp if they are analyzing ALL 1,080 csv files… If you agree, @nathanielf22 this data is in /media/model/met/ but you may need to use ‘ls’ to poke around the directories within there which is why the urls would be easier if there’s a way to get all of them. So, ‘/media/model/met/PRISM/out’ has the simple on stats

rburghol commented 2 months ago

@COBrogan I think that if we are at the point that we need to analyze thousands of files at the same time we're at the need to prioritize putting the analysis inside the workflow and setting summary metrics in the database.

Building local desktop workflows to analyze thousands of files that are downloaded via STP on the server we risk time wasted and trouble created with redundant analysis and download and then data sets coming out of date.

I think that we really should be focusing on small case study analysis right now rather than broadbrush global things. When we have analytical lenses that we think are worthy of repeating over 1000 files then we move forward.

As for the scenario names, for sure scenario nldas2 is actually that which has the nldas2 + simple lm data -- totally agree that we DO need to remedy this, and this will be a good thing to prioritize. It involves only creating a new .con file as a copy of nldas2.con. Then rerunning the workflows.

Interestingly on the server we have the .con files as well: http://deq1.bse.vt.edu:81/p6/vadeq/config/control/met/

COBrogan commented 2 months ago

Hmm. I agree that these metrics should be shifted into the db soon. I think we discussed making a data model sketch for that next week, but we talked about Nate beginning to create those workflows to analyze/QC precip data based on annual means, potential timestamp issues, and spatial visualization of precip annual data. This work will naturally become a workflow as it’s based on the precip files we’re generating, so no issues there. I believe the spatial visualization of mean precip for a year is a valuable tool. To test a workflow locally, you’d want to grab all existing coverage precip files. So, I think this work is still valuable as long as we keep in mind that precip file path or directory should be an input to the script. If we keep to one script = one purpose, then Nate will naturally develop scripts that create mean precip which will be a mandatory step in putting it in the db. Maybe we can chat a bit about this tomorrow so I can catch up on our intended structure, but I think Nate’s on the right track here to creating a dataset QC workflow that may become part of geo/process and geo/analyze

rburghol commented 2 months ago

@COBrogan I 100% agree that we will eventually want to see global trends in all sorts of variables, but the QA workflow step is at the single coverage level, not over the global data sets. I think I may have been too vague in my comments on this previously. I will elaborate below. @nathanielf22

In our workflows, everything operates at a coverage level, so in developing the QA steps we should:

  1. Write code to load load single coverage data sets (perhaps multiple components of the data, like precipitation, flow, model predictions…)
  2. calculate metrics for the single coverage described by the data set (mean, min annual, max, 90 day max/min, single hiurly max, daily max, ...)
  3. Create visualizations for that single coverage.
  4. Look for bad values, like NULLs, negatives, or perhaps values that exceed some threshold
  5. Do overall summaries that include record counts.
  6. Dive deep into the data set, trying to find places in the time series where our relationships are good or bad or just plain interesting. This is the kind of think Nate was showing yesterday and it's spot on but we need to go further and also make sure we are relating the files from a single scenario together (unless we are purposely comparing scenarios).
  7. Once we have something that works well, i.e. we have metrics that we think reflect the integrity of the data set, Then we can plug that directly into the workflow that will then get executed for different cases.
  8. We can also add in comparisons scripts between two scenarios for a single coverage, for example, the timestamp offset analysis that @COBrogan piloted before.

Otherwise, we spend time doing file management (sftp), and we write a code workflow that iterates through files rather than a standalone workflow to handle analyzing a single file, for a single coverage. Then we have to disentangle our batch code to operate standalone.

But we already have a robust set of code to allow us to retrieve all the metrics from all the scenarios at one single time in a very efficient manner om_vahydro_metric_grid, and we have a workflow and file structure to handle iterating over 1,000s of combos, we just have to get ourselves prepared to use it. And preparing to use it is exactly what is done by the single coverage/data set workflow components outlined above. And once we have them, we will insert them via REST, and then map to our hearts content.

COBrogan commented 1 month ago

Working on rerunning workflows simple_lm and stormVol to incorporate changes to JSON and calc_raster_ts. The following workflows need to be rerun:

ALL other methods are out of date!

ilonah22 commented 4 days ago

Update for Thanksgiving break:

On Tuesday of this week, I met with Connor to integrate some of the scripts I have been working on recently into the workflow, which went relatively smooth. There were a few issues that came up with some steps of the workflow that we tried to fix as we went.

My next step is going to be trying to split up the StormEventsLM_cmd.R script into multiple parts, because right now it's performing multiple functions in the same script, which we don't want for our final workflow.