HARPgroup / model_meteorology


Deep Dive segments #83

Open rburghol opened 3 months ago

rburghol commented 3 months ago

Segments that need in-depth QA due to large errors, missing data, or peculiar interest. @mwdunlap2004 @COBrogan @ilonah22

- usgs_ws_02071000 Dan River at Wentworth NC
- L51079

[image]

rburghol commented 2 months ago

Hey @COBrogan, @ilonah22, @nathanielf22 and @mwdunlap2004 -- the batch run that Michael did last night cruised along and completed the simple LM for many, many segments, but it has been stalled for almost 12 hours on the Dan River at Wentworth NC (usgs_ws_02071000). I am 99% sure this is the same segment that stalled when Connor ran it a couple of weeks ago. It is a fairly large area, so it could simply be too big; however, since this same segment has come up twice, and not some other large segment like the James River at Richmond, I am wondering if there might be some hinkiness in the geometry that is causing trouble.

Other than that thought, I have no real ideas on how to debug it -- we can check ST_IsValid() on the geometry, I suppose? I will note that I tried accessing that shape in the vahydro watershed browser (which runs a spatial containment query) and it has been running for several minutes without returning results. However, I can view the Dan River at South Boston (taking less than 15 seconds), which I think contains the Wentworth segment, further bolstering the bad-geometry hypothesis.
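For reference, a minimal sketch of that validity check run from R, assuming a PostGIS connection; the table name `watershed_features` and columns `hydrocode`/`geom` are placeholders, not the actual vahydro schema:

```r
#A minimal sketch of the geometry check, run from R via DBI. The table
#"watershed_features" and columns "hydrocode"/"geom" are assumptions for
#illustration; the connection details are placeholders too.
library(DBI)
library(RPostgres)
con <- dbConnect(RPostgres::Postgres(), dbname = "model") # placeholder connection
dbGetQuery(con, "
  SELECT hydrocode,
         ST_IsValid(geom)       AS is_valid,
         ST_IsValidReason(geom) AS reason,
         ST_NPoints(geom)       AS n_points
  FROM   watershed_features
  WHERE  hydrocode = 'usgs_ws_02071000';
")
```

If it comes back invalid, ST_MakeValid() is the standard PostGIS repair; a huge n_points count relative to peer segments would instead point at geometry complexity rather than invalidity.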

Anyhow, I did not yet kill Michael's job on that segment, but I think we can check in on it later today and give it the adios if it hasn't completed -- queries that run perpetually can have a very bad impact on overall system performance.

mwdunlap2004 commented 2 months ago

Another item of note: we have 156 nldas2 gages, 182 PRISM gages, and 184 daymet gages. I'm not sure why the numbers differ, but I will say that for at least one of the datasets I somehow managed to download storm_vol results for at least one gage.

COBrogan commented 2 months ago

@mwdunlap2004 At least some of the missing NLDAS gages are a result of the issues we've been having with st_clip. I looked into a few, but others may be failing for other reasons. The PRISM and daymet numbers are akin to those run via the storm volume method (and it's possible the 2 additional daymet gages failed for PRISM due to st_clip, but that seems less likely). We will want to take a close look at some of those that failed, i.e. run the workflow step by step and identify the step that is failing.

The storm_vol results you downloaded from PRISM were generated months ago and must have been an early test. I have deleted them from /media/model/met/PRISM/out/ to prevent that issue in the future.

Here are some early images to help visualize the results @nathanielf22 @ilonah22. First, all data (this one is a little hard to read): [image]

Now just simple lm: [image]

Now just storm volume: [image]

There are 901 instances in which the two methods predict different "best" data sources for a given gage and month. The rating differences between these "best" data sets range from -100% to 50% (with a negative value indicating much better performance by the storm volume method). Code is below, but you'll need to tweak the file paths near the top of the script: it assumes you have a folder for each data source/workflow combination named in the `for` loop, all under the `ratings/` directory.

**Code for plots and gage comparisons**

```r
library(tidyverse)
#Assumes all ratings files are in a folder ratings/daymet/ or ratings/prism/
for(j in c("daymet_stormvol","prism_stormvol","nldas_stormvol",
           "daymet_simplelm","prism_simplelm","nldas_simplelm")){
  print(j)
  #Set path to read in ratings based on data source of outer loop
  pathToread <- paste0("ratings/",j,"/")
  for( i in list.files(pathToread) ){
    #For each file in the directory, read in the file i as a csv
    filei <- read.csv(paste0(pathToread,i))
    #Store the gage number
    filei$gage <- gsub(".*_(\\d+)-.*","\\1",i)
    #Store the analysis type
    filei$workflow <- gsub(".*_","",j)
    #Keep only the necessary columns, depending on the workflow:
    if(filei$workflow[1] == "simplelm"){
      filei <- filei[,c("mo","rating","gage","workflow")]
    }else{
      filei <- filei[,c("mo","r_squared","gage","workflow")]
    }
    names(filei) <- c("mo","rating","gage","workflow")
    #Combine the file into a single data frame
    if(!exists("combineFile")){
      combineFile <- filei
    }else{
      combineFile <- rbind(combineFile,filei)
    }
  }
  #Assign to a specific variable and delete the generic combineFile
  assign(paste0('combineFile',j),combineFile)
  rm(combineFile)
}

#Join the daymet and prism data together
joinData <- combineFileprism_stormvol %>%
  #Rename the rating for clarity
  select(prismRatingstorm = rating,gage,mo,workflow) %>%
  #Join in the daymet data, but first rename the rating column for clarity.
  #Join on gage, month, and workflow
  full_join(combineFiledaymet_stormvol %>%
              select(daymetRatingstorm = rating,gage,mo,workflow),
            by = c("gage","mo","workflow")) %>%
  full_join(combineFiledaymet_simplelm %>%
              select(daymetRatinglm = rating,gage,mo,workflow),
            by = c("gage","mo","workflow")) %>%
  #Add remaining prism data:
  full_join(combineFileprism_simplelm %>%
              select(prismRatinglm = rating,gage,mo,workflow),
            by = c("gage","mo","workflow")) %>%
  #Join in the nldas data, but first rename the rating column for clarity.
  #Join on gage, month, and workflow
  full_join(combineFilenldas_stormvol %>%
              select(nldasRatingstorm = rating,gage,mo,workflow),
            by = c("gage","mo","workflow")) %>%
  full_join(combineFilenldas_simplelm %>%
              select(nldasRatinglm = rating,gage,mo,workflow),
            by = c("gage","mo","workflow")) %>%
  #For easier viewing, combine lm and storm columns such that there is only one
  #column for prism, daymet, nldas classified by the workflow column
  mutate(prismRating = coalesce(prismRatingstorm,prismRatinglm),
         daymetRating = coalesce(daymetRatingstorm,daymetRatinglm),
         nldasRating = coalesce(nldasRatingstorm,nldasRatinglm)) %>%
  #Remove now superfluous columns:
  select(-prismRatingstorm,-prismRatinglm,
         -daymetRatingstorm,-daymetRatinglm,
         -nldasRatingstorm,-nldasRatinglm) %>%
  #Pivot it longer to have a column with the data source and one for the
  #ratings, for plotting ease
  pivot_longer(c(prismRating,daymetRating,nldasRating),
               names_to = 'dataSource',values_to = 'rating') %>%
  arrange(gage,mo,workflow)

#Plot the data (storm volume)
joinData %>%
  filter(workflow == "stormvol") %>%
  ggplot() +
  #Box plot, with month on the x-axis (note that it must be a factor) and
  #rating on the y-axis. Create separate boxes for each dataSource via color
  geom_boxplot(aes(as.factor(mo),rating,color = dataSource)) +
  xlab(element_blank()) +
  ylab("Adjusted R^2") +
  #Limit the axis between 0 - 1, removing negative adjusted R squared values
  coord_cartesian(ylim = c(0,1)) +
  #Set a classic theme and give some gridlines for ease of reference
  theme_classic() +
  theme(panel.grid.major.y = element_line(linetype = 3, color = "grey50")) +
  #Color the boxes
  scale_color_manual(values = c("dodgerblue3","violetred4","black"))

#Plot the data (simplelm)
joinData %>%
  filter(workflow == "simplelm") %>%
  ggplot() +
  #Box plot, with month on the x-axis (note that it must be a factor) and
  #rating on the y-axis. Create separate boxes for each dataSource via color
  geom_boxplot(aes(as.factor(mo),rating,color = dataSource)) +
  xlab(element_blank()) +
  ylab("Adjusted R^2") +
  #Limit the axis between 0 - 1, removing negative adjusted R squared values
  coord_cartesian(ylim = c(0,1)) +
  #Set a classic theme and give some gridlines for ease of reference
  theme_classic() +
  theme(panel.grid.major.y = element_line(linetype = 3, color = "grey50")) +
  #Color the boxes
  scale_color_manual(values = c("dodgerblue3","violetred4","black"))

#Plot all the data (both workflows)
joinData %>%
  ggplot() +
  #Box plot, with month on the x-axis (note that it must be a factor) and
  #rating on the y-axis. Create separate boxes for each dataSource/workflow
  geom_boxplot(aes(as.factor(mo),rating,color = paste0(dataSource,workflow))) +
  xlab(element_blank()) +
  ylab("Adjusted R^2") +
  #Limit the axis between 0 - 1, removing negative adjusted R squared values
  coord_cartesian(ylim = c(0,1)) +
  #Set a classic theme and give some gridlines for ease of reference
  theme_classic() +
  theme(panel.grid.major.y = element_line(linetype = 3, color = "grey50")) +
  #Color the boxes
  scale_color_manual(values = c("dodgerblue3","steelblue","violetred4","pink4",
                                "black","grey50"))

#At each gage, does the best performing data source change between workflows?
#The below code is for ESTIMATES ONLY. The left_join assumes that the ratings
#are unique between data sources for each gage, workflow, and month. This is
#brittle and could result in incorrect answers!
gageCompare <- joinData %>%
  dplyr::ungroup() %>%
  #Group by gage, workflow, and month and find the max rating:
  dplyr::group_by(gage,workflow,mo) %>%
  dplyr::summarise(maxRating = max(rating,na.rm = TRUE)) %>%
  #Join the joinData df back in, matching by gage, workflow, mo, and rating.
  #This could be problematic with duplicate ratings, as in a case where all
  #ratings across the data sources are the same value
  left_join(joinData,by = c('gage','workflow','mo','maxRating' = 'rating')) %>%
  #Now pivot wider such that we can see ratings and best data sources side-by-side
  pivot_wider(names_from = workflow,values_from = c(maxRating,dataSource)) %>%
  #Now filter to only find instances where the best data sources are different:
  filter(dataSource_simplelm != dataSource_stormvol) %>%
  #Create a column showing the difference in rating and arrange by the difference
  mutate(differenceInRating = maxRating_simplelm - maxRating_stormvol) %>%
  arrange(differenceInRating)
```
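On the brittleness flagged in the comments above: a tie-safe variant is sketched below using `dplyr::slice_max()`, which keeps the top-rated row per gage/workflow/month directly instead of matching on the rating value. It assumes the `joinData` frame built above; `bestPerGroup` is just an illustrative name.

```r
#A sketch of a tie-safe alternative to the max()/left_join pattern above,
#assuming the joinData frame from the previous block. slice_max() selects the
#highest-rated row per group directly, so duplicate rating values cannot
#multiply rows in a join; with_ties = FALSE breaks exact ties arbitrarily.
bestPerGroup <- joinData %>%
  dplyr::ungroup() %>%
  dplyr::group_by(gage, workflow, mo) %>%
  dplyr::slice_max(rating, n = 1, with_ties = FALSE) %>%
  dplyr::ungroup() %>%
  tidyr::pivot_wider(names_from = workflow,
                     values_from = c(rating, dataSource)) %>%
  dplyr::filter(dataSource_simplelm != dataSource_stormvol) %>%
  dplyr::mutate(differenceInRating = rating_simplelm - rating_stormvol) %>%
  dplyr::arrange(differenceInRating)
```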
rburghol commented 2 months ago

Thanks for the review on these results @COBrogan !!

I would add that "best" in my mind refers not to the analysis method but to the data source. To that point, I think it is quite interesting that for a handful of months the simple lm shows large differences between NLDAS2 and prism/daymet (April, May, November, December). This is intriguing since the goal is not to get the best R^2 but, in a sense, to find where the data sources differentiate the most. Of course, it may very well be that the difference in those datasets shown by lm is a spurious artifact of the method itself (or some data error!), but only time will tell -- I think we should add some of the USGS gages with big differences there into this deep dive list and begin testing them by running models as well as plotting out what exactly is happening. (Maybe we can start thinking about monthly flow error calcs between model and observed to bolster this? A rough sketch follows below.)
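For the monthly error idea, a minimal sketch under assumed inputs: a data frame `flows` with columns `date`, `Qobs`, and `Qsim`. None of these names come from the existing workflow, so adapt them to whatever the model runs produce.

```r
#A minimal sketch of monthly model-vs-observed flow error stats, assuming a
#hypothetical data frame "flows" with columns date, Qobs (observed flow), and
#Qsim (modeled flow); the names are illustrative only.
library(tidyverse)
library(lubridate)
monthlyError <- flows %>%
  mutate(mo = month(date)) %>%
  group_by(mo) %>%
  summarise(
    #Percent bias: positive values mean the model overpredicts monthly volume
    pctBias = 100 * sum(Qsim - Qobs, na.rm = TRUE) / sum(Qobs, na.rm = TRUE),
    #Root mean squared error, in the same units as the flows
    rmse = sqrt(mean((Qsim - Qobs)^2, na.rm = TRUE))
  )
```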

To better enable us to answer these questions, we need to get these into the REST database so that they are accessible with om_vahydro_metric_grid, facilitating always-up-to-date comparisons. I'll take the lead on getting everyone up to speed and showing some template scripts in the meta_model workflow that I have been using for the WDM creation path.

mwdunlap2004 commented 2 months ago

@rburghol @COBrogan I created plots for 8 of the 10 gages Ilona had previously selected that ran for nldas2 last night, and this matches my results: nldas2 seems to be consistently worse than daymet and PRISM. Daymet and PRISM also seem very similar to each other; I didn't notice many differences in their results, and in many cases their lines were eerily similar. I'm going to run some summary statistics on these gages after I talk to Dr. Scott today about how to handle nldas2 missing 2 gages (in regards to my presentation). In the long term, I think improving the clipping methods could help with this?

rburghol commented 2 months ago

@mwdunlap2004 -- the clipping methods will certainly fix this. I am 99.9% certain these are just the result of the overlap algorithm that we are using, and the fix is either to resample or to use the polygon intersection method -- both of which work, but performance is an issue. So it is totally cool to have cases where your data is not working out, for reasons of resolution -- a very good point to make to the audience imo (@COBrogan may have other thoughts).
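To make the polygon-intersection idea concrete, here is a rough sketch of the area-weighting query run from R, with hypothetical table and column names (`met_grid_cells`, `watershed_features`, `geom`); the real schema, and the resample alternative, would look different.

```r
#A rough sketch of polygon-intersection (area-weighted) coverage. Table names
#"met_grid_cells" and "watershed_features" and their columns are assumptions
#for illustration only. Each grid cell gets a weight equal to the fraction of
#the watershed it overlaps; weighted averages of cell precip then give the
#watershed mean without the resolution artifacts of a simple clip.
library(DBI)
library(RPostgres)
con <- dbConnect(RPostgres::Postgres(), dbname = "model") # placeholder connection
cellWeights <- dbGetQuery(con, "
  SELECT w.hydrocode, g.cell_id,
         ST_Area(ST_Intersection(g.geom, w.geom)) / ST_Area(w.geom) AS weight
  FROM   met_grid_cells g
  JOIN   watershed_features w
    ON   ST_Intersects(g.geom, w.geom)
  WHERE  w.hydrocode = 'usgs_ws_02071000';
")
```

ST_Intersection is the expensive call here, which matches the performance concern above; precomputing and caching the weights per segment would amortize that cost across model runs.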