Run model with medium observed sites

galengorski commented 2 years ago

I am going to add a model run for medium observed sites, see if I can produce results and troubleshoot issues while Jeff is still around :(

galengorski commented 2 years ago

I am first trying to run the 0_baseline_LSTM model before I add another model with the medium observed sites. When I run tar_make(p2a_metrics_files) I get the following error:

it looks like an environment issue to me with the 'snakemake' not found errors. This seems like it would be solved by running on tallgrass. I haven't worked with github and tallgrass together, @jsadler2, I'm wondering if you have suggestions on what a good workflow would be. Should I push the branch that I am working on now on my local up to the github repo and then pull it into tallgrass...?

jsadler2 commented 2 years ago

Yep. Pushing from local to GH and then pulling from GH to Tallgrass works great.

galengorski commented 2 years ago

Cool thanks, what is the best way to pull from gh on tallgrass. Do you do it through RStudio?

jsadler2 commented 2 years ago

You should clone your repo, then checkout your fork (though I don't see your fork listed, so I'll show what it would look like for mine). You can do this on the commandline. I'd recommend that over Rstudio.

cd /caldera/projects/usgs/water/impd/pump/drb-do
git clone git@github.com:jsadler2/drb-do-ml.git drb-do-ml-ggorski  # putting a suffix on the clone since it's a shared folder
git checkout [your branch]

galengorski commented 2 years ago

@lekoenig , here is a little update on this issue. I am running into some errors that I think are associated with the dimensions of the new NHD data compared to the older NHM inputs.

I am trying to run the 0_baseline_model with the new NHD fabric data for the well observed sites
I am able to run the p2a_well_obs_data target and it produces the output well_obs_io.zarr
One change I had to make was to change the input x variables from column names derived from NHM input to those derived from NHD input
- The old column names:
- x_vars: ['seg_ccov', 'seg_rain', 'seg_slope', 'seg_tave_air', 'seginc_swrad', 'hru_slope', 'hru_aspect', 'hru_elev', 'hru_area', 'hru_percent_imperv', 'covden_sum', 'covden_win', 'soil_moist_max']
- The new column names:
- x_vars: ['CAT_CNPY11_BUFF100','pr','SLOPE','tmmx','srad','CAT_BASIN_SLOPE','CAT_BASIN_AREA','CAT_IMPV11']
These can obviously be updated but I wanted to start with something
I then run tar_make(p2a_model_ids) for the 0_baseline_LSTM
Then I try to run the model with tar_make(p2a_metrics_files) and I get a long list of errors, the most relevant appears to be this one:

So it looks like there is a mismatch between input size and what is expected, but I am having trouble finding where this is stemming from. I'll keep digging, but I thought I would put this here as an update

jsadler2 commented 2 years ago

@galengorski - Could it be that the new data you are giving the 0_baseline_LSTM models to produce the results result in a different size matrix than what the models were originally trained with? I think this error has something to do with that ... that the data the models are trained with are different than the updated data you are using to make the predictions. I'm having a bit of a hard time troubleshooting beyond that. Can you tell me what you think those matrix sizes are? Is it [num_variables, num sites] or something like that? Also, it might be helpful to see more of that error traceback to see where in our code that is taking place. I'm 90% sure it's in the "predict" step, but it'd be nice to know for sure.

galengorski commented 2 years ago

Hey @jsadler2 ! That was my first thought too, so to quickly check, I used the same number of variables from the new driver set and I got the same error with the same dimensions in the error message. My first thought was that the dimensions were [num_variables,num_sites] at first as well, but now I am not so sure. Tallgrass is down for maintenance today, but I was going to dive back into this tomorrow (7/19), so I should have some more info then.

galengorski commented 2 years ago

Ok I think I got this working ok, PR: #139. One change that had to be made was updating the variable names as those were different from NHM to NHD and some of them didn't have a perfect match:

NHM_var -> NHD_var

seg_ccov -> ?? seg_rain -> pr seg_tave_air -> tmmx (need to change this, but using max temp for now) seginc_swrad -> srad hru_slope -> CAT_BASIN_SLOPE hru_aspect -> ?? hru_elev -> CAT_ELEV_MEAN hru_area -> CAT_BASIN_AREA hru_pct_impervious -> CAT_IMPV11 cov_den_sum -> CAT_CANOPY_BUFFER cov_den_win -> ?? soil_moist_max -> CAT_TWI

galengorski commented 2 years ago

Here is are the initial results from the switch over from NHM to NHD: nhm_nhd

Of the 6 val sites 4 showed improvement using the NHD fabric over NHM
the biggest improvements seem to be in do_max
the model seems to train about the same on the two fabrics but generalize better with NHD, this is consistent with the idea that the NHM-based model couldn't differentiate between sites on the same segment, in the way that the NHD-based model now can

lekoenig commented 2 years ago

One change that had to be made was updating the variable names as those were different from NHM to NHD and some of them didn't have a perfect match

Yeah, for the baseline model I tried to match up a "sparse" set of features that more or less match what we were using before. However, I didn't really have NHD-scale analogs for the two variables you have marked with ??? above. seg_ccov is an output variable from the process-based model PRMS-SNTemp and represents the area-weighted average cloud cover fraction for each segment. cov_den_win is a parameter in the national hydrologic model (NHM) that represents the "percent shaded in winter". I've always assumed that meant percent of the segment that is shaded, but maybe it's more likely that percent shaded is in reference to the catchment.

How easy is it to update these variable names in the code? For example, if we decided to add another feature like percent forest data, would that be easy to implement? Or would we have to tinker with the variable names again?

galengorski commented 2 years ago

This one compares the NHD-based model on well observed sites and medium observed sites: nhm_nhd_med_obs

even though there are only two new sites that add a total of 465 new observations to the ~54,000 that we already have in the model, it looks like it really improves the validation performance for 01472104 and 01481500
01472104 is on the upper Schuylkill and 01481500 is on Brandywine creek
both of the new sites 014721254 and 014721259 are at higher elevation (167 and 213 m(?)), while the highest elevation site in the well observed data was 98 m (?), but otherwise the new sites had similar static attributes to the existing sites
the two sites that showed the greatest improvement with adding in theses new sites ( 01472104 and 01481500) were not higher in elevation though, they had elevations of 55 and 59 m

galengorski commented 2 years ago

How easy is it to update these variable names in the code? For example, if we decided to add another feature like percent forest data, would that be easy to implement? Or would we have to tinker with the variable names again?

If new variables can get written to the well_obs_io.zarr (or med_obs_io.zarr) file through the p2a_well_obs_data target then it's really easy to adjust which variables that the model sees. I haven't dug around too much upstream of the p2a_well_obs_data target.

galengorski commented 2 years ago

These new sites make me think that we might do even better if we lowered our threshold for observations to include even more sites that only had a few observations...

lekoenig commented 2 years ago

These new sites make me think that we might do even better if we lowered our threshold for observations to include even more sites that only had a few observations...

Yeah, interesting. I have not re-pulled data from NWIS since June, but absent any big changes there, it looks like we'd grab a total of 18 sites if we cut our criteria at 50 days of data (we don't gain any sites by lowering that threshold further below 50 days). p2_med_observed_sites currently contains 16 sites and p2_well_observed_sites contains 14 sites.

> tar_load(p2_sites_w_segs)
> p2_sites_w_segs %>%
+   filter(count_days_nwis >= 50) %>%
+   pull(site_id) %>%
+   unique() %>%
+   length()
[1] 18
>

300 and 100 days is an arbitrary cutoff for defining well- and medium-observed sites, respectively, so I don't really see an argument against lowering our threshold to 50 days. You could even test that out locally by changing the threshold value in 2_process.R.

galengorski commented 2 years ago

I ran the model quickly with the threshold at 50 days like you suggested. It doesn't look like it has a huge effect, but it does seem to further draw down the RMSE values of do_max at the validation sites. It's also interesting how poorly that new site trains for do_max, I haven't dug into the new sites at all though. At some point, we'd have to see how robust these results are to different repetitions of the model. nhd_nhm_med_min_50

lekoenig commented 2 years ago

Thanks for looking at that! I took a quick look at those two sites - they're both sites in NJ with ~1.5 months of data in fall 2008. I've seen short deployments like that for other water quality constituents before, for example, if a WSC is conducting a focused study over some discrete period of time. No other site information is given on the NWIS page. Given how poorly 01482537 trains on do_max and to a lesser extent, do_mean, would you recommend sticking with our 100-day threshold for now?

galengorski commented 2 years ago

I think that's a good idea, I like 100 as a threshold.

USGS-R / drb-do-ml

Run model with medium observed sites #122