I am first trying to run the 0_baseline_LSTM model before I add another model with the medium observed sites. When I run `tar_make(p2a_metrics_files)` I get the following error:
It looks like an environment issue to me, given the 'snakemake' not found errors. This seems like it would be solved by running on Tallgrass. I haven't worked with GitHub and Tallgrass together, @jsadler2, so I'm wondering if you have suggestions on what a good workflow would be. Should I push the branch I'm working on locally up to the GitHub repo and then pull it onto Tallgrass...?
Yep. Pushing from local to GH and then pulling from GH to Tallgrass works great.
Cool, thanks. What is the best way to pull from GitHub on Tallgrass? Do you do it through RStudio?
You should clone your repo, then check out your fork (though I don't see your fork listed, so I'll show what it would look like for mine). You can do this on the command line; I'd recommend that over RStudio.
cd /caldera/projects/usgs/water/impd/pump/drb-do
git clone git@github.com:jsadler2/drb-do-ml.git drb-do-ml-ggorski # putting a suffix on the clone since it's a shared folder
cd drb-do-ml-ggorski
git checkout [your branch]
@lekoenig, here is a little update on this issue. I am running into some errors that I think are associated with the dimensions of the new NHD data compared to the older NHM inputs. I can run the `p2a_well_obs_data` target and it produces the output `well_obs_io.zarr`, and I can run `tar_make(p2a_model_ids)` for the `0_baseline_LSTM`. But when I run `tar_make(p2a_metrics_files)` I get a long list of errors; the most relevant appears to be this one:
So it looks like there is a mismatch between input size and what is expected, but I am having trouble finding where this is stemming from. I'll keep digging, but I thought I would put this here as an update.
@galengorski - Could it be that the new data you are feeding the `0_baseline_LSTM` models to produce the results has a different matrix size than what the models were originally trained with? I think this error has something to do with that ... that the data the models were trained with are different from the updated data you are using to make the predictions. I'm having a bit of a hard time troubleshooting beyond that. Can you tell me what you think those matrix sizes are? Is it [num_variables, num_sites] or something like that? Also, it might be helpful to see more of that error traceback to see where in our code this is happening. I'm 90% sure it's in the "predict" step, but it'd be nice to know for sure.
Hey @jsadler2! That was my first thought too, so as a quick check I used the same number of variables from the new driver set and got the same error with the same dimensions in the error message. I also assumed the dimensions were [num_variables, num_sites], but now I am not so sure. Tallgrass is down for maintenance today, but I was going to dive back into this tomorrow (7/19), so I should have some more info then.
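In the meantime, one quick way to check what the model is actually being handed would be to open the zarr store directly and print its dimensions. A rough sketch, assuming xarray and zarr are installed and that `well_obs_io.zarr` sits in the working directory; the dimension and variable names are whatever the pipeline wrote, so treat anything printed here as a placeholder:

```python
# Minimal sketch: inspect the prepared model inputs so they can be compared
# against the matrix size reported in the error. Assumes xarray + zarr are
# installed and that well_obs_io.zarr is in the working directory.
import xarray as xr

ds = xr.open_zarr("well_obs_io.zarr")
print(ds)                 # lists every variable and its dimensions
print(dict(ds.sizes))     # dimension lengths, e.g. number of sites and dates
print(len(ds.data_vars))  # number of variables the model would see
```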
OK, I think I got this working; PR: #139. One change that had to be made was updating the variable names, as those were different from NHM to NHD and some of them didn't have a perfect match:
| NHM_var | NHD_var |
| --- | --- |
| `seg_ccov` | ?? |
| `seg_rain` | `pr` |
| `seg_tave_air` | `tmmx` (need to change this, but using max temp for now) |
| `seginc_swrad` | `srad` |
| `hru_slope` | `CAT_BASIN_SLOPE` |
| `hru_aspect` | ?? |
| `hru_elev` | `CAT_ELEV_MEAN` |
| `hru_area` | `CAT_BASIN_AREA` |
| `hru_pct_impervious` | `CAT_IMPV11` |
| `cov_den_sum` | `CAT_CANOPY_BUFFER` |
| `cov_den_win` | ?? |
| `soil_moist_max` | `CAT_TWI` |
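For what it's worth, once the mapping above is written down, carrying it through a feature list is just a dictionary lookup. A rough sketch only, not the actual pipeline code; the old feature list shown here is an example and the real renaming/config lives elsewhere in the repo:

```python
# Sketch of translating an NHM-based feature list to NHD names using the
# mapping above. Illustrative only; the real feature list lives in the
# model/pipeline config.
nhm_to_nhd = {
    "seg_rain": "pr",
    "seg_tave_air": "tmmx",          # imperfect match: using max temp for now
    "seginc_swrad": "srad",
    "hru_slope": "CAT_BASIN_SLOPE",
    "hru_elev": "CAT_ELEV_MEAN",
    "hru_area": "CAT_BASIN_AREA",
    "hru_pct_impervious": "CAT_IMPV11",
    "cov_den_sum": "CAT_CANOPY_BUFFER",
    "soil_moist_max": "CAT_TWI",
    # seg_ccov, hru_aspect, and cov_den_win have no NHD analog yet
}

old_features = ["seg_ccov", "seg_rain", "seg_tave_air", "seginc_swrad",
                "hru_slope", "hru_aspect", "hru_elev", "hru_area",
                "hru_pct_impervious", "cov_den_sum", "cov_den_win",
                "soil_moist_max"]

# swap in NHD names, dropping the variables that have no analog yet
new_features = [nhm_to_nhd[v] for v in old_features if v in nhm_to_nhd]
print(new_features)
```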
Here are the initial results from the switchover from NHM to NHD:
> One change that had to be made was updating the variable names as those were different from NHM to NHD and some of them didn't have a perfect match
Yeah, for the baseline model I tried to match up a "sparse" set of features that more or less match what we were using before. However, I didn't really have NHD-scale analogs for the two variables you have marked with ?? above. `seg_ccov` is an output variable from the process-based model PRMS-SNTemp and represents the area-weighted average cloud cover fraction for each segment. `cov_den_win` is a parameter in the National Hydrologic Model (NHM) that represents the "percent shaded in winter". I've always assumed that meant the percent of the segment that is shaded, but it may be that the percent shaded refers to the catchment instead.
How easy is it to update these variable names in the code? For example, if we decided to add another feature like percent forest data, would that be easy to implement? Or would we have to tinker with the variable names again?
This one compares the NHD-based model on well-observed sites and medium-observed sites:

- `01472104` and `01481500` are among the new sites: `01472104` is on the upper Schuylkill and `01481500` is on Brandywine Creek
- `014721254` and `014721259` are at higher elevation (167 and 213 m (?)), while the highest-elevation site in the well-observed data was 98 m (?); otherwise the new sites had similar static attributes to the existing sites
- `01472104` and `01481500` were not higher in elevation though; they had elevations of 55 and 59 m

> How easy is it to update these variable names in the code? For example, if we decided to add another feature like percent forest data, would that be easy to implement? Or would we have to tinker with the variable names again?
If new variables can get written to the `well_obs_io.zarr` (or `med_obs_io.zarr`) file through the `p2a_well_obs_data` target, then it's really easy to adjust which variables the model sees. I haven't dug around too much upstream of the `p2a_well_obs_data` target.
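Roughly speaking, once a variable is in the zarr store, pointing the model at it just means editing the feature list that gets pulled out of that dataset. A hypothetical sketch below; the feature and target names are examples, and the real config/feature handling in the repo may look different:

```python
# Hypothetical sketch of selecting model inputs from the prepared zarr store.
# Feature names are examples only; the real list lives in the pipeline config.
import xarray as xr

features = ["pr", "tmmx", "srad", "CAT_BASIN_SLOPE", "CAT_ELEV_MEAN"]
targets = ["do_mean", "do_max"]

ds = xr.open_zarr("well_obs_io.zarr")
x = ds[features]  # adding a new feature (e.g., percent forest) would just mean
y = ds[targets]   # appending its name here, once it has been written to the zarr
```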
These new sites make me think that we might do even better if we lowered our threshold for observations to include even more sites that only had a few observations...
> These new sites make me think that we might do even better if we lowered our threshold for observations to include even more sites that only had a few observations...
Yeah, interesting. I have not re-pulled data from NWIS since June, but absent any big changes there, it looks like we'd grab a total of 18 sites if we cut our criteria at 50 days of data (we don't gain any sites by lowering that threshold further below 50 days). `p2_med_observed_sites` currently contains 16 sites and `p2_well_observed_sites` contains 14 sites.
> tar_load(p2_sites_w_segs)
> p2_sites_w_segs %>%
+ filter(count_days_nwis >= 50) %>%
+ pull(site_id) %>%
+ unique() %>%
+ length()
[1] 18
>
300 and 100 days are arbitrary cutoffs for defining well- and medium-observed sites, respectively, so I don't really see an argument against lowering our threshold to 50 days. You could even test that out locally by changing the threshold value in `2_process.R`.
I ran the model quickly with the threshold at 50 days like you suggested. It doesn't look like it has a huge effect, but it does seem to further draw down the RMSE values of `do_max` at the validation sites. It's also interesting how poorly that new site trains for `do_max`; I haven't dug into the new sites at all, though. At some point, we'd have to see how robust these results are to different repetitions of the model.
Thanks for looking at that! I took a quick look at those two sites - they're both sites in NJ with ~1.5 months of data in fall 2008. I've seen short deployments like that for other water quality constituents before, for example, when a WSC is conducting a focused study over some discrete period of time. No other site information is given on the NWIS page. Given how poorly 01482537 trains on `do_max` and, to a lesser extent, `do_mean`, would you recommend sticking with our 100-day threshold for now?
I think that's a good idea, I like 100 as a threshold.
I am going to add a model run for the medium-observed sites and see if I can produce results and troubleshoot issues while Jeff is still around :(