Closed janetrbarclay closed 3 years ago
I don't actually know if the adj. matrix is the problem that you are running into, but that is a problem. That problem (subsetting the adj. matrix) is actually addressed in an open PR: #120 . (see this line: https://github.com/USGS-R/river-dl/blob/ab337e5dab9e3b7ff6f333152304adb0ff322108/river_dl/preproc_utils.py#L872). Simon's provided comments on the PR, but I haven't had the time to look over them yet.
Nice! That approach is exactly the sort of thing I was thinking of. I'll try his fix on my fork and see if that solves my error.
using the fix from #120 (now I'm realizing that was your fix, not Simon's, sorry about that!) did solve my error, but along the way I noticed that with the DRB subset dataset, the rowcolnames in the distance matrix are not the seg_id_nat's (they are with the full DRB dataset) Is there a repo where that preprocessing is done where I should post that issue?
The full dataset
The subset dataset
Ah. Draticans. That preprocessing is done in the https://github.com/USGS-R/delaware-model-prep repo. It's a bit of a behemoth, that pipeline is. It might be easiest to just translate from whatever id those are (maybe subseg_id
?) to seg_id_nat
with the data that you have already and then overwrite that array in the .npz file. Does that make sense?
@aappling-usgs was the one who wrote the code to prep the distance matrices, so she may be able to better say whether it's better to try to address this upstream (in the pipeline) or just convert id's posthoc and overwrite.
yes, that makes sense. I don't actually need to use that from the subset so there's no rush on my behalf. The current best groundwater model is upstream of Montague, so I'll be subsetting from the full dataset. I was just using the subset dataset for prototyping b/c that works in pangeo.
One other thought that just occurred to me. As long as you are okay using the seg_id_nat
s, you should (as long as the code works like we think it is) be able to just pass in that distance matrix and the list of segments that you are working with. Since the full distance matrix has all of the possible sites you'd be looking at, you can just subset from that and the "distance_matrix_subset.npz" file becomes obsolete.
oh yeah, good thought. Thanks!
Looks like you have a solution, but responding to Jeff's call-out 3 messages ago: I think it'd be doable to relabel the subsetted distance matrix with seg_id_nats, but preferable to subset from the full distance matrix anyway because that's in the data release already. Also preferable because I'm supposed to be on leave already today and will definitely be on leave next week. Still, let me know if some edits after next week would be helpful.
In the
prep_data
in preproc_utils.py, it appears that the input data can be filtered to a subset of segments at runtime.https://github.com/USGS-R/river-dl/blob/0c78af242ca03010080bc7e43ca97d17cef6eda8/river_dl/preproc_utils.py#L503-L504
https://github.com/USGS-R/river-dl/blob/0c78af242ca03010080bc7e43ca97d17cef6eda8/river_dl/preproc_utils.py#L539-L540
Is that the intended functionality? When I try it I get an error during pretraining that the dimensions aren't equal and I'm wondering if that's because the segment filtering doesn't extend to the adjacency matrix. (one set of dimensions is the length of the shorter segment list and one is the length of the full segment list)
https://github.com/USGS-R/river-dl/blob/0c78af242ca03010080bc7e43ca97d17cef6eda8/river_dl/preproc_utils.py#L601
If adjusting the included segments at runtime was the intention and not filtering the adjacency matrix is the issue, I'd be happy to address that and do a PR.