running on a subset of segments

USGS-R / river-dl

Deep learning model for predicting environmental variables on river systems

Creative Commons Zero v1.0 Universal

21 stars 15 forks source link

running on a subset of segments #121

Closed janetrbarclay closed 3 years ago

janetrbarclay commented 3 years ago

In the prep_data in preproc_utils.py, it appears that the input data can be filtered to a subset of segments at runtime.

https://github.com/USGS-R/river-dl/blob/0c78af242ca03010080bc7e43ca97d17cef6eda8/river_dl/preproc_utils.py#L503-L504

https://github.com/USGS-R/river-dl/blob/0c78af242ca03010080bc7e43ca97d17cef6eda8/river_dl/preproc_utils.py#L539-L540

Is that the intended functionality? When I try it I get an error during pretraining that the dimensions aren't equal and I'm wondering if that's because the segment filtering doesn't extend to the adjacency matrix. (one set of dimensions is the length of the shorter segment list and one is the length of the full segment list)

https://github.com/USGS-R/river-dl/blob/0c78af242ca03010080bc7e43ca97d17cef6eda8/river_dl/preproc_utils.py#L601

If adjusting the included segments at runtime was the intention and not filtering the adjacency matrix is the issue, I'd be happy to address that and do a PR.

jsadler2 commented 3 years ago

I don't actually know if the adj. matrix is the problem that you are running into, but that is a problem. That problem (subsetting the adj. matrix) is actually addressed in an open PR: #120 . (see this line: https://github.com/USGS-R/river-dl/blob/ab337e5dab9e3b7ff6f333152304adb0ff322108/river_dl/preproc_utils.py#L872). Simon's provided comments on the PR, but I haven't had the time to look over them yet.

janetrbarclay commented 3 years ago

Nice! That approach is exactly the sort of thing I was thinking of. I'll try his fix on my fork and see if that solves my error.

janetrbarclay commented 3 years ago

using the fix from #120 (now I'm realizing that was your fix, not Simon's, sorry about that!) did solve my error, but along the way I noticed that with the DRB subset dataset, the rowcolnames in the distance matrix are not the seg_id_nat's (they are with the full DRB dataset) Is there a repo where that preprocessing is done where I should post that issue?

The full dataset

The subset dataset

jsadler2 commented 3 years ago

Ah. Draticans. That preprocessing is done in the https://github.com/USGS-R/delaware-model-prep repo. It's a bit of a behemoth, that pipeline is. It might be easiest to just translate from whatever id those are (maybe subseg_id?) to seg_id_nat with the data that you have already and then overwrite that array in the .npz file. Does that make sense? @aappling-usgs was the one who wrote the code to prep the distance matrices, so she may be able to better say whether it's better to try to address this upstream (in the pipeline) or just convert id's posthoc and overwrite.

janetrbarclay commented 3 years ago

yes, that makes sense. I don't actually need to use that from the subset so there's no rush on my behalf. The current best groundwater model is upstream of Montague, so I'll be subsetting from the full dataset. I was just using the subset dataset for prototyping b/c that works in pangeo.

jsadler2 commented 3 years ago

One other thought that just occurred to me. As long as you are okay using the seg_id_nats, you should (as long as the code works like we think it is) be able to just pass in that distance matrix and the list of segments that you are working with. Since the full distance matrix has all of the possible sites you'd be looking at, you can just subset from that and the "distance_matrix_subset.npz" file becomes obsolete.

janetrbarclay commented 3 years ago

oh yeah, good thought. Thanks!

aappling-usgs commented 3 years ago

Looks like you have a solution, but responding to Jeff's call-out 3 messages ago: I think it'd be doable to relabel the subsetted distance matrix with seg_id_nats, but preferable to subset from the full distance matrix anyway because that's in the data release already. Also preferable because I'm supposed to be on leave already today and will definitely be on leave next week. Still, let me know if some edits after next week would be helpful.