DOI-USGS / lake-temperature-model-prep


mega issue: use primary identifier as NHD HR #53

Closed jordansread closed 4 years ago

jordansread commented 5 years ago

This is a conversation we've been having for a while. We don't yet have a full picture of the effort required to pivot away from the NHD medium-res lake shapes that appear in the Winslow et al. paper. Starting from that data release and expanding was a known shortcut at the time. Now we are realizing that it is limiting us in ways we did not expect, and moving to HR may become a priority sooner than planned.

This issue is meant to capture discussion as we learn about the level of effort needed here. Welcome @jzwart @limnoliver

jordansread commented 5 years ago

Unclear if this is comprehensive or how it fits w/ the data in lakeattributes, but this may be the Winslow depth sources, this may be all/part of secchi, and here are the processed NLCD data.

jordansread commented 5 years ago

1_crosswalk_fetch

This is where we currently pull in a bunch of different shapefiles and also fetch pre-calculated crosswalk tables. We fetch the Winslow shapefile.

Instead, we'll need to fetch raw NHD_HR polygons and drop the fetching of any pre-canned crosswalks, since those are all based on the medium-res lakes.
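As a starting point, here's a minimal sketch of a per-state fetcher, assuming we download the staged state geodatabases and read the waterbody layer with sf (the URL handling, temp directory, and "NHDWaterbody" layer name are assumptions to verify against the actual staged products):

```r
library(sf)

# Hypothetical per-state fetcher; `gdb_url` would point at a staged NHD HR
# state geodatabase zip (naming here is illustrative, not a confirmed path)
fetch_nhdhr_waterbodies <- function(gdb_url, dest_dir = "1_crosswalk_fetch/tmp") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  zip_file <- file.path(dest_dir, basename(gdb_url))
  download.file(gdb_url, zip_file, mode = "wb")
  unzip(zip_file, exdir = dest_dir)
  gdb <- list.files(dest_dir, pattern = "\\.gdb$", full.names = TRUE)[1]
  # NHDWaterbody is the waterbody polygon layer in NHD HR state geodatabases
  st_read(gdb, layer = "NHDWaterbody", quiet = TRUE)
}
```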

2_crosswalk_munge

These are methods that take two shapefiles and figure out join IDs, or do point-in-polygon analysis. This is where we build new crosswalk tables, in contrast to the canned ones in 1_. We also buffer the lake shapefiles in this step; not sure why that lives here, but I think I'm the one who put it here.

We'll probably do similar things here, aside from changing some of the source files to correspond to the changes in 1_.
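The point-in-polygon piece of those crosswalk builds could stay a thin wrapper around sf. A sketch, with the WQP and NHD HR column names (MonitoringLocationIdentifier, Permanent_Identifier) assumed rather than confirmed:

```r
library(sf)
library(dplyr)

# Sketch of a point-in-polygon crosswalk: WQP sites joined to NHD HR waterbodies.
# Column names are assumptions; confirm against the fetched objects.
crosswalk_points_in_polys <- function(sites_sf, waterbodies_sf) {
  sites_sf %>%
    st_transform(st_crs(waterbodies_sf)) %>%
    st_join(select(waterbodies_sf, Permanent_Identifier), join = st_within) %>%
    st_drop_geometry() %>%
    filter(!is.na(Permanent_Identifier)) %>%
    distinct(MonitoringLocationIdentifier, Permanent_Identifier)
}
```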

3_params_fetch

This is where we get all kinds of lake-specific params, like NLCD, bathy, depths, clarity, etc.

With the change away from Winslow as the source of pre-compiled attributes/params, we'll probably need more here.

4_params_munge

These are munged attributes that would now be linked to the canonical ID (previously, nhd_{ID})

We're probably moving away from lakeattributes, so naming targets after that package (e.g., 4_params_munge/out/lakeattributes_area.rds.ind) should probably (?) change. Changes in this phase are probably otherwise minimal; a hypothetical shape for a renamed target is sketched below.
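Purely to illustrate the linkage to the canonical ID (all table, column, and file names below are hypothetical, not actual targets):

```r
library(dplyr)

# Hypothetical munge step: attach the canonical site_id to a fetched depth table
# via a crosswalk built in 2_; names are illustrative only
munge_lake_depths <- function(raw_depths, nhdhr_crosswalk) {
  raw_depths %>%
    inner_join(nhdhr_crosswalk, by = "source_id") %>%
    transmute(site_id, z_max_m = as.numeric(max_depth_m)) %>%
    distinct()
}
# ...and saved as something like 4_params_munge/out/depths.rds(.ind) rather than
# 4_params_munge/out/lakeattributes_area.rds.ind
```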

6 and 7 (drivers)

These should be fine, as they would update w/ the shapes and centroids coming out of 1_

7a_temp_coop_munge

This uses crosswalks to connect coop data to our canonical IDs.

We'll need to adjust these w/ changes to crosswalks.

8_viz

Like 7a, this uses the crosswalks to connect coop data to our canonical IDs, so we'll need to adjust it as the crosswalks change.

limnoliver commented 5 years ago

RE: 4_params_munge -- yes, many of these will be the same, but they will take on new names that don't imply they're being formatted for lakeattributes. There will also need to be additional targets in this step that pull in data currently residing in lakeattributes (see Jordan's comments above with links to data).

jordansread commented 5 years ago

I've got a working fetcher/processor for the NHD files. But the Permanent_ field, which is what I'm pretty sure we want for site_id, has some goofy IDs. I don't think this will cause issues, but it is something to keep an eye out for. Some IDs look like {0014DC77-4688-435F-9EFA-7F056F47D349} compared to the more common 120017988 format.

sf_waterbodies$Permanent_ %>% as.character %>% nchar %>% table
.
    8     9    36    38 
46146 63277   489  2212 
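A quick diagnostic (not pipeline code) to split the GUID-style values out from the plain integer IDs and eyeball both:

```r
# Separate GUID-style Permanent_ values (e.g. "{0014DC77-...}", 36 or 38 chars)
# from plain integer-style IDs; purely a diagnostic check
id_chr <- as.character(sf_waterbodies$Permanent_)
is_guid <- grepl("^\\{?[0-9A-Fa-f-]{36}\\}?$", id_chr)
table(is_guid)
head(id_chr[is_guid])
head(id_chr[!is_guid])
```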
jordansread commented 5 years ago

I'm also combining the DL, filter, and mutate steps into a single function call for the task table, solely to cut back on redundant copies of NHD HR stored locally. Ideally, I'd 1) DL, 2) filter/process, then 3) merge all, but DL and filter/process are combined into a single step. Kind of a pain, because it means every time we change how we filter (or add/remove a lake from keep_IDs or remove_IDs), we'll have to download all of the files again.

Wondering whether it was a bad idea to combine the two steps...

update

Even though the zip file for NHD HR for the state of WI is ~500 MB, if I filter down to only lakes/ponds/impoundment waterbody features and save as an .rds file, it is only 65 MB. Maybe that is worth keeping around? MN may be twice as big, but the other states would be smaller.
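For reference, the filter/save part on its own is small. A sketch, where the FType codes (390 = LakePond, 436 = Reservoir) and the hectare conversion are assumptions to double-check:

```r
library(sf)
library(dplyr)

# Filter a state's NHDWaterbody layer down to lake/pond/reservoir features and
# cache the much smaller result as .rds; FType codes are assumed NHD values
filter_and_cache <- function(waterbodies_sf, out_rds, min_area_ha = 4) {
  waterbodies_sf %>%
    filter(FType %in% c(390, 436)) %>%        # 390 = LakePond, 436 = Reservoir
    filter(AreaSqKm * 100 > min_area_ha) %>%  # AreaSqKm (km^2) -> hectares
    saveRDS(out_rds)
  out_rds
}
```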

jzwart commented 5 years ago

👍 on the Permanent_ as site_id. NHD (not plus) uses PERMIDs as identifiers while NHDPlus uses COMIDs as identifiers. Waterbodies that have a COMID assigned will retain it as their PERMID, so I think we should be OK switching from NHD to NHDPlus.

jordansread commented 5 years ago

Thanks Jake. I wonder if the lakes that don't have COMIDs are the ones with the long character Permanent_ IDs.

jzwart commented 5 years ago

That was my guess but it would be nice to know for sure

jzwart commented 5 years ago

> it is only 65 MB. Maybe that is worth keeping around?

That seems like a reasonable file size to have around.

jordansread commented 5 years ago

Total sf object file size (as an .rds file) is 230 MB for the 8 states, filtered down to lakes > 4 ha. Seems reasonable as a starting point. We'll add ways to get keep_IDs and remove_IDs implemented.

A gotcha I ran into: it seems the shapefiles from ftp://rockyftp.cr.usgs.gov/vdelivery/Datasets/Staged/Hydrography/NHD/State/HighResolution/ are truncated to 200K features(!), cutting off lakes in the Dakotas, MN, and maybe some other states that I didn't check. I switched to GDB files and that seemed to resolve the issue, but I found it surprising.
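A cheap sanity check, assuming sf, is to compare the feature count GDAL reports for the layer against what actually gets read (the GDB path here is illustrative):

```r
library(sf)

# Compare advertised vs. read feature counts to catch silent truncation
gdb <- "NHD_H_Minnesota_State_GDB.gdb"           # illustrative path
st_layers(gdb)                                   # lists layers with feature counts
wb <- st_read(gdb, layer = "NHDWaterbody", quiet = TRUE)
nrow(wb)                                         # should match the count above
```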

update

Kelsey noticed that shapefiles that contain over 200K features are broken up into multiple files. I didn't catch that...

jordansread commented 5 years ago

Processing the WQP to NHD HR crosswalk yields 50,278 unique monitoring locations. The old one had 3,700 locations (!) from back in May. The difference could be a combination of new sites added to WQP (but probably not that many) and the update to NHD high-res. We should be on the lookout for many of these sites being shoreline locations, which may not be representative of the lake as a whole.
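If we want to flag likely-shoreline sites, one possible check (an assumption, not something implemented) is the distance from a site to its lake polygon's boundary:

```r
library(sf)

# Rough shoreline flag for one lake's matched sites: distance to the polygon
# boundary; the 50 m cutoff is an arbitrary placeholder, and sites_sf and
# lake_poly are assumed to share a CRS
near_shore <- function(sites_sf, lake_poly, max_dist_m = 50) {
  d <- st_distance(sites_sf, st_boundary(lake_poly))  # n_sites x 1 matrix
  as.numeric(d[, 1]) < max_dist_m
}
```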

jordansread commented 5 years ago

Updates:

1_crosswalk_fetch

This is in pretty good shape, but I think we'll want two finalize targets at the end of the NHD HR task table: one that includes attributes, and one that just has the merged shapefiles. We can use the former to propagate GNIS names through for later use in the visualize stage. We also don't have a fetcher yet for the WIDNR hydrolayer or the Winslow shapefile. We'll need the WIDNR file to create WBIC crosswalks in 2_.

to do:

2_crosswalk_munge

We've got this pretty far along too, but Kelsey is going to revise the poly/poly crosswalk function. When we have an updated polygon crosswalk function, it might make sense to go back to 1_ and filter out the Great Lakes. Not a big deal, but we aren't going to model those, and they end up w/ a lot of WQP sites and data that we are carrying along for no reason. I've been avoiding that for now because I didn't want to re-calc all of the crosswalks (it is slow right now).

to do:

3_params_fetch

In good shape; we just want to track down more depths if we can. Also, this stage doesn't include WQP secchi right now, but maybe it should. That appears later in 6_temp_wqp, but secchi could be viewed as either a param or a driver, so maybe I am splitting hairs here...

4_params_munge

So far so good, but by the end of this stage we should probably have single tables (or lists in the case of hypsography) for params that go into glm.nml. Haven't dealt with that yet.

There are also some better functions available for the NLCD calculation (I think #69 covers this).
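For the hypsography piece above, the end-of-stage target could be a single named list keyed by site_id. A sketch, where the input table `hypso_df`, its column names, and the output target name are all assumptions about the munged table's shape:

```r
library(dplyr)

# Collapse a long hypsography table (site_id, depth_m, area_m2) into a named
# list of per-lake data frames, ready for glm.nml-building code
hypso_list <- hypso_df %>%
  arrange(site_id, depth_m) %>%
  split(.$site_id) %>%
  lapply(function(df) select(df, depths = depth_m, areas = area_m2))
saveRDS(hypso_list, "4_params_munge/out/hypso.rds")  # illustrative target name
```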

6 and 7 (drivers)

Haven't touched this yet for NLDAS, but working on revising the WQP part of 6_.

7a_temp_coop_munge

This uses crosswalks to connect coop data to our canonical IDs.

We'll need to adjust these w/ changes to crosswalks.

8_viz

We'll need to adjust these w/ changes to crosswalks.

jordansread commented 4 years ago

Calling this done