SimpleLab-Inc / wsb

USA Water Service Boundary data assimilation and prediction

Modify workflow to allow individual contributions of Tier 1 #117

Open ksonda opened 2 years ago

ksonda commented 2 years ago

There may be an upcoming activity prioritizing the harvesting of Tier 1 boundaries from the remaining "Very Large" systems, and these should be able to be integrated without too much fanfare into the existing workflow.

A proposal:

  1. Set up /contributions-tier1/{state} subdirectories.
  2. Authorized contributors will place individual {st}{pwsid}.geojson files of Tier 1 boundaries in each folder.
  3. Add or modify src/transformers/states/transform_wsb_{st}.R as appropriate to merge in these new Tier 1 boundaries prior to the match and modeling steps.
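
The three steps above could be sketched roughly as follows. This is only an illustration in Python (the repository's transformers are R scripts), and it rests on assumptions that are not part of the proposal: the helper names `load_contributions` and `merge_tier1` are hypothetical, each file is assumed to hold exactly one boundary, the file stem is used as the lookup key, and a contributed Tier 1 boundary is assumed to override any existing boundary with the same key.

```python
import json
from pathlib import Path


def load_contributions(root: Path, state: str) -> dict:
    """Read every *.geojson file in root/<state> into {file stem: feature}.

    Hypothetical helper: the folder layout follows the proposal, but the
    one-boundary-per-file rule is an assumption made for this sketch.
    """
    features = {}
    for path in sorted((root / state).glob("*.geojson")):
        with path.open() as f:
            data = json.load(f)
        # Accept a bare Feature or a FeatureCollection holding one feature.
        if data.get("type") == "FeatureCollection":
            candidates = data["features"]
        else:
            candidates = [data]
        if len(candidates) != 1:
            raise ValueError(f"{path.name}: expected exactly one boundary")
        # The file stem ({st}{pwsid}) is the key; upper-cased for consistency.
        features[path.stem.upper()] = candidates[0]
    return features


def merge_tier1(existing: dict, contributed: dict) -> dict:
    """Return a copy of `existing` in which contributed boundaries win.

    Assumption: an individually contributed Tier 1 boundary replaces
    whatever boundary was previously held under the same key.
    """
    merged = dict(existing)
    merged.update(contributed)
    return merged
```

In this sketch the merge would run before the match and modeling steps, so downstream code sees a single boundary per system.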

jess-goddard commented 2 years ago

Thanks for this suggestion, @ksonda.

I agree fully with points 2-3: having individual state transformer files from new Tier 1 boundaries will be critical. This should be easy to incorporate as new data becomes available. Given that you're suggesting individual pwsid boundaries rather than state level, I think your suggestion that we modify the original state transformer to incorporate the one-off pwsid boundaries is straightforward and should be implemented when we have that data. Detailed commenting and a modified developer guide can support this change pretty seamlessly.

My recommendation for point 1 is that, rather than using subdirectories that keep the data on GitHub, we ask states to host their own FTP, Drive folder, or site so that we can pull data from a reliable, maintained URL. This matches the current workflow, where all incoming data is brought in from upstream sources. It ensures that 1) upstream data has a clear, reproducible source; 2) there are no conflicts between GitHub data and state/agency-maintained data as changes happen over time; and 3) we do not risk hitting GitHub's file size limit (100MB; unlikely for individual pwsids, but I could easily see a state offering a smaller subset of data with many pwsids).

In short, the repository is designed to ingest/transform/load, but not to store and maintain external data, which is a formidable task to do well if the data is to remain current and accessible beyond the repository.

ksonda commented 2 years ago

Thanks @jess-goddard. I see that there are good reasons to separate this repository from data storage. Regarding the "states host their own FTP" recommendation, I agree fully with that for large aggregations that more states might make available. The issue is that in the short term there will likely be an EPIC-led activity to source the ~200 'very large' systems directly from the relevant utilities, which are generally in states that do not currently have any kind of boundary collection program.

This process will need its own publicly visible submission and version-tracking mechanism, to be transparent about which individual boundaries were submitted by whom and with what underlying source, so that the data can be folded over and replaced by state sources if and when that is appropriate. GitHub is as good an option as any at this scale, since

Perhaps EPIC and I need to coordinate on creating a separate repo that has this directory structure. Then steps 2-3 can be implemented against those URLs.
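
To make the tracking requirement above concrete, here is a minimal sketch in Python of a per-boundary provenance record and a replacement rule. Everything named here is an assumption for illustration: the `Contribution` fields, the `origin` labels "utility" and "state", and the rule that a state-sourced boundary replaces a utility-sourced one. When replacement is actually appropriate is a policy decision that this sketch does not settle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contribution:
    """Hypothetical provenance record for one submitted boundary."""
    pwsid: str        # system identifier the boundary belongs to
    contributor: str  # who submitted the boundary
    source: str       # underlying source the boundary was derived from
    origin: str       # "utility" or "state" (labels assumed for this sketch)


def resolve(contributions):
    """Pick one record per pwsid.

    Assumed rule: a state-sourced record replaces a utility-sourced one;
    otherwise the first record seen for a pwsid is kept.
    """
    chosen = {}
    for c in contributions:
        current = chosen.get(c.pwsid)
        if current is None:
            chosen[c.pwsid] = c
        elif c.origin == "state" and current.origin != "state":
            chosen[c.pwsid] = c
    return chosen
```

Keeping the contributor and source on every record is what lets a later state submission replace an earlier one without losing track of who provided what.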

jess-goddard commented 2 years ago

@ksonda Yes I see the value here in what you're suggesting!

I like the idea of modularizing the data uploads into a small repo just for that purpose, but we can also discuss offline the pros/cons of keeping it separate from here. Let's connect when I'm back in the office on May 17.

ksonda commented 2 years ago

I've mocked something up here: https://github.com/cgs-earth/national-cws-boundary-update

ksonda commented 2 years ago

We have a contribution workflow set up here now: https://github.com/cgs-earth/ref_pws

It generates/updates a geopackage here any time a contribution is made: https://www.hydroshare.org/resource/c9d8a6a6d87d4a39a4f05af8ef7675ad/data/contents/contributed_pws.gpkg

If this is of interest to ping

jess-goddard commented 2 years ago

@ksonda Great, we have it on our agenda to connect with you this month about an integration.