CADWRDeltaModeling / dms_datastore

Data download and management tools for continuous data, built on Pandas. See the documentation at https://cadwrdeltamodeling.github.io/dms_datastore/
MIT License

Proposal: Modeling Data subdirectories are a Drop Box #52

Open dwr-psandhu opened 1 month ago

dwr-psandhu commented 1 month ago


Proposal 1: Modeling Data subdirectories are a Drop Box

In this proposal, you can put things anywhere in Modeling_Data as long as:

  * you can point to a reader that reads the data, applies provider flags the way you want, and transforms it into a dataframe;
  * the filenames sort lexicographically;
  * you make a small entry in recipes/data_recipes.yaml describing how to read the data, plus a few pieces of metadata;
  * a checker will be provided. Nightly, the files will be swept into /formatted, after which they are safe; whether the raw data is safe is largely up to users.
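For concreteness, an entry in recipes/data_recipes.yaml might look something like the sketch below. The field names are illustrative, not a committed schema; read_ts is the existing dms_datastore reader mentioned later in this thread, but everything else here is an assumption:

```yaml
# Hypothetical sketch of a data_recipes.yaml entry -- field names are
# illustrative, not a committed schema.
moke_flow:
  pattern: dropbox/data/moke/*.csv   # matched files must sort lexicographically
  reader: read_ts                    # existing dms_datastore reader, or a custom one
  station_id: moke_example           # illustrative station id
  variable: flow
  unit: ft^3/s
  provider: EBMUD
  notes: acquired over email; may not be maintained long term
```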

Use cases:

  1. Mokelumne: populated by USGS data and two types of EBMUD data.
  2. Daily data that can be downloaded: this would not be included if a daily downloader can handle it in roughly the same way we now handle continuous data.
  3. CCF gates: this is provided to us automatically in a subfolder by the SCADA people. It is continuous but not regular. I derive a simpler series that is even more irregular but sparser, and that distills the information in a useful way.
  4. Banks pumping: this is grabbed opportunistically and considerably transformed, from pumping switches to flow in CFS.
  5. Unofficial data from official sources: Often we get data from the flow/WQ groups at NCRO that they don't want to publish officially but that we can describe as a short term station. Often these are "cross program" collections – for instance stage data collected by the flow group. They are acquired during projects or over email. They may or may not be maintained long term.

The proposal is that these can be put in /dropbox/data, but also anywhere in Modeling_Data.

The crux is data_recipes.yaml, the purpose of which is to do the following:

  1. Make sure we know what is/has been swept into our repository
  2. Make it easy to update stray data by adding more
  3. Connect the entries to enough metadata and standardization.
  4. Address how possibly-overlapping updates work.
  5. Allow the user to launch a checker.
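A checker covering points 1–5 could be quite small. The sketch below is hypothetical, not existing dms_datastore code: it validates that a recipe entry carries the required metadata and that the files it covers sort lexicographically, which is what a nightly sweep into /formatted would rely on. All key names are assumptions following the illustrative schema above.

```python
# Hypothetical recipe checker sketch -- field names are illustrative,
# not a committed dms_datastore schema.
REQUIRED_KEYS = {"pattern", "reader", "station_id", "variable", "unit"}

def check_recipe(recipe, filenames):
    """Return a list of problems found; an empty list means the recipe passes."""
    # Point 3: the entry must connect to enough metadata.
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - recipe.keys())]
    # Lexicographic order is assumed to match chronological order by the proposal.
    if filenames != sorted(filenames):
        problems.append("filenames do not sort lexicographically")
    return problems

recipe = {
    "pattern": "dropbox/data/moke/*.csv",  # hypothetical Mokelumne entry
    "reader": "read_ts",
    "station_id": "moke_example",
    "variable": "flow",
    "unit": "ft^3/s",
}
print(check_recipe(recipe, ["moke_2022.csv", "moke_2023.csv"]))  # → []
```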
dwr-psandhu commented 1 month ago

I glanced through it. It seems straightforward enough and should scale.

What if I ask you to conceptually think of the raw downloads the same way?

For example, if I download CIMIS data and drop it in the dropbox, or do the same with any of the existing NOAA or USGS sources, it should work the same way.

What do you think?

dwr-psandhu commented 1 month ago

@esatel responded to the above

Well, you could certainly drop one of those files and name "read_ts" as the reader. This would be very helpful, since read_ts is aware of a lot of quirks. We could abstract out the "reformat" stuff as separate functions ... would take some effort, but it is a good move.
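Naming a reader in the recipe, as suggested above, implies some way of resolving that name to a function. One way to make institutional readers (like read_ts) and one-off readers plug in identically is a simple registry; the sketch below is illustrative, not existing dms_datastore API.

```python
# Hypothetical reader registry sketch: a recipe's "reader" field is resolved
# to a registered function, so read_ts and one-off readers plug in the same way.
READERS = {}

def register_reader(name):
    """Decorator that registers a reader function under the given name."""
    def wrap(fn):
        READERS[name] = fn
        return fn
    return wrap

@register_reader("read_csv_simple")
def read_csv_simple(path):
    # Stand-in one-off reader; a real one would return a pandas DataFrame.
    return f"parsed {path}"

def read_with_recipe(recipe, path):
    """Dispatch to whatever reader the recipe entry names."""
    return READERS[recipe["reader"]](path)

print(read_with_recipe({"reader": "read_csv_simple"}, "dropbox/data/x.csv"))
# → parsed dropbox/data/x.csv
```

In this scheme, abstracting the "reformat" stuff into separate registered functions, as suggested above, would just mean adding a second registry alongside the readers.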

One big difference is that for most of the big agencies and programs it is "worth it" to work through all the quirks because you get 1000 time series for your effort. Maybe that would be true for CIMIS; I'm not sure. You've requested instructions on how to add another data vendor, and I think that would be a good thing to look at – now that I know what things might be tweaked, I've been meaning to refactor a few things to expose what is needed to make it extensible.

The initial drop box idea was focused on the case where it is not worth it ... maybe Dave Huston gave us a couple of extra series, or something like that. So I wasn't planning to provide code to extract metadata, or expecting lots of similar data coming in.


There are some cases in between, though. For instance, the very old USGS data from the Aquarius system still, I believe, does not come across in NWIS. It could be that we can offer EITHER a function to reformat OR some explicit metadata, and it would handle both the "institutional" and "one-off" cases.


The easiest way to approach this is probably to orthogonalize the two so we don't have to change the raw approach right away.