CADWRDeltaModeling / dms_datastore

Data download and management tools for continuous data, built on pandas. See documentation: https://cadwrdeltamodeling.github.io/dms_datastore/
MIT License

Allow finer grained request to populate_repo #29

Open dwr-psandhu opened 6 months ago

dwr-psandhu commented 6 months ago

Currently populate_repo can only be fine-grained down to the agency and variable combination. This issue is to see if it is possible to fine-grain the request down to each individual station.

water-e commented 6 months ago

Can you elaborate on the motive?

If it is for debugging and the main objective is to mask away anything that doesn't fit a pattern, it is probably possible to use station. I'm concerned about a few steps, particularly the ncro-cdec handoff. Also, I'd have to add it through a deep set of functions. But ... we're talking about a day's work or so. It seems reasonably motivated when you have a debugging issue.

If it is for parallelism, I'm skeptical of both the feasibility and the payoff. We don't have a "master list" of everything we're going to download, so even if such an approach were efficient we wouldn't have anything to loop through. Compare that with variables and agencies: we can loop through those effortlessly. You've taken that out of the scripts and into Groovy, and I'll try to understand why, although I don't think it is terribly needed and you aren't quoting fast speeds.

In the current approach, we loop through agencies and variables, which is easy, and we ask for everything at every station and see what we get ... this doesn't require upkeep or miss new stations or instruments. I don't mind a more directed approach for mission-critical items or checks -- it is a great idea to have a few things we know we want to get and insist on that. But it shouldn't be the general workflow here. For SCHISM mission-critical items, I just use the downloading scripts with a station list; they are designed for that.
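The "mask away anything that doesn't fit a pattern" idea above could be sketched as a simple glob filter applied after each agency query. This is a hypothetical illustration, not the actual dms_datastore API; the function name and station identifiers are placeholders.

```python
# Hypothetical sketch: filter an agency's station list by a glob pattern,
# so the existing agency/variable loop can be narrowed to specific stations.
# Names here are illustrative, not part of the real dms_datastore interface.
import fnmatch

def filter_stations(stations, pattern=None):
    """Keep only stations matching a glob pattern; None keeps everything."""
    if pattern is None:
        return list(stations)
    return [s for s in stations if fnmatch.fnmatch(s, pattern)]

# Example: ask the agency for everything, then mask by pattern.
stations = ["mrz", "mal", "rsac054", "rsac075"]
print(filter_stations(stations, "rsac*"))  # ['rsac054', 'rsac075']
```

A filter like this would still preserve the "ask for everything and see what we get" workflow as the default, since passing no pattern is a no-op.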

I also don't see how there could be a payoff from parallelism. We're multiprocessing line by line and the payoff is ... OK. The one exception was NOAA predictions, and those have been removed from nightly downloads. It would be more efficient to parallelize reformat down to the half-agency level (e.g., stations in [a-m]). That job isn't terribly hard, and a 2.5-hour chore could probably be taken down to 0.5 hours.
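The half-agency split suggested above could look like the sketch below: partition station names alphabetically so two reformat jobs run in parallel. The station names and the boundary letter are illustrative assumptions.

```python
# Sketch of a half-agency partition: split station names into [a-m] and
# [n-z] groups so two reformat workers can each take one half.
def split_by_letter(stations, boundary="n"):
    """Return (names before boundary, names at/after boundary), sorted."""
    ordered = sorted(stations)
    first = [s for s in ordered if s[:1].lower() < boundary]
    second = [s for s in ordered if s[:1].lower() >= boundary]
    return first, second

# Illustrative station names only.
low, high = split_by_letter(["anh", "mrz", "oh4", "sjj"])
print(low, high)  # ['anh', 'mrz'] ['oh4', 'sjj']
```

Each half could then be handed to a separate process, which is where the rough 2.5-hour to 0.5-hour estimate in the comment would come from.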