NOAA-OWP / hydrotools

Suite of tools for retrieving USGS NWIS observations and evaluating National Water Model (NWM) data.

As a User, I would like HydroTools to expose public api tooling at the subpackage level #177

Open aaraney opened 2 years ago

aaraney commented 2 years ago

Currently, we are a little loose with the way we organize our "public" APIs. For example, in hydrotools's metrics subpackage, a user must do the following to access the "public" functions:

from hydrotools.metrics import metrics

mean_se = metrics.mean_squared_error(...)

This can lead to confusion when "non-public" methods are exposed in subpackage submodules alongside "public" methods:

from hydrotools.events.event_detection import decomposition as ev

# likely incorrect api call
events = ev.event_boundaries(...)

The aforementioned user story is from @Castronova's experience (please chime in with further context as you see fit).

This movement toward flatter Python package structures is apparent in many large Python libraries (e.g., pandas). To be clear, my point is not to copy what others are doing just because these practices are becoming idiomatic. Instead, I would like to talk about the pros and cons of flattening our subpackage structure to expose APIs at the subpackage level.
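For concreteness, flattening could be as simple as re-exporting the "public" functions in each subpackage's __init__.py (a sketch only; which names get re-exported would need collective agreement):

# hydrotools/metrics/__init__.py (sketch)
from .metrics import mean_squared_error

__all__ = ["mean_squared_error"]

which would then allow:

from hydrotools import metrics

mean_se = metrics.mean_squared_error(...)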

Given our scientific audience, I want to avoid introducing ambiguity into what someone might infer a method does. For example:

from hydrotools import events as ev

# This is ambiguous about the method of event detection being used.
events = ev.event_detection(...)

The above code snippet is not clear enough, IMO. However, I think there is a middle ground where we maintain scientific integrity and maximize usability:

from hydrotools import events as ev

# I think this toes the line between specificity and usability well
events = ev.decomposition(...)

Much of what I am trying to describe is concretized in sklearn's package layout. There is an initial categorization in their subpackage structure (this mirrors ours), with the majority of "public" methods exposed in a flat structure at the subpackage level.
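For example, sklearn exposes its metrics directly at the subpackage level:

from sklearn import metrics

mse = metrics.mean_squared_error([1.0, 2.0], [1.1, 1.9])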

I don't want to talk implementation details yet (unless required); instead, I'd rather hear what others think and where you all deem this appropriate or inappropriate. I'm looking forward to hearing your thoughts.

Pinging a few people who might want to provide feedback (@jarq6c, @hellkite500, @christophertubbs, @amaes3owp, @ZacharyWills, @jameshalgren)

aaraney commented 2 years ago

related to #141

jarq6c commented 2 years ago

This is a bit too abstract for me and could use more concrete examples (ha!). It seems we could be discussing multiple related issues. I generally agree that flat package structures are better than deep ones. But as it pertains to public and private APIs, I'm not clear on the distinction. Is event_boundaries a public or a private API? I'm not sure.

If the issue is that hydrotools.events is too deeply nested, this seems like an architecture and style (pattern) issue as much as an interface issue. Were I to write that package today, I might use a strategy pattern via some kind of hydrotools.events.EventDetector abstract class. That seems flatter. Is that what you meant?
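Roughly something like this sketch (all names hypothetical, just to illustrate the shape):

from abc import ABC, abstractmethod

import pandas as pd

class EventDetector(ABC):
    """Strategy interface: one concrete detector per detection method."""

    @abstractmethod
    def detect(self, series: pd.Series) -> pd.DataFrame:
        """Return a DataFrame of event start and end times."""

class DecompositionDetector(EventDetector):
    """Would wrap the existing decomposition-based list_events."""

    def detect(self, series: pd.Series) -> pd.DataFrame:
        ...

Each detection method becomes one class, and they all live flat under hydrotools.events.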

aaraney commented 2 years ago

This is a bit too abstract for me and could use more concrete examples (ha!).

😏

But as it pertains to public and private APIs, I'm not clear on the distinction. Is event_boundaries a public or a private API?

Ah right, I should have been more complete and included a definition of sorts. My use of "public" vs. "private" API in this context means functions, methods, and constants that are or aren't exposed to users of hydrotools. Python lacks access-modifier keywords like private, public, and protected, as are common in other OOPLs (as I'm sure you are aware; just trying to provide background). Idiomatic Python prepends a single _ to methods, functions, and constants that are deemed "private." However, I often find this clunky and that it reduces code readability, so I personally try to limit prepending _ to objects (though I am open to changing my opinion).
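For example:

# events.py (illustrative only)

__all__ = ["list_events"]  # also controls what `from events import *` exposes

def list_events(series):
    """Deemed "public": no leading underscore."""
    return _event_boundaries(series)

def _event_boundaries(series):
    """Deemed "private" by convention: single leading underscore."""
    return series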

I think what I am really trying to get at is that the most frequently used tools (here, "tools" meaning classes and methods; "most frequently used" is subjective and will require collective agreement) should be the first things a user is presented with. So, as you were alluding to in your comment, @jarq6c, my critiques are more about code organization and code style.

In the case of hydrotools.events.event_detection.decomposition, a user will likely have to go to the documentation to determine which method they need, instead of letting the API guide them to the method they will most likely use, list_events. IMO, several functions in the hydrotools.events.event_detection.decomposition submodule could be considered "helper" functions (mark_event_flows, find_local_minimum, event_boundaries, etc.). These functions are specific to the decomposition method and are helpful for demonstration and likely reusable; however, it might "make more sense" to move them to a separate submodule, possibly hydrotools.events.event_detection.decomposition_utils, as sketched below. The purpose would be to clearly distinguish the functions, methods, and constants that are seldom used. Does this example and more thorough explanation resolve my original, nebulous comment? If not, please let me know where I can clarify, or attempt to, ha!
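In tree form (decomposition_utils is just a placeholder name):

hydrotools/
└── events/
    └── event_detection/
        ├── decomposition.py        # list_events and other "public" tools
        └── decomposition_utils.py  # mark_event_flows, find_local_minimum,
                                    #   event_boundaries, and other helpers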

jarq6c commented 2 years ago

I think I see what you mean, but I also think further discussion would be helpful. The "helper" methods you listed in this example will be necessary for many forms of event detection (not just decomposition). It's true that list_events is the primary entry point. However, I don't see the "helper" methods as necessarily "private."

To be clear, the events package ought to be re-organized. It was written before I even knew what a software design pattern was. However, I'm not sure what the agenda for this ticket should be.

Is this issue with the events package specifically or with all the subpackages? Is the main issue that events is too deeply nested?

I've been meaning to add Eckhardt's filter to events for some time, so this may be an opportunity to implement these concepts and demonstrate how the various "helper" methods can be applied by different algorithms.

jarq6c commented 7 months ago

Why not resurrect a two-year-old issue?

I thought I would add the method below to events for hydrograph baseflow separation. However, it might be a good opportunity to implement some of the points raised on this issue. Thoughts on where this method should live and how we get it there, @aaraney?

from typing import Union

from hydrotools.nwis_client.iv import IVDataService

import pandas as pd
import matplotlib.pyplot as plt

def separate_baseflow(
        series: pd.Series,
        output_time_scale: Union[pd.Timedelta, str],
        recession_time_scale: Union[pd.Timedelta, str] = "1D",
        minimum_recession_length: Union[pd.Timedelta, str] = "5D",
        random_error: float = 0.01
    ) -> pd.Series:
    """Applies the digital recursive filter from Eckhardt (2005)
    to series and returns a baseflow time series. A linear reservoir recession
    constant (a) is estimated using the method found in Eckhardt (2008).
    The maximum baseflow index (BFI_max) is computed using the reverse filtering
    method found in Collischonn & Fan (2013).

    Args:
        series (pandas.Series): float Series of streamflow values with a DatetimeIndex.
        output_time_scale (pandas.Timedelta or str): Desired output time scale of the
            resulting baseflow time series.
        recession_time_scale (pandas.Timedelta or str): Time scale over which to evaluate
            potential baseflow recessions. Defaults to '1D' as in Eckhardt (2008). This
            parameter can be informed by catchment scale and "signal-to-noise ratio."
            Smaller catchments with faster responding streams may be able to use
            a shorter recession_time_scale (say '12h'). However, smaller values will
            be more susceptible to random fluctuations in streamflow.
        minimum_recession_length (pandas.Timedelta or str): Minimum length of time to
            consider when determining potential recessions. This is the amount
            of time over which consecutive streamflow values should all decrease.
            Defaults to '5D' as in Eckhardt (2008). Should be some multiple of
            recession_time_scale; 3 or 5 times the recession_time_scale works
            for many cases.
        random_error (float): Expected random error of a single streamflow value
            at the recession_time_scale. Defaults to 1% (0.01) as in Eckhardt (2008).

    Returns:
        pandas.Series: A separated baseflow time series derived from the input at
            the desired output_time_scale.

    """
    # Identify recessions
    resampled = series.resample(recession_time_scale).first().interpolate()
    difference = resampled.diff()
    is_recession = difference.rolling(
        window=minimum_recession_length,
        center=True).apply(lambda w: (w < 0.0).all())
    previous = resampled.shift(1)

    # Fit line through origin and
    # "upper bound" of data to
    # estimate recession constant (a)
    # This assumes a linear reservoir
    x = previous[is_recession == 1.0]
    y = resampled[is_recession == 1.0]
    recession_constant = (y / x).max() * (1.0 - 2.0 * random_error)

    # Convert slope to output time scale
    periods = pd.Timedelta(recession_time_scale) / pd.Timedelta(output_time_scale)
    recession_constant = 1 - ((1 - recession_constant) / periods)
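    # (note: this linear conversion appears to be a first-order approximation
    # of the exact rescaling recession_constant ** (1 / periods) for values near 1)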

    # Estimate maximum baseflow index (BFI_max) using the backward
    # filtering method from Collischonn & Fan (2013)
    baseflow = series.resample(output_time_scale).first().interpolate()
    maximum_baseflow = baseflow[::-1].copy()
    for idx in range(1, maximum_baseflow.size):
        maximum_baseflow.iloc[idx] = min(
            maximum_baseflow.iloc[idx-1] / recession_constant,
            maximum_baseflow.iloc[idx])
    max_baseflow_index = maximum_baseflow.sum() / baseflow.sum()

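    # Estimate baseflow by applying the Eckhardt (2005) recursive filter,
    # where b = baseflow, y = streamflow, a = recession constant,
    # and BFI_max = maximum baseflow index:
    #   b[t] = ((1 - BFI_max) * a * b[t-1] + (1 - a) * BFI_max * y[t])
    #          / (1 - a * BFI_max), constrained so that b[t] <= y[t]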
    for idx in range(1, baseflow.size):
        estimate = (((1 - max_baseflow_index) * recession_constant * baseflow.iloc[idx-1] +
            (1 - recession_constant) * max_baseflow_index * baseflow.iloc[idx]) /
            (1 - recession_constant * max_baseflow_index))
        baseflow.iloc[idx] = min(estimate, baseflow.iloc[idx])
    return baseflow

# Get observed data
site_code = "01646500"
service = IVDataService()
observed = service.get(
    sites=site_code,
    startDT="2023-01-01",
    endDT="2024-01-01"
    )

# Preprocess data
observed = observed[['value_time', 'value']]
observed = observed.drop_duplicates(['value_time'])
observed = observed.set_index('value_time')

# Baseflow separation
observed = observed.resample("15min").first()
observed["baseflow"] = separate_baseflow(observed["value"], "15min")
observed.plot()
plt.show()
