Unidata / MetPy

MetPy is a collection of tools in Python for reading, visualizing and performing calculations with weather data.
https://unidata.github.io/MetPy/
BSD 3-Clause "New" or "Revised" License

Avoid implicit .compute() calls on dask arrays #1615

Open · rpmanser opened this issue 3 years ago

rpmanser commented 3 years ago

Dask implicitly calls .compute() on an array when it is evaluated in a conditional statement. This is not desirable when chaining together MetPy functions with dask arrays as inputs. An example of this scenario is in dewpoint_from_relative_humidity:

    def dewpoint_from_relative_humidity(temperature, relative_humidity):
        if np.any(relative_humidity > 1.2):
            warnings.warn('Relative humidity >120%, ensure proper units.')
        return dewpoint(relative_humidity * saturation_vapor_pressure(temperature))
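
As a minimal sketch of where the implicit compute happens (the array below is made up for illustration): np.any() on a dask array stays lazy, but evaluating the result as a boolean in the if statement forces an implicit .compute().

    import numpy as np
    import dask.array as da

    relative_humidity = da.random.random((1000, 1000), chunks=(250, 250))  # lazy dask array
    exceeds = np.any(relative_humidity > 1.2)  # still lazy: a 0-d dask array
    if exceeds:  # bool() on a dask array triggers an implicit .compute() here
        print('some values exceed the threshold')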

In this case I think an optional argument like check_values with a default of True would work fine. Something like:

    def dewpoint_from_relative_humidity(temperature, relative_humidity, check_values=True):
        if check_values and np.any(relative_humidity > 1.2):
            warnings.warn('Relative humidity >120%, ensure proper units.')
        return dewpoint(relative_humidity * saturation_vapor_pressure(temperature))
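
If such a flag existed (check_values is a proposal here, not an existing MetPy argument), a dask-based workflow could opt out of the check to keep everything lazy:

    # Skips np.any(), so no implicit .compute() is triggered on dask inputs
    td = dewpoint_from_relative_humidity(temperature, relative_humidity, check_values=False)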

Alternatives for similar statements elsewhere in the code base will also need to be considered. This should be the first major step toward completing #1479.

Let's document functions from the calc module here that need modification to avoid implicit .compute() calls and discuss potential solutions.

Functions

Solutions

rpmanser commented 3 years ago

I don't think it's possible to avoid implicit .compute() calls without significantly altering the codebase to specifically alleviate this problem. I'm not sure there would be a noticeable improvement in performance either. I'm going to close this, but if someone thinks of a good solution it can be reopened and revisited.

rpmanser commented 1 year ago

Reopening this issue to explore some solutions discussed with @dcamron. The first thing we can try is calling np.max() on relative_humidity and comparing that scalar against the warning threshold, hopefully retaining lazy evaluation. This raises the question of whether it is okay for the warning itself to be lazy, i.e., the warning would not appear here:

    dewpoint = dewpoint_from_relative_humidity(temperature, relative_humidity)

but instead here:

    dewpoint.compute()

I think this is okay, as dask users should understand that anything in their task graph can emit warnings or raise errors, and that they are responsible for tracing where the warning or error occurred in their workflow.
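
As a rough sketch of the deferred-warning behavior described above (not MetPy's actual implementation; the helper name is made up), the check could be pushed into the task graph with da.map_blocks so the warning only surfaces at compute time:

    import warnings
    import numpy as np
    import dask.array as da

    def _warn_if_unphysical(block):
        # Runs per chunk when the graph is computed, so the warning is deferred
        if np.any(block > 1.2):
            warnings.warn('Relative humidity >120%, ensure proper units.')
        return block

    relative_humidity = da.random.random((1000, 1000), chunks=(250, 250))
    checked = da.map_blocks(_warn_if_unphysical, relative_humidity)
    # No warning is emitted yet; it would appear during checked.compute()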

dopplershift commented 1 year ago

I think the delayed warning isn't ideal but, as you pointed out, is par for the course with Dask, so that's fine.