Feature Request: Calculate Total Spread

icastorm commented 2 months ago

Issue

An important function of the DART diagnostic toolkit is to compare the RMSE to the "total spread", which is defined in src/pydartdiags/obs_sequence/obs_sequence.py as the sqrt(sum(sd+obs_err_var)). Given the usefulness of the total spread value in a variety of contexts (temporal evolution, comparisons between observation types, vertical distribution, etc.), it seems appropriate to add a function to the plots.py script that calculates the "total spread" of some set of observations. Total spread plots could be added later as well.

Solution(s)

To make calculating the total spread easier, the observation error variance should be added as a column to the dataframe created by the obs_sequence class. This would probably be a useful addition regardless.
A function should be added to the plots.py script that, given a dataframe of observations with columns for observation error variance and ensemble variance, returns the "total spread" value given by sqrt(sum(ensemble_sd+obs_err_sd)). Here, ensemble_sd and obs_err_sd are the square roots of ensemble variance and observation error variance respectively.
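A minimal sketch of what such a function might look like. The expression for total spread is written slightly differently in different places, so this sketch makes its assumptions explicit: both input columns hold variances, the two variances are added per observation, and the sum is averaged over the group before taking the square root. The function name and the "ensemble_var" column name are hypothetical, and the normalization should be checked against the dart_diags output before relying on it.

```python
import numpy as np
import pandas as pd


def total_spread(df, ens_var_col="ensemble_var", obs_err_var_col="obs_err_var"):
    """Total spread for a group of observations (hypothetical helper).

    Assumption: both columns hold variances, and the summed variances are
    averaged over the group before the square root is taken.
    """
    pooled_variance = df[ens_var_col] + df[obs_err_var_col]
    return float(np.sqrt(pooled_variance.mean()))


# Made-up values, purely for illustration.
obs = pd.DataFrame(
    {
        "ensemble_var": [1.0, 4.0, 3.0],
        "obs_err_var": [3.0, 0.0, 1.0],
    }
)
print(total_spread(obs))
```

Keeping the column names as parameters means the same helper could be pointed at prior or posterior variance columns without changes.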

Testing

To ensure parity with the MATLAB diagnostic code base, total_spread values should be compared to those in the output of dart_diags.

hkershaw-brown commented 2 months ago

Thanks @icastorm, great request. Currently totalspread is missing.

Here is how I was thinking about it with rmse and bias (I was focusing only on these for a demo of the plots!):

Each column is a single-observation calculation, so squared error and bias are created as columns when we create the dataframe: https://github.com/NCAR/pyDARTdiags/blob/7d1b167fb3cbe0d6bb8c33f02cffdb4274889fe0/src/pydartdiags/obs_sequence/obs_sequence.py#L86-L89

sq_err = (mean-obs)**2
bias = mean-obs

For rmse, this is over a group of observations, so you select the group of observations and get the rmse and bias for that group of obs.

rmse = sqrt( sum((mean-obs)**2)/n )
bias = sum(mean-obs)/n

https://github.com/NCAR/pyDARTdiags/blob/7d1b167fb3cbe0d6bb8c33f02cffdb4274889fe0/src/pydartdiags/plots/plots.py#L133-L137
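As a rough illustration of the two steps described above (per-observation columns first, group statistics second), here is a small pandas sketch. It is not the linked pyDARTdiags code: the "type", "obs", and "mean" column names and the data values are made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up values; "type", "obs" and "mean" are stand-in column names.
df = pd.DataFrame(
    {
        "type": ["A", "A", "B", "B"],
        "obs": [1.0, 2.0, 3.0, 4.0],
        "mean": [2.0, 3.0, 3.0, 6.0],
    }
)

# Step 1: per-observation quantities, stored as columns.
df["sq_err"] = (df["mean"] - df["obs"]) ** 2
df["bias"] = df["mean"] - df["obs"]

# Step 2: statistics over each selected group of observations.
stats = df.groupby("type").agg(
    rmse=("sq_err", lambda s: np.sqrt(s.mean())),  # sqrt(sum(sq_err)/n)
    bias=("bias", "mean"),                         # sum(mean-obs)/n
)
print(stats)
```

Because the per-observation columns already exist, any grouping (by type, time, or vertical level) reduces to a mean over the group, and a total spread statistic could follow the same pattern.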

I think you're correct that we can treat totalspread in the same way. The function to calculate totalspread is the way to go.

obs_err_var is already a column in the dataframe. There may be something funky going on if you are not seeing an 'obs_err_var' column.

Longer term, I think we might want to split the diagnostic calculations into their own module - I'm guessing someone might want the calculations without necessarily making the plot.

icastorm commented 2 months ago

Sorry for the radio silence, it's been a busy couple of weeks and I've been on a bit of a time crunch. I will hopefully be back to working on this next week though...
