PeterRochford / SkillMetrics

A Python library for calculating and displaying the skill of model predictions against observations.
GNU General Public License v3.0
195 stars 91 forks source link

check_taylor_stats not working properly #21

Closed CarolineHaslebacher closed 2 years ago

CarolineHaslebacher commented 4 years ago

I think I found an issue in 'check_taylor_stats.py' and I will try to describe it in the following.

I am using numpy version 1.18.5 and python 3.6.9 64-bit. The version of 'check_taylor_stats' I used is copied at the bottom.

I want to raise the error of incompatible data, so I define data that is clearly not fulfilling

RMSs(i) = sqrt(STDs(i).^2 + STDs(1)^2 - 2*STDs(i)*STDs(1).*CORs(i)).

So if I define

STDs = [1,5,3,2]

CRMSDs = [1,7,6,4]

CORs = [6,8,3,1]

, which are some random values in a list, then my 'difference' list is

array([103., 44., 15])

So we get differences much above our threshold of 0.01. But as far as I understand it, we want the difference to be close to zero, so that we can be sure that

RMSs(i) = sqrt(STDs(i).^2 + STDs(1)^2 - 2*STDs(i)*STDs(1).*CORs(i)) is true.

Next, the 'index' for these differences is

(array([0, 1, 2]),)

, which is clearly not empty.

But the function will only raise an error 'if not index', so if the 'index' is an empty list. So it doesn't raise an error for my random data. Am I overlooking something?

In the other direction, if my differences are very small, so that I can be sure that the data is correct, for example with:

STDs = [1.0, 1.4325136021271947, 1.0617690033008276, 1.4678603104077521, 1.1292483168466108, 1.9391992365319153, 0.7424689187296046, 1.190407099774292] CRMSDs = [0.25735399231579914, 0.4455591033096046, 0.5688032547358672, 0.8937130204486814, 0.6372916529557612, 1.3136761418162093, 0.6242831508114802, 0.800117230131112] CORs = [1.0, 0.9960018185865716, 0.8494391285508488, 0.8024915283990488, 0.8275686942766974, 0.7824746978593649, 0.7822082605692278, 0.7463335364755743]

I get the differences

array([3.63507139e-15, 8.57879111e-16, 1.66799715e-15, 2.32355360e-15, 7.71995646e-16, 4.27305282e-16, 8.67107592e-16])

which are all well below the threshold.

But now I think this is part of the problem; the variable 'index' looks like this

(array([], dtype=int64),)

which also doesn't count as empty, because of its shape, so 'if not index' also is not the case. So I am not able to raise the error so far. I can only raise it if I define

index = []

I would change if not index to

if len(index[0]) != 0: , but I see that this has also shortcomings, if somehow 'index' even so is a list and not an array.

But for now, it will raise an error for me if one of the differences is above the threshold.

The function 'check_taylor_stats.py' I used:


import numpy as np

def check_taylor_stats(STDs, CRMSDs, CORs, threshold = 0.01):

'''

Checks input statistics satisfy Taylor diagram relation to <1%.

Function terminates with an error if not satisfied. The threshold is

the ratio of the difference between the statistical metrics and the

centered root mean square difference:

abs(CRMSDs^2 - (STDs^2 + STDs(1)^2 - 2*STDs*STDs(1)*CORs))/CRMSDs^2

Note that the first element of the statistics vectors must contain

the value for the reference field.

INPUTS:

STDs : Standard deviations

CRMSDs : Centered Root Mean Square Difference 

CORs : Correlation

threshold : limit for acceptance, e.g. 0.1 for 10% (default 0.01)

OUTPUTS:

None.

Author: Peter A. Rochford

Symplectic, LLC

www.thesymplectic.com

prochford@thesymplectic.com

Created on Dec 3, 2016

'''

if threshold < 1e-7:

raise ValueError('threshold value must be positive: ' + str(threshold))

diff = np.square(CRMSDs[1:]) \

- (np.square(STDs[1:]) + np.square(STDs[0]) \

- 2.0*STDs[0]*np.multiply(STDs[1:],CORs[1:]))

diff = np.abs(np.divide(diff,np.square(CRMSDs[1:])))

index = np.where(diff > threshold)

if not index:

ii = np.where(diff != 0)

if len(ii) == len(diff):

raise ValueError('Incompatible data\nYou must have:' +

'\nCRMSDs - sqrt(STDs.^2 + STDs[0]^2 - ' +

'2*STDs*STDs[0].*CORs) = 0 !')

else:

raise ValueError('Incompatible data indices: {}'.format(ii) +

'\nYou must have:\nCRMSDs - sqrt(STDs.^2 + STDs[0]^2 - ' +

'2*STDs*STDs[0].*CORs) = 0 !')

return diff
xiuyongwu commented 2 years ago

Hi, I think I had the same problem when using the keyword checkStats in sm.taylor_diagram(sdev, crmsd, ccoef, checkStats='on'). For example, when I give some random values to sdev, crmsd, ccoef like:

sdev = np.random.rand(4)
crmsd = np.random.rand(4)
ccoef = np.random.rand(4)

and then plot the taylor_diagram: sm.taylor_diagram(sdev, crmsd, ccoef, checkStats='on') The sentence seems to work fine but apparently the sdev, crmsd and ccoef does not fulfill the equation RMSs(i) = sqrt(STDs(i).^2 + STDs(1)^2 - 2*STDs(i)*STDs(1).*CORs(i)).

This suggests that the program does not check the quality of sdev, crmsd and ccoefdespite we set the keyword checkStats='on'.

After changing the sentence if not index to if len(index[0]) != 0:, the keywrod checkStats works indeed and it comes up with an error You must have: CRMSDs - sqrt(STDs.^2 + STDs[0]^2 - 2*STDs*STDs[0].*CORs) = 0 ! if we give some random values to sdev, crmsd, ccoef.

Thanks for your contribution.

PeterRochford commented 2 years ago

You are correct. This is a bug that I've now fixed by using the condition if np.any(index):