CDAT / cdutil

Climate Utilities for CDAT
3 stars 2 forks source link

Weight problems in cdutil.ANNUALCYCLE.climatology (daily data) #4

Closed chaosphere2112 closed 7 years ago

chaosphere2112 commented 7 years ago

I have found out that when using cdutil.ANNUALCYCLE.climatology on daily data, the months of different years seem to be equally weighted.

This results in (slight but visible) errors:

Test script: https://files.lsce.ipsl.fr/public.php?service=files&t=115de1c1cb57217b85776fd08fe21b23 Test data: https://files.lsce.ipsl.fr/public.php?service=files&t=c7258c48f12d3451bfa67e5585bbedef

Running the test script above will get the following output, and you can see that the difference between ANNUALCYCLEclimatology and computation by hand for all months, including unweighted February is ~1e6. But there is a bigger difference when I compute a correctly weighted February mean

python -i bug_ANNUALCYCLEclimatology.py
Variable shape = (3653, 9, 11)
Time axes go from 1959-1-1 12:0:0.0 to 1968-12-31 12:0:0.0
Latitude values = [-66.7283256  -67.84978441 -68.97123954 -70.09269035 -71.21413608
 -72.33557575 -73.45700815 -74.57843166 -75.69984422]
Longitude values = [ 120.375  121.5    122.625  123.75   124.875  126.     127.125  128.25
  129.375  130.5    131.625]
Fri Nov  6 17:35:08 2015  - Computing the climatology with cdutil.ANNUALCYCLE.climatology,
                please wait...
Fri Nov  6 17:35:26 2015  - Done!
time.clock diff = 17.36

Fri Nov  6 17:35:26 2015  - Computing the climatology by hand...
year, month, nb of selected days = 1959  2 28
year, month, nb of selected days = 1960  2 29
year, month, nb of selected days = 1961  2 28
year, month, nb of selected days = 1962  2 28
year, month, nb of selected days = 1963  2 28
year, month, nb of selected days = 1964  2 29
year, month, nb of selected days = 1965  2 28
year, month, nb of selected days = 1966  2 28
year, month, nb of selected days = 1967  2 28
year, month, nb of selected days = 1968  2 29
Difference range for month 1 = (-4.903731820604662e-06, 6.318861416332311e-06)
Difference range for February (unweighted) = (-5.2588326724389844e-06, 5.322724128120626e-06)
Difference range for February (WEIGHTED) = (-0.014045400572531008, -0.0006070483053850495)
Difference range for month 3 = (-7.192550192769431e-06, 6.8172331566529465e-06)
Difference range for month 4 = (-8.519490563685395e-06, 6.39597574547679e-06)
Difference range for month 5 = (-7.727838379878449e-06, 7.407895935784836e-06)
Difference range for month 6 = (-8.519490563685395e-06, 6.81559244952723e-06)
Difference range for month 7 = (-7.284841260002395e-06, 9.992045690410123e-06)
Difference range for month 8 = (-8.761498278886393e-06, 7.900114923131696e-06)
Difference range for month 9 = (-8.850097671597723e-06, 7.451375310552066e-06)
Difference range for month 10 = (-7.30945221505408e-06, 6.18965391652182e-06)
Difference range for month 11 = (-6.2370300426550784e-06, 6.243387851156967e-06)
Difference range for month 12 = (-6.288097772255696e-06, 6.534207223296562e-06)
Fri Nov  6 17:35:26 2015  - Finished computing the climatology by hand!
time.clock diff = 0.3
Acceleration = 57.8666666667

Migrated from: https://github.com/UV-CDAT/uvcdat/issues/1664

lee1043 commented 7 years ago

Same issue found when using cdutil.ANNUALCYCLE.departures(). Wondering if there was any progress with this issue...? @chaosphere2112 @doutriaux1

durack1 commented 7 years ago

@jypeter it seems you were lost in the original issue migration across repos, so pinging you here for your input

jypeter commented 7 years ago

I had kind of forgotten about this issue! I don't have anything new to add to this except maybe that the original thread also mentioned that improving the processing speed would be a nice addition

golaz commented 7 years ago

Thanks to @jypeter for reporting this bug. Let's hope it gets fixed soon.

Adding my 2 cents here regarding speed. I came up with an alternate direct computation of the climatologies using only 4 lines of code and got an even bigger speed-up. I'm seeing a 30x speed-up over cdutil with @jypeter's implementation and 1500x speed-up with my alternate implementation.

Here are the key lines:

# Compute climatologies
climato_values2 = np.zeros([12]+list(np.shape(var))[1:])
month = np.array( [ timax_absolute[i].month for i in range(len(timax_absolute)) ], dtype=np.int)
for i in range(1,13):
  climato_values2[i-1] = np.average(var[np.argwhere(month==i)],axis=0)

Attached is the script and below the full output. Just trying to raise the bar for @dnadeau4 and @doutriaux1 :wink:


Variable shape = (3653, 9, 11)
Time axes go from 1959-1-1 12:0:0.0 to 1968-12-31 12:0:0.0
Latitude values = [-66.7283256  -67.84978441 -68.97123954 -70.09269035 -71.21413608
 -72.33557575 -73.45700815 -74.57843166 -75.69984422]
Longitude values = [ 120.375  121.5    122.625  123.75   124.875  126.     127.125  128.25
  129.375  130.5    131.625]
Thu Mar  2 17:47:15 2017  - Computing the climatology with cdutil.ANNUALCYCLE.climatology,
                please wait...
Thu Mar  2 17:47:20 2017  - Done!
time.clock diff = 4.515801

Thu Mar  2 17:47:20 2017  - Computing the climatology by hand...
year, month, nb of selected days = 1959  2 28
year, month, nb of selected days = 1960  2 29
year, month, nb of selected days = 1961  2 28
year, month, nb of selected days = 1962  2 28
year, month, nb of selected days = 1963  2 28
year, month, nb of selected days = 1964  2 29
year, month, nb of selected days = 1965  2 28
year, month, nb of selected days = 1966  2 28
year, month, nb of selected days = 1967  2 28
year, month, nb of selected days = 1968  2 29
Difference range for month 1 = (-4.903731820604662e-06, 6.318861416332311e-06)
Difference range for February (unweighted) = (-5.2588326724389844e-06, 5.322724128120626e-06)
Difference range for February (WEIGHTED) = (-0.014047166386458088, -0.0006071392919806406)
Difference range for month 3 = (-7.192550192769431e-06, 6.8172331566529465e-06)
Difference range for month 4 = (-8.519490563685395e-06, 6.39597574547679e-06)
Difference range for month 5 = (-7.727838379878449e-06, 7.407895935784836e-06)
Difference range for month 6 = (-8.519490563685395e-06, 6.81559244952723e-06)
Difference range for month 7 = (-7.284841260002395e-06, 9.992045690410123e-06)
Difference range for month 8 = (-8.761498278886393e-06, 7.900114923131696e-06)
Difference range for month 9 = (-8.850097671597723e-06, 7.451375310552066e-06)
Difference range for month 10 = (-7.30945221505408e-06, 6.18965391652182e-06)
Difference range for month 11 = (-6.2370300426550784e-06, 6.243387851156967e-06)
Difference range for month 12 = (-6.288097772255696e-06, 6.534207223296562e-06)
Thu Mar  2 17:47:20 2017  - Finished computing the climatology by hand!
time.clock diff = 0.155475
Acceleration = 29.0451905451
Thu Mar  2 17:47:20 2017  - Computing the climatology by hand, second algorithm
Difference range for month 1 = (-6.318861437648593e-06, 4.903731806393807e-06)
Difference range for month 2 = (-2.3858707294266424e-06, 2.938530052176702e-06)
Difference range for month 3 = (-6.8172331779692286e-06, 7.192550164347722e-06)
Difference range for month 4 = (-6.39597574547679e-06, 8.519490556579967e-06)
Difference range for month 5 = (-7.4078959784174e-06, 7.727838337245885e-06)
Difference range for month 6 = (-6.81559244952723e-06, 8.519490556579967e-06)
Difference range for month 7 = (-9.992045747253542e-06, 7.284841231580685e-06)
Difference range for month 8 = (-7.900114979975115e-06, 8.761498236253829e-06)
Difference range for month 9 = (-7.451375324762921e-06, 8.850097657386868e-06)
Difference range for month 10 = (-6.189653952048957e-06, 7.309452179526943e-06)
Difference range for month 11 = (-6.243387858262395e-06, 6.237030028444224e-06)
Difference range for month 12 = (-6.5342072517182714e-06, 6.288097750939414e-06)
Thu Mar  2 17:47:20 2017  - Finished computing the climatology by hand, second algorithm!
time.clock diff = 0.002744
Acceleration = 1645.70007289

bug_ANNUALCYCLEclimatology2.py

doutriaux1 commented 7 years ago

@golaz this a totally different issues your script will barf on any general dataset. This is the difference between one off user geenrated script that works for his data only and general application software that works on any dataset. cdutil will work on 6hourly data for example. It will work on data starting in February. It will work on data with missing months, etc...

golaz commented 7 years ago

@doutriaux1, sure, everyone understands that there will be a performance penalty for generality, and generality is what makes cdutil in general and ANNUALCYCLE in particular so useful. Still, a factor 1500x in speed would suggest that there is substantial room for performance improvement without loss of generality.

Any progress with the fix?

doutriaux1 commented 7 years ago

@golaz I believe fix is in place. @dnadeau4 ?