eguil / Density_bining

Density binning code

Performance of binDensity.py #33

Open · eguil opened this issue 9 years ago

eguil commented 9 years ago

Work to reduce the CPU cost of binDensity. Case study: IPSL-CM5A-LR, 12 months.

Binning CPU analysis (tc/tcdel = 0 12):

average cpu1  = 0.06
average cpu2  = 0.555
average cpu3  = 0.05
average cpu4  = 5.4725  <---- see below
average cpu40 = 0.09
average cpu5  = 0.204166666667

In the binning section, the cost is in the interpolation:

        tcpu3 = timc.clock()
        for i in range(lonN*latN):
            if nomask[i]:
                z_s [0:N_s,i] = npy.interp(s_s[:,i], szm[:,i], zzm[:,i], right = valmask) ; # depth - consider spline
                c1_s[0:N_s,i] = npy.interp(z_s[0:N_s,i], zzm[:,i], c1m[:,i], right = valmask) ; # thetao
                c2_s[0:N_s,i] = npy.interp(z_s[0:N_s,i], zzm[:,i], c2m[:,i], right = valmask) ; # so
        # if level in s_s has lower density than surface, isopycnal is put at surface (z_s = 0)
        tcpu40 = timc.clock()
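
As a side note, the edge handling the comment above refers to comes from npy.interp's fill behaviour: target densities lighter than the surface value get the first depth (the left default, i.e. the surface), while targets denser than the bottom get right = valmask. A toy example, not from the code:

import numpy as npy

valmask = 1.e20
szm = npy.array([24.0, 26.0, 28.0])    # density profile (increasing)
zzm = npy.array([0.0, 500.0, 4000.0])  # corresponding depths
s_s = npy.array([23.0, 27.0, 29.0])    # target densities
print(npy.interp(s_s, szm, zzm, right = valmask))
# [0., 2250., 1.e+20] : surface, interpolated, masked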

Overall cost:

CPU of chunk inits         = 9.93
CPU of density bining      = 77.18
CPU of masking and var def = 12.06
CPU of annual mean compute = 11.74
CPU of interpolation       = 4.6
CPU of zonal mean          = 8.15
CPU of persistence compute = 44.51
CPU of chunk               = 168.48
Max memory use 3.414428 GB
[ Time stamp 21/01/2015 00:44:16 ]
Max memory use 3.414428 GB
Ratio to grid*nyears 4.06161364074 kB/unit(size*nyears)
CPU use, elapsed 181.36 241.048659086
Ratio to grid*nyears 17.9779807405 1.e-6 sec/unit(size*nyears)

eguil commented 9 years ago

In Persistence, the CPU cost is in this loop:

            # TODO: can we remove the loop ?
            #print '9a' # Warning: converting a masked element to nan               
            tpe1 = timc.clock()
            for i in range(latN*lonN): 
                ptopdepth[i]    = depth_bin[t,p_top[i],i]
                ptoptemp[i]     = x1_bin[t,p_top[i],i]
                ptopsalt[i]     = x2_bin[t,p_top[i],i]
            #print '9a1'                
            tpe2 = timc.clock()
eguil commented 9 years ago

Could these help?

durack1 commented 9 years ago

@eguil good catch.. I've found a heap of memory issues using some of the scipy interpolate functions in times gone by.. I do think it would be great to really cast the net wide when we're rewriting this stuff.. The more performance we can squeeze out of this the better!

eguil commented 9 years ago

CPU optimisation in the persistence loop, plus memory work to remove depth_bino, depth_bin et al. Persistence cost divided by 4 by removing the loop.
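
For the record, the loop removal amounts to numpy advanced indexing along the points axis; a minimal self-contained sketch (toy shapes standing in for the real [time, level, lat*lon] arrays):

import numpy as npy

timN, levN, ptN = 2, 5, 100
depth_bin = npy.random.rand(timN, levN, ptN)
x1_bin    = npy.random.rand(timN, levN, ptN)
x2_bin    = npy.random.rand(timN, levN, ptN)
p_top     = npy.random.randint(0, levN, ptN)  # top isopycnal level index per point
t = 0

# advanced indexing pairs each point with its own p_top level in one call,
# replacing the python loop over latN*lonN points
pts       = npy.arange(ptN)
ptopdepth = depth_bin[t, p_top, pts]
ptoptemp  = x1_bin[t, p_top, pts]
ptopsalt  = x2_bin[t, p_top, pts]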

New perf:

CPU of chunk inits         = 13.88
CPU of density bining      = 74.06
CPU of masking and var def = 5.88
CPU of annual mean compute = 17.11
CPU of interpolation       = 4.56
CPU of zonal mean          = 8.6
CPU of persistence compute = 9.93
CPU of chunk               = 134.34
Max memory use 2.690352 GB
CPU of inits = 16.5
CPU inits detail = 0.04 0.06 1.32 0.02 12.41 2.65
[ Time stamp 05/02/2015 01:29:06 ]
Max memory use 2.690352 GB
Ratio to grid*nyears 3.20029310374 kB/unit(size*nyears)
CPU use, elapsed 150.84 151.146143913
Ratio to grid*nyears 14.9525728655 1.e-6 sec/unit(size*nyears)

eguil commented 9 years ago

The last loop (binning) requires ESMF exploration.

durack1 commented 9 years ago

Another potential option here is cython: http://technicaldiscovery.blogspot.com/2011/06/speeding-up-python-numpy-cython-and.html http://docs.cython.org/

I'm yet to use this effectively (and the cdms2 declarations might complicate things) but the "free" speedups certainly would be useful..

durack1 commented 9 years ago

And another option: MPI: https://ice.txcorp.com/trac/modave/wiki/parallel

eguil commented 9 years ago

Article sent by Mark Greenslade: http://sebastianraschka.com/Articles/2014_multiprocessing_intro.html#Multi-Threading-vs.-Multi-Processing

durack1 commented 9 years ago

@eguil how is this going? Did we need to squeeze out some more here, or have you managed to get things really lean..?

Sorry for the lack of interaction, we had a computer die over here, just trying to get everything back up..

eguil commented 9 years ago

@durack1 I was on holidays last week - back to this now. I think at this stage parallel/threading is the next step. Any chance we could have help from Charles? I will get the code ready (still a bug in the integrals) in the next few days and we can rerun it on the historical.

eguil commented 9 years ago

Current performance for 156 years of IPSL-CM5A-LR, about 100 sec per chunk:

CPU of chunk inits         = 4.18
CPU of density bining      = 74.48
CPU of masking and var def = 1.7
CPU of annual mean compute = 2.98
CPU of interpolation       = 5.95
CPU of zonal mean          = 3.08
CPU of persistence compute = 7.18
CPU of chunk               = 100.04

For the whole run:

Max memory use 5.685724 GB
Ratio to grid*nyears 0.0433552630415 kB/unit(size*nyears)
CPU use, elapsed 15829.5 18367.3725309
Ratio to grid*nyears 10.0587034052 1.e-6 sec/unit(size*nyears)

Current performance for 156 years of MPI-ESM-LR:

CPU of chunk inits         = 17.28
CPU of density bining      = 165.64
CPU of masking and var def = 10.71
CPU of annual mean compute = 6.39
CPU of interpolation       = 6.28
CPU of zonal mean          = 3.05
CPU of persistence compute = 10.77
CPU of chunk               = 220.57
Max memory use 9.284136 GB
Ratio to grid*nyears 0.026417654611 kB/unit(size*nyears)
CPU use, elapsed 35370.42 35923.2771571
Ratio to grid*nyears 8.38709833461 1.e-6 sec/unit(size*nyears)

durack1 commented 9 years ago

@eguil ok so time is down but memory is up - I assume that running with MIROC4h is going to blow things up again.. Have you tried?

We've had the same system die again, so I might be a little slow to start up on this..

eguil commented 9 years ago

@durack1 MIROC4h test for 12 months:

CPU of chunk inits         = 1063.9
CPU of density bining      = 3996.23
CPU of masking and var def = 399.49
CPU of annual mean compute = 185.79
CPU of interpolation       = 13.53
CPU of zonal mean          = 6.88
CPU of persistence compute = 235.55
CPU of chunk               = 5901.69
Max memory use 80.588012 GB
Ratio to grid*nyears 1.43821693108 kB/unit(size*nyears)
CPU use, elapsed 5928.43 6056.77837396
Ratio to grid*nyears 8.81682873702 1.e-6 sec/unit(size*nyears)

durack1 commented 9 years ago

@eguil great, so memory usage is down too, meaning we should be able to process all models and simulations in one run.. That 80GB should really be something we can halve though.. The MPI-ESM-MR is the higher res (L=low, M=medium)

durack1 commented 9 years ago

@eguil you most likely need to initialize/load the queue object from the multiprocessing module:

https://github.com/eguil/Density_bining/blob/master/binDensity.py#L40-41 vs https://github.com/momipsl/Density_bining/blob/master/binDensity.py#L39-42

import numpy as npy
import multiprocessing as mp
from multiprocessing import Queue  # not "from mp import Queue": mp is an alias, not a module name
from string import replace
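
A minimal sketch of how the Queue could then be used to collect results from worker processes (the worker and chunking here are hypothetical placeholders, not the binDensity.py code):

import multiprocessing as mp

def worker(q, chunk_id):
    # placeholder: a real worker would bin one chunk of the domain here
    q.put((chunk_id, 'result for chunk %d' % chunk_id))

if __name__ == '__main__':
    q = mp.Queue()
    jobs = [mp.Process(target=worker, args=(q, c)) for c in range(4)]
    for j in jobs:
        j.start()
    results = dict(q.get() for _ in jobs)  # drain the queue before join()
    for j in jobs:
        j.join()
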
durack1 commented 7 years ago

Also see #36 (closed)

eguil commented 5 years ago

Solution proposed by Nicolas (@lebasn): run the time chunks in parallel.
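
A minimal sketch of that idea with multiprocessing.Pool, assuming each time chunk can be binned independently (bin_time_chunk is a hypothetical stand-in for the real binning entry point):

import multiprocessing as mp

def bin_time_chunk(bounds):
    # placeholder: run the density binning on years [t0, t1) here
    t0, t1 = bounds
    return (t0, t1)

if __name__ == '__main__':
    nyears, step = 156, 12  # e.g. 12-year chunks of a 156-year run
    bounds = [(t0, min(t0 + step, nyears)) for t0 in range(0, nyears, step)]
    pool = mp.Pool(processes=4)
    results = pool.map(bin_time_chunk, bounds)  # one chunk per task
    pool.close()
    pool.join()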