eguil / Density_bining

Density bining code

Parallel version of key loop in binDensity.py #37

Closed - eguil closed this issue 7 years ago

eguil commented 9 years ago

@momipsl @durack1 Goal is to reduce elapsed time of "75% loop"

eguil commented 9 years ago

First attempt by Mark:

Eric,

OK, thanks for this information - so the data types are all standard, which simplifies things.

Regarding lonN*latN: in the code, the number of processes to spin off is determined by the list comprehension at line 1485: [i for i in range(lonN*latN) if nomask[i]]

Can you give me an estimate of the length of this array, as this gives us the number of processes to execute in parallel.

Regards

Mark

  1. I have committed the typo correction.
  2. Regarding the memory issue, I will need to sit down with you and understand in more detail the size and type of the data structures being copied to the forked processes. Then, instead of passing copies of these data structures to the forked processes, we will need to allow the forked processes to access shared memory - this can of course lead to synchronisation issues. (Answer: ok.)
  3. The data structures being passed to the forked processes are defined in lines 1487 - 1494. Can you please comment on the data types of the following: s_s, szm, zzm, z_s, c1m, c2m, N_s, valmask? I assume that they are numpy arrays of some form or another. (Answer: s_s, szm, zzm, z_s, c1m and c2m are all numpy float32 arrays. Their dimensions are horizontal x vertical-or-density, i.e. either lonN*latN x depthN (e.g. large x small, 30000 x 30) or lonN*latN x N_s (e.g. 30000 x 100).)

N_s is an int and valmask is a float.

Best, Eric

Simply clone the repo somewhere and run as normal, e.g.:

    cd YOUR_WORKING_DIRECTORY
    git clone https://github.com/momipsl/Density_bining.git density-bin-parallel
    cd density-bin-parallel

eguil commented 9 years ago

@momipsl lonN*latN is typically 30,000 to 100,000 (with about 30% of the points masked).
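For concreteness, a minimal sketch of the arrays and the index list discussed above (the sizes are illustrative and a random mask stands in for the real land/sea mask):

    import numpy as np

    # Illustrative sizes only (see the figures quoted above)
    lonN, latN, depthN, N_s = 300, 100, 30, 100      # lonN*latN = 30 000

    # Horizontal x vertical (or density) float32 arrays, as described above
    szm = np.zeros((lonN * latN, depthN), dtype=np.float32)   # e.g. density on model levels
    zzm = np.zeros((lonN * latN, depthN), dtype=np.float32)   # e.g. depth of model levels
    s_s = np.zeros((lonN * latN, N_s), dtype=np.float32)      # target density grid

    # Roughly 30% of the horizontal points are masked; only the rest need interpolating
    nomask = np.random.rand(lonN * latN) > 0.3

    # The list comprehension around line 1485: one task per unmasked point
    tasks = [i for i in range(lonN * latN) if nomask[i]]
    print(len(tasks))   # typically 20 000 - 70 000 on a real grid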

eguil commented 9 years ago

From Mark:

OK I have just committed an UNTESTED implementation of a multi-threaded depth interpolation. This implementation places the interpolations to be performed in a task queue. Task workers then perform the interpolations on separate threads.

At line 1472 the variable MAX_THREADS is declared; this is the maximum number of threads to be created. You may wish to experiment with this number, i.e. run a test, increment MAX_THREADS, and run the test again. At some point you will see that the job time reaches a plateau (and may even increase), as the cost of thread switching exceeds the savings from parallelising the interpolation.

Note that this implementation is untested, and furthermore it is predicated upon numpy supporting concurrent access to arrays. It is uncertain whether all input data types are indeed thread safe; if not, we will need to extend this implementation so that all variable access goes via explicit thread locks.
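For clarity, a minimal sketch of the task-queue / worker-thread pattern described above (not the committed code; interpolate_column is a stand-in for the real depth interpolation):

    import threading
    import numpy as np
    try:
        from Queue import Queue      # Python 2 (as used here)
    except ImportError:
        from queue import Queue      # Python 3

    MAX_THREADS = 8                  # tune as discussed above

    def interpolate_column(s_target, s_col, z_col):
        # Stand-in for the real depth interpolation of a single column
        return np.interp(s_target, s_col, z_col)

    def parallel_interpolate(indices, s_s, szm, zzm, z_s):
        """Interpolate every listed column using a pool of worker threads."""
        task_q = Queue(maxsize=0)

        def worker():
            while True:
                i = task_q.get()
                if i is None:        # sentinel: stop this worker
                    task_q.task_done()
                    break
                z_s[i, :] = interpolate_column(s_s[i, :], szm[i, :], zzm[i, :])
                task_q.task_done()

        threads = [threading.Thread(target=worker) for _ in range(MAX_THREADS)]
        for t in threads:
            t.daemon = True
            t.start()
        for i in indices:
            task_q.put(i)
        for _ in threads:
            task_q.put(None)         # one sentinel per worker
        task_q.join()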

eguil commented 9 years ago

OK, first tests: 1) missing args c1_s and c2_s in the call from the main code; 2) crashes with:

Traceback (most recent call last):
  File "/home/ericglod/Density_bining/density-bin-parallel/drive_ACCESS1-0.py", line 39, in <module>
    densityBin(modelThetao,modelSo,modelAreacello,outfileDensity,timeint='1,12')
  File "/home/ericglod/Density_bining/density-bin-parallel/binDensity.py", line 684, in densityBin
    valmask)
  File "/home/ericglod/Density_bining/density-bin-parallel/binDensity.py", line 1476, in  _parallellize_interpolate_depth
    task_q = Queue(maxsize=0)
NameError: global name 'Queue' is not defined
asladeofgreen commented 9 years ago

Just committed an update that correctly imports the python modules queue and threading.

asladeofgreen commented 9 years ago

Considering the size of lonN*latN (i.e. 30k - 100k), we will probably have to process these in batches using generators (a possible shape is sketched below). I will await the results of the tests before implementing this.
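One possible shape for such batching, sketched with a simple chunking generator (the batch size is illustrative):

    def batches(indices, batch_size=10000):
        """Yield successive slices of the task index list to bound memory use."""
        for start in range(0, len(indices), batch_size):
            yield indices[start:start + batch_size]

    # e.g. feed the interpolation one batch of columns at a time:
    # for batch in batches(tasks, batch_size=10000):
    #     parallel_interpolate(batch, s_s, szm, zzm, z_s)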

durack1 commented 9 years ago

It was committed to your eguil/Density_bining repo directly: https://github.com/eguil/Density_bining/blob/master/binDensity.py#L45-46

eguil commented 9 years ago

@momipsl

That would be great - thanks! The binDensityMP.py code is now running and producing correct results, but I don't see any speed-up yet. I am making further tests using more nodes/CPUs.

On 6/3/15 20:32, Paul J. Durack wrote:

@eguil https://github.com/eguil did you want me to do a revert here? I'll get binDensity.py back to pre-threads version and rename the threaded version binDensityMP.py - is that ok?

I'll then issue you a pull request so you can review and then merge changes back in..

eguil commented 9 years ago

Here are the results I have with the current threaded version. Either it is not doing what is expected, or I am not using it the right way:

 standard code: CPU use, elapsed 328.99 632.229090929
 threaded code (max_threads=5 , 1 node  1 CPU): CPU use, elapsed  300.6  672.0
 threaded code (max_threads=8 , 1 node  1 CPU): CPU use, elapsed  292.3  608.5
 threaded code (max_threads=16, 1 node  1 CPU): CPU use, elapsed  329.3  855.5
 threaded code (max_threads=25, 1 node  1 CPU): CPU use, elapsed  368.4  618.7
 threaded code (max_threads=8 , 1 node  8 CPU): CPU use, elapsed  739.8  851.1
 threaded code (max_threads=8 , 2 nodes 8 CPU): CPU use, elapsed 1154.0 1242.9
durack1 commented 9 years ago

@eguil this should sort things out https://github.com/eguil/Density_bining/pull/38

asladeofgreen commented 9 years ago

Upon conducting a review, it is clear that the current implementation of parallelisation leveraging python's threading module will not result in any performance gains. This is due to the python GIL, which blocks CPU-intensive threads spawned from the same process. This technique would work fine if the work to be performed were I/O bound rather than CPU bound.

Therefore the implementation must revert to leveraging python's multiprocessing module. This complicates the code, as the arrays being processed will need to be copied to/from shared-memory data structures. I will endeavour to commit an initial implementation of this in the next day or so.

eguil commented 9 years ago

Thanks Mark. This would explain the results I summarised above. Can you please develop in binDensityMP.py from now on?

asladeofgreen commented 9 years ago

New implementation of binDensityMP.py. This uses the python multiprocessing library to parallelize the interpolations. The arrays (c1m, c2m, s_s, szm, zzm) are placed in shared memory so that they do not need to be copied to the spawned processes. The number of spawned processes = the number of available cores - 1. Python generators are used as a further memory-consumption mitigation strategy.

Note - this is UNTESTED !
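In outline, placing a numpy array in shared memory for a multiprocessing pool looks roughly like this (a sketch of the approach, not the committed binDensityMP.py code; the worker task and helper names are placeholders):

    import multiprocessing as mp
    import numpy as np

    _shared = {}                                   # per-process numpy views of shared buffers

    def _to_shared(arr):
        """Copy a float32 array into shared memory; return the buffer and its shape."""
        buf = mp.RawArray('f', arr.size)           # 'f' = C float = float32
        view = np.frombuffer(buf, dtype=np.float32).reshape(arr.shape)
        view[:] = arr                              # single copy into shared memory
        return buf, arr.shape

    def _init(szm_buf, szm_shape):
        # Runs once in each spawned process: wrap the shared buffer, no further copying
        _shared['szm'] = np.frombuffer(szm_buf, dtype=np.float32).reshape(szm_shape)

    def _work(i):
        # Placeholder task acting on column i of the shared array
        return float(_shared['szm'][i, :].sum())

    if __name__ == '__main__':
        szm = np.random.rand(1000, 30).astype(np.float32)
        buf, shape = _to_shared(szm)
        n_proc = max(1, mp.cpu_count() - 1)        # available cores - 1, as above
        pool = mp.Pool(processes=n_proc, initializer=_init, initargs=(buf, shape))
        results = pool.map(_work, range(shape[0]))
        pool.close()
        pool.join()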

eguil commented 9 years ago

Hi Mark, here is the result from the first test:

Traceback (most recent call last):
  File "/home/ericglod/Density_bining/drive_CSIRO_piC.py", line 34, in <module>
    densityBin(modelThetao,modelSo,modelAreacello,outfileDensity,timeint='1,12')
  File "/home/ericglod/Density_bining/binDensityMP.py", line 687, in densityBin
    valmask)
  File "/home/ericglod/Density_bining/binDensityMP.py", line 1549, in _parallellize_depth_interpolation
    ss = np.ctypeslib.as_ctypes(s_s)
  File "/home/ericglod/PCMDI_Metrics/PCMDI_METRICS/lib/python2.7/site-packages/numpy/ctypeslib.py", line 415, in as_ctypes
    raise TypeError("strided arrays not supported")
TypeError: strided arrays not supported

asladeofgreen commented 9 years ago

Just committed a patch to resolve this. It seems that the input array s_s is strided, i.e. the result of a transposition. This causes np.ctypeslib.as_ctypes(s_s) to fail. The workaround is to use a copy of the array. If the other input arrays szm and zzm are also strided, then they will also have to be copied.
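The workaround amounts to something like the following sketch: take a C-contiguous copy of the strided array before handing it to ctypes (array sizes here are illustrative):

    import numpy as np

    # s_s arrives as a transposed (hence strided, non C-contiguous) array
    s_s = np.zeros((30, 30000), dtype=np.float32).T

    # np.ctypeslib.as_ctypes(s_s) raises TypeError("strided arrays not supported"),
    # so take a contiguous copy first (at the cost of one extra copy of the array)
    s_s_contig = np.ascontiguousarray(s_s)
    ss = np.ctypeslib.as_ctypes(s_s_contig)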

eguil commented 9 years ago

Actually, s_s (the target density grid) is a lonN*latN replicate of the same 1D vertical vector, if that can help simplify things. I made it like this so that we could do a matrix computation.

eguil commented 9 years ago

Result of the patch:

=>> PBS: job killed: vmem 24252403712 exceeded limit 12884901888

asladeofgreen commented 9 years ago

It would be helpful if you could get vmdm dumps at lines 1544 and 1563. This will tell us whether copying the arrays c1m, c2m, s_s, szm & zzm to shared memory is inflating vmdm usage. If not then it must be in the spawned processes.

eguil commented 9 years ago

Hi Mark, what is a "vmdm" dump and how can I get it?

asladeofgreen commented 9 years ago

Sorry, this is a typo. It should have read vmem, i.e. virtual memory dump. I have just committed a change that outputs memory usage to stdout using the python resource module.

PS - a good article on the subject of debugging python program memory issues: http://chase-seibert.github.io/blog/2013/08/03/diagnosing-memory-leaks-python.html
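The logging is along these lines (a sketch using the standard resource module; the exact messages in the committed code may differ):

    import resource

    def log_memory(label):
        """Print peak resident memory so far (kb on Linux; ru_maxrss units vary by OS)."""
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print('binDensityMP.INFO :: depth interpolation :: memory usage :: %s :: %s (kb)' % (label, peak))

    # e.g.
    # log_memory('initial')
    # ...copy the arrays to shared memory...
    # log_memory('array conversions step 1 - completed')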

eguil commented 9 years ago

Here is the output:

Total volume in z coordinates source grid (ref = 1.33 e+18) : 1.31679820735e+18
Mean Temp./Salinity in z coordinates source grid : 3.32230840011 34.7277874016
binDensityMP.INFO :: depth interpolation :: memory usage :: initial :: 2669204 (kb)
binDensityMP.INFO :: depth interpolation :: memory usage :: array conversions step 1 - completed :: 2669204 (kb)
binDensityMP.INFO :: depth interpolation :: memory usage :: array conversions step 2 - completed :: 2669204 (kb)
binDensityMP.INFO :: depth interpolation :: memory usage :: completed :: 2669204 (kb)
(the same four memory-usage lines repeat for each subsequent call, always at 2669204 kb, until the job is killed)
=>> PBS: job killed: vmem 24195518464 exceeded limit 12884901888
Epilogue Args:
Job ID: 278207.cicladpbs6.private.ipsl.fr
User ID: ericglod
Group ID: ipsl
Job Name: script_qsub_parallel
Session ID: 15983
Resource List: mem=10gb,neednodes=1:ppn=1,nodes=1:ppn=1,vmem=12gb,walltime=02:00:00
Resources Used: cput=00:03:44,mem=18854296kb,vmem=23631316kb,walltime=00:09:51
Queue Name: short
Running Host: ciclad5.private.ipsl.fr

asladeofgreen commented 9 years ago

How many cores did this job have access to?

eguil commented 9 years ago

One for now - I am first testing with one core / one CPU before increasing them.

asladeofgreen commented 9 years ago

This is designed to run on multiple cores. If there is only one core then you will see no benefits as the value of MAX_PROCESSES in line 1581 will be set to 1. In this scenario the job will only be able to spawn a single sub-process which defeats the purpose of the exercise.

I have updated the code to run sequentially if the number of cores is <= 2 (this leaves one core for system tasks), so you need at least 3 cores before the depth interpolations are run in parallel; the decision is roughly as sketched below.
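A sketch of that check (the sequential/parallel branches stand in for the real code paths):

    import multiprocessing

    num_cores = multiprocessing.cpu_count()
    if num_cores <= 2:
        # too few cores: leave one for system tasks and fall back to the sequential loop
        max_processes = 1
    else:
        max_processes = num_cores - 1
    print('depth interpolation will use %d process(es)' % max_processes)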

eguil commented 9 years ago

ok - I will try again

eguil commented 9 years ago

Here is the output with 3 cores:

binDensityMP.INFO :: depth interpolation :: EXECUTING IN PARALLEL
binDensityMP.INFO :: depth interpolation :: memory usage :: initial :: 2674972 (kb)
binDensityMP.INFO :: depth interpolation :: memory usage :: array conversions step 1 - completed :: 2674972 (kb)
binDensityMP.INFO :: depth interpolation :: memory usage :: array conversions step 2 - completed :: 2674972 (kb)
binDensityMP.INFO :: depth interpolation :: process pool created: max-processes = 7
binDensityMP.INFO :: depth interpolation :: memory usage :: completed :: 2674972 (kb)
binDensityMP.INFO :: depth interpolation :: EXECUTING IN PARALLEL
binDensityMP.INFO :: depth interpolation :: memory usage :: initial :: 2674972 (kb)
binDensityMP.INFO :: depth interpolation :: memory usage :: array conversions step 1 - completed :: 2674972 (kb)
binDensityMP.INFO :: depth interpolation :: memory usage :: array conversions step 2 - completed :: 2674972 (kb)
=>> PBS: job killed: vmem 24101470208 exceeded limit 12884901888

durack1 commented 7 years ago

This may be relevant in #33