pymp

This package brings OpenMP-like functionality to Python. It takes the good qualities of OpenMP such as minimal code changes and high efficiency and combines them with the Python Zen of code clarity and ease-of-use.

Usage

For-loops, such as:

from __future__ import print_function

ex_array = np.zeros((100,), dtype='uint8')
for index in range(0, 100):
    ex_array[index] = 1
    print('Yay! {} done!'.format(index))

become:

from __future__ import print_function

import pymp
ex_array = pymp.shared.array((100,), dtype='uint8')
with pymp.Parallel(4) as p:
    for index in p.range(0, 100):
        ex_array[index] = 1
        # The parallel print function takes care of asynchronous output.
        p.print('Yay! {} done!'.format(index))

The GIL (global interpreter lock) is circumvented by using the operating system's fork method. Due to the copy-on-write strategy, this causes only a minimal overhead and results in the expected semantics. On the other hand, the package will only work on systems with fork support (sorry, not on Windows).

Installation

The package is available from pypi. Due to a name clash, it is available as pymp-pypi:

pip install pymp-pypi

To get the cutting edge version from github, do:

git clone https://github.com/classner/pymp.git
cd pymp
python setup.py develop

In theory, Python 3.2 and Python 3.3 should be supported as well. For some reason, the travis-ci builds hangs AFTER completing the test runs. If someone confirms successful testing on these versions, please let me know so that I include them in the compatibility list.

Features

Environment variables and configuration

The module is configurable by environment variables as well as at runtime. It respects the environment variables:

PYMP_NESTED / OMP_NESTED: 'TRUE' or 'FALSE' (default: 'FALSE'),
PYMP_THREAD_LIMIT / OMP_THREAD_LIMIT: int > 0 (default: unset),
PYMP_NUM_THREADS / OMP_NUM_THREADS: comma-separated list of int > 0, the number of threads to use per nesting level. If only one value is provided, it is used for all levels. Default: number of cores.

The PYMP variables are used with preference. At runtime, the configuration values can be set at any time by using: pymp.config.nested, pymp.config.thread_limit and pymp.config.num_threads.

OpenMP variables

Every parallel context provides its number of threads and the current thread's thread_num in the same way OpenMP does:

with pymp.Parallel(4) as p:
    p.print(p.num_threads, p.thread_num)

The original thread entering the parallel context always has thread_num 0.

Schedules

The basic OpenMP scheduling types map directly to the classical Python ranges: using pymp.range corresponds to the static schedule by returning a complete list of indices, while pymp.xrange returns an iterator and corresponds to dynamic scheduling.

You can use p.iterate to iterate over arbitrary list elements. However, bearing efficiency in mind you should create complex or large objects before the parallel section. Otherwise, they have to be serialized and forwarded through the iterator to the consuming process.

Variable scopes

The only implemented variable scopes are firstprivate, shared and private. All variables that are declared before the pymp.Parallel call are implicitly firstprivate, all variables from the pymp.shared module are shared, and all variables created within a pymp.Parallel context are private.

The package pymp.shared provides a numpy array wrapper accepting the standard datatype strings, as well as shared list, dict, queue, lock and rlock objects wrapped from multiprocessing. High performance shared memory (ctypes) datastructues are array, lock and rlock, the other datastructures are synchronized via a multiprocessing.Manager and hence a little slower.

All datastructures must be synchronized manually, if required, by using a lock. The parallel context offers one for your convenience:

ex_array = pymp.shared.array((1,), dtype='uint8')
with pymp.Parallel(4) as p:
    for index in p.range(0, 100):
        with p.lock:
            ex_array[0] += 1

Nested loops

When pymp.config.nested is True, it is possible to nest parallel contexts with the expected semantics:

with pymp.Parallel(2) as p1:
    with pymp.Parallel(2) as p2:
        p.print(p1.thread_num, p2.thread_num)

Parallel sections

There is no special context for parallel sections. Please use a pymp.range or pymp.xrange and if-else to achieve the expected behavior:

with pymp.Parallel(4) as p:
    for sec_idx in p.xrange(4):
        if sec_idx == 0:
            p.print('Section 0')
        elif sec_idx == 1:
            p.print('Section 1')
        ...

Exceptions

Exceptions will be raised in the main program. However, there can be as many fatal Exceptions as sub-processes at the end of a parallel context. They are logged by the logger as critical, so you can always redirect their output.

All exceptions will be re-raised in the main program at the end of the parallel section with their proper exception type and error message. It is unavoidable that their stack-traces are lost, unfortunately. For easy debugging, use the pymp.Parallel(..., if_=False) flag to temporarily disable parallelism.

Conditional parallelism

As mentioned in the preceding paragraph, parallel execution can be disabled regardless of other settings by passing if_=False to the parallel region constructor.

Reductions

There is on purpose no method for reductions implemented for four reasons:

due to the higher level of the Python language compared to C++, it is very easy to create a shared list and do the reduction after the loop, which,
corresponds more to the explicit is better than implicit Zen of Python and
can be realized cleanly with the language means that are available, while it is
perfectly deterministic.

The last point is not necessarily true for OpenMP reductions.

Iterables

Additional to these more traditional OpenMP functionalities, pymp provides a more Pythonic way of parallelization: parallel iterators. This is a powerful paradigm, since iterators can be stacked. pymp uses a producer- consumer pattern with the main thread as producer always and the rest of the threads as consumers of the iterable. This is as easy as

with pymp.Parallel(4) as p:
    for iter_item in p.iterate(xrange(4)):
        p.print(iter_item)

The iteration items must be picklable to be transferred through a queue.

How does it work?

When entering a parallel context, processes are forked as necessary. That means that child processes are started, which are in (nearly) exactly the same state as the creating process. The memory is not copied, but referenced. Only when a process writes into a part of the memory it gets its own copy of the corresponding memory region. This keeps the processing overhead low (but of course not as low as for OpenMP threads).

Once the parallel region is left, child processes exit and only the original process 'survives'. The 'shared' datastructures from the corresponding submodule are synchronized either via shared memory or using a manager process and the pickle protocol (see the documentation of the multiprocessing module for more information).

classner / pymp

readme