Open hagenw opened 2 months ago
I sometimes run into problems with multi-processing, e.g. an older version of opensmile did not support it, I think.
Yes, I also remember that multiprocessing=False seemed to be the safer choice, and in audb it does provide the expected speed-up when downloading files. But I wonder if this might be different when executing the process function in audinterface.
I think "heavy processing" is always relative, but in any case the overhead might still account for most of the computing time.
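As a quick illustration of how much dispatch overhead can matter for cheap per-item work, here is a generic standard-library sketch (not using audinterface; the trivial function `tiny()` is made up for this example):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def tiny(x):
    # Trivial per-item work: any measured time is mostly dispatch overhead.
    return x + 1


data = list(range(10_000))

t0 = time.time()
direct = [tiny(x) for x in data]
t_direct = time.time() - t0

t0 = time.time()
with ThreadPoolExecutor(max_workers=5) as pool:
    pooled = list(pool.map(tiny, data))
t_pool = time.time() - t0

# The pooled version is usually much slower here, although the "work"
# is identical: we mostly pay pool and queue overhead.
print(f"direct: {t_direct:.3f} s, thread pool: {t_pool:.3f} s")
```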
Measuring time spent in the processing function:
```python
import time

import audb
import audinterface
import audmath


def process_func(signal, sampling_rate):
    # Accumulate the time spent in the actual feature computation.
    global tsum
    tx = time.time()
    res = audmath.db(audmath.rms(signal))
    tsum += time.time() - tx
    return res


db = audb.load("emodb", version="1.4.1")
for multiprocessing in [False, True]:
    for num_workers in [1, 5]:
        interface = audinterface.Feature(
            ["rms"],
            process_func=process_func,
            num_workers=num_workers,
            multiprocessing=multiprocessing,
        )
        tsum = 0.0
        t0 = time.time()
        df = interface.process_index(db.files)
        t = time.time() - t0
        print(
            f"{multiprocessing=}, {num_workers=}: {t:.2f} s, "
            f"processing time: {tsum:.2f} s"
        )
```
```
multiprocessing=False, num_workers=1: 0.87 s, processing time: 0.06 s
multiprocessing=False, num_workers=5: 0.47 s, processing time: 0.60 s
multiprocessing=True, num_workers=1: 0.40 s, processing time: 0.05 s
multiprocessing=True, num_workers=5: 0.39 s, processing time: 0.00 s
```
The figure for the last row is of course not correct with this method (presumably tsum is incremented in the worker processes and never reaches the main process), but the single-worker outputs show that only a small part of the execution time is spent in process_func, so the differences might be mainly due to overhead.
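One way to get valid per-call numbers also in the multiprocessing case would be to let the processing function return its own runtime along with the result, so the timings travel back with the return values instead of relying on a global. A hypothetical sketch (the `timed` decorator and the toy processing function are made up for illustration, not part of audinterface):

```python
import time


def timed(func):
    # Return (result, elapsed) from each call; the elapsed times travel
    # back with the results and therefore survive worker processes.
    def wrapper(*args, **kwargs):
        t0 = time.time()
        result = func(*args, **kwargs)
        return result, time.time() - t0
    return wrapper


@timed
def toy_process_func(signal):
    return sum(x * x for x in signal)


pairs = [toy_process_func([1.0, 2.0, 3.0]) for _ in range(4)]
results = [res for res, _ in pairs]
processing_time = sum(elapsed for _, elapsed in pairs)
print(f"results: {results}, processing time: {processing_time:.4f} s")
```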
I repeated the measurement with opensmile:
```python
import time

import audb
import opensmile


db = audb.load("emodb", version="1.4.1")
for multiprocessing in [False, True]:
    for num_workers in [1, 5]:
        interface = opensmile.Smile(
            num_workers=num_workers,
            multiprocessing=multiprocessing,
        )
        t0 = time.time()
        df = interface.process_index(db.files)
        t = time.time() - t0
        print(f"{multiprocessing=}, {num_workers=}: {t:.2f} s")
```
and there it does not make a difference whether we use multi-processing or not:
```
multiprocessing=False, num_workers=1: 20.27 s
multiprocessing=False, num_workers=5: 6.29 s
multiprocessing=True, num_workers=1: 20.32 s
multiprocessing=True, num_workers=5: 6.54 s
```
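My guess (an assumption, not verified against the openSMILE sources) is that opensmile scales equally well either way because its feature extraction runs in compiled code that releases the GIL, so multithreading can already keep all workers busy. NumPy behaves similarly in many of its compiled routines; a small thread-pool sketch with a made-up RMS-style extractor:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def extract(signal):
    # Stand-in for a C-backed extractor; NumPy's compiled routines
    # release the GIL, so threads can run them in parallel.
    return float(np.sqrt(np.mean(np.square(signal))))


# Five constant "signals" with values 0..4; the RMS of each is its value.
signals = [np.full(1000, float(i)) for i in range(5)]
with ThreadPoolExecutor(max_workers=5) as pool:
    features = list(pool.map(extract, signals))
print(features)
```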
But when testing with another feature extractor:
```python
import time

import audb
import audmld


db = audb.load("emodb", version="1.4.1")
for multiprocessing in [False, True]:
    for num_workers in [1, 5]:
        interface = audmld.Mld(
            num_workers=num_workers,
            multiprocessing=multiprocessing,
        )
        t0 = time.time()
        df = interface.process_index(db.files)
        t = time.time() - t0
        print(f"{multiprocessing=}, {num_workers=}: {t:.2f} s")
```
there is indeed a difference:
```
multiprocessing=False, num_workers=1: 118.00 s
multiprocessing=False, num_workers=5: 189.54 s
multiprocessing=True, num_workers=1: 106.39 s
multiprocessing=True, num_workers=5: 46.43 s
```
So I guess this indicates that we made some (wrong?) choice in its implementation, with the result that it only scales with multiprocessing?
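A possible (assumed) explanation for this pattern: if Mld spends its time in pure-Python code, the GIL prevents threads from running it in parallel, and with several threads the runtime can even degrade due to contention, while separate processes each get their own interpreter. A generic sketch of such a GIL-bound workload:

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def busy(n):
    # Pure-Python loop: holds the GIL for its entire runtime.
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    jobs = [200_000] * 5
    for name, pool_cls in [
        ("threads", ThreadPoolExecutor),
        ("processes", ProcessPoolExecutor),
    ]:
        t0 = time.time()
        with pool_cls(max_workers=5) as pool:
            results = list(pool.map(busy, jobs))
        # Threads give roughly serial runtime (or worse); processes
        # can actually run the five jobs in parallel.
        print(f"{name}: {time.time() - t0:.2f} s")
```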
At the moment we have multiprocessing=False as the default, but I wonder what was/is the reasoning behind it. When browsing the web, I can find the following statement:
When doing a simple test:
it returns (after running the second time)
Even though we don't do heavy processing here, multi-processing seems to be faster in this case. Is this expected?
/cc @ureichel, @ChristianGeng, @frankenjoe, @maxschmitt, @audeerington, @schruefer