Open hagenw opened 2 months ago
I sometimes run into problems with multi-processing, e.g. an older version of opensmile did not support it, I think.
Yes, I also remember that multiprocessing=False seemed to be the safer choice, and in audb it does provide the expected speed-up when downloading files. But I wonder if this might be different when executing the process function in audinterface.
I think "heavy processing" is always relative, but in any case the overhead might still account for most of the computing time.
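As a quick illustration of how much dispatch overhead can matter for cheap per-item work, here is a generic standard-library sketch (not using audinterface; the trivial function `tiny()` is made up for this example):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def tiny(x):
    # Trivial per-item work: any measured time is mostly dispatch overhead.
    return x + 1


data = list(range(10_000))

t0 = time.time()
direct = [tiny(x) for x in data]
t_direct = time.time() - t0

t0 = time.time()
with ThreadPoolExecutor(max_workers=5) as pool:
    pooled = list(pool.map(tiny, data))
t_pool = time.time() - t0

# The pooled version is usually much slower here, although the "work"
# is identical: we mostly pay pool and queue overhead.
print(f"direct: {t_direct:.3f} s, thread pool: {t_pool:.3f} s")
```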
Measuring time spent in the processing function:
```python
import time

import audb
import audinterface
import audmath


def process_func(signal, sampling_rate):
    # Accumulate the time spent in the actual feature computation.
    global tsum
    tx = time.time()
    res = audmath.db(audmath.rms(signal))
    tsum += time.time() - tx
    return res


db = audb.load("emodb", version="1.4.1")
for multiprocessing in [False, True]:
    for num_workers in [1, 5]:
        interface = audinterface.Feature(
            ["rms"],
            process_func=process_func,
            num_workers=num_workers,
            multiprocessing=multiprocessing,
        )
        tsum = 0.0
        t0 = time.time()
        df = interface.process_index(db.files)
        t = time.time() - t0
        print(
            f"{multiprocessing=}, {num_workers=}: {t:.2f} s, "
            f"processing time: {tsum:.2f} s"
        )
```
```
multiprocessing=False, num_workers=1: 0.87 s, processing time: 0.06 s
multiprocessing=False, num_workers=5: 0.47 s, processing time: 0.60 s
multiprocessing=True, num_workers=1: 0.40 s, processing time: 0.05 s
multiprocessing=True, num_workers=5: 0.39 s, processing time: 0.00 s
```
The figure for the last row is of course not correct with this method (presumably tsum is incremented in the worker processes and never reaches the main process), but the single-worker outputs show that only a small part of the execution time is spent in process_func, so the differences might be mainly due to overhead.
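One way to get valid per-call numbers also in the multiprocessing case would be to let the processing function return its own runtime along with the result, so the timings travel back with the return values instead of relying on a global. A hypothetical sketch (the `timed` decorator and the toy processing function are made up for illustration, not part of audinterface):

```python
import time


def timed(func):
    # Return (result, elapsed) from each call; the elapsed times travel
    # back with the results and therefore survive worker processes.
    def wrapper(*args, **kwargs):
        t0 = time.time()
        result = func(*args, **kwargs)
        return result, time.time() - t0
    return wrapper


@timed
def toy_process_func(signal):
    return sum(x * x for x in signal)


pairs = [toy_process_func([1.0, 2.0, 3.0]) for _ in range(4)]
results = [res for res, _ in pairs]
processing_time = sum(elapsed for _, elapsed in pairs)
print(f"results: {results}, processing time: {processing_time:.4f} s")
```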
I repeated the measurement with opensmile:
```python
import time

import audb
import opensmile


db = audb.load("emodb", version="1.4.1")
for multiprocessing in [False, True]:
    for num_workers in [1, 5]:
        interface = opensmile.Smile(
            num_workers=num_workers,
            multiprocessing=multiprocessing,
        )
        t0 = time.time()
        df = interface.process_index(db.files)
        t = time.time() - t0
        print(f"{multiprocessing=}, {num_workers=}: {t:.2f} s")
```
and there it does not make a difference whether we use multi-processing or not:
```
multiprocessing=False, num_workers=1: 20.27 s
multiprocessing=False, num_workers=5: 6.29 s
multiprocessing=True, num_workers=1: 20.32 s
multiprocessing=True, num_workers=5: 6.54 s
```
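My guess (an assumption, not verified against the openSMILE sources) is that opensmile scales equally well either way because its feature extraction runs in compiled code that releases the GIL, so multithreading can already keep all workers busy. NumPy behaves similarly in many of its compiled routines; a small thread-pool sketch with a made-up RMS-style extractor:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def extract(signal):
    # Stand-in for a C-backed extractor; NumPy's compiled routines
    # release the GIL, so threads can run them in parallel.
    return float(np.sqrt(np.mean(np.square(signal))))


# Five constant "signals" with values 0..4; the RMS of each is its value.
signals = [np.full(1000, float(i)) for i in range(5)]
with ThreadPoolExecutor(max_workers=5) as pool:
    features = list(pool.map(extract, signals))
print(features)
```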
But when testing with another feature extractor:
```python
import time

import audb
import audmld


db = audb.load("emodb", version="1.4.1")
for multiprocessing in [False, True]:
    for num_workers in [1, 5]:
        interface = audmld.Mld(
            num_workers=num_workers,
            multiprocessing=multiprocessing,
        )
        t0 = time.time()
        df = interface.process_index(db.files)
        t = time.time() - t0
        print(f"{multiprocessing=}, {num_workers=}: {t:.2f} s")
```
there is indeed a difference:
```
multiprocessing=False, num_workers=1: 118.00 s
multiprocessing=False, num_workers=5: 189.54 s
multiprocessing=True, num_workers=1: 106.39 s
multiprocessing=True, num_workers=5: 46.43 s
```
So I guess this indicates that we made some (wrong?) choice in its implementation, with the result that it only scales with multiprocessing?
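A possible (assumed) explanation for this pattern: if Mld spends its time in pure-Python code, the GIL prevents threads from running it in parallel, and with several threads the runtime can even degrade due to contention, while separate processes each get their own interpreter. A generic sketch of such a GIL-bound workload:

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def busy(n):
    # Pure-Python loop: holds the GIL for its entire runtime.
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    jobs = [200_000] * 5
    for name, pool_cls in [
        ("threads", ThreadPoolExecutor),
        ("processes", ProcessPoolExecutor),
    ]:
        t0 = time.time()
        with pool_cls(max_workers=5) as pool:
            results = list(pool.map(busy, jobs))
        # Threads give roughly serial runtime (or worse); processes
        # can actually run the five jobs in parallel.
        print(f"{name}: {time.time() - t0:.2f} s")
```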
At the moment we have multiprocessing=False as the default, but I wonder what was/is the reasoning behind it. When browsing the web, I can find the following statement:
When doing a simple test:
it returns (after running the second time)
Even though we don't do heavy processing here, multi-processing seems to be faster in this case. Is this expected?
/cc @ureichel, @ChristianGeng, @frankenjoe, @maxschmitt, @audeerington, @schruefer