Serialization is too slow

audeering / audobject

Generic Python interface for serializing objects to YAML

https://audeering.github.io/audobject/

Serialization is too slow #90

Closed hagenw closed 9 months ago

hagenw commented 10 months ago

Serializing an object to YAML and loading it back adds a large overhead (roughly a factor of 100), which does not seem reasonable to me:

import audobject
import auglib
import numpy as np
import time

def measure_wo_serialization():
    # Baseline: create and apply the transform 100 times
    start = time.perf_counter()
    for _ in range(100):
        signal = np.zeros((1, 16000))
        transform = auglib.transform.Tone(1000)
        transform(signal)
    end = time.perf_counter()
    print(end - start)

def measure_w_serialization():
    # Same loop, plus a YAML round trip of the transform in every iteration
    start = time.perf_counter()
    for _ in range(100):
        signal = np.zeros((1, 16000))
        transform = auglib.transform.Tone(1000)
        yaml_s = transform.to_yaml_s(include_version=False)
        transform = audobject.from_yaml_s(yaml_s)
        transform(signal)
    end = time.perf_counter()
    print(end - start)

Then we get:

>>> measure_wo_serialization()
0.17473506927490234
>>> measure_w_serialization()
18.60582685470581

hagenw commented 9 months ago

I looked into it: the extra time is spent entirely in importlib_metadata.packages_distributions(), which we call inside audobject.core.utils.create_class_key():

https://github.com/audeering/audobject/blob/d1f3dd88d5202c36ef3a21055b42b99fc39e7415/audobject/core/utils.py#L30-L36

This is unfortunate, as the package name differs from the module name only in very rare cases. It would be better to call it only when really needed, but so far I don't see how to achieve this.

frankenjoe commented 9 months ago

Maybe we can first try with module_name, and call packages_distributions() only if we cannot create the object?

hagenw commented 9 months ago

The package_name is indeed only needed if autoinstall=True, see

https://github.com/audeering/audobject/blob/d1f3dd88d5202c36ef3a21055b42b99fc39e7415/audobject/core/utils.py#L104-L108

But the problem is that packages_distributions() only works when the package is already installed, which is the case when we store the object but not necessarily when we load it. And without it, the mapping from module_name to package_name is ambiguous.

packages_distributions() also collects the information for all installed packages, not only for the package we require. Maybe we can look into its source code, pass the module name as an argument, and use only the code we need.

hagenw commented 9 months ago

The code inside importlib_metadata.packages_distributions() does indeed loop over all installed packages, which might not be needed; compare https://github.com/python/importlib_metadata/blob/353c3dfe83e08b3c86f30d192c315501cef97454/importlib_metadata/__init__.py#L955-L969
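A lookup restricted to one module could be sketched roughly as follows. This is an illustration and not code from audobject or importlib_metadata: `find_distribution_for` is a hypothetical name, the standard-library `importlib.metadata` is used, and only distributions that ship a `top_level.txt` file are considered. The loop stops at the first match instead of building the mapping for every installed package.

```python
import importlib.metadata
from typing import Optional


def find_distribution_for(module_name: str) -> Optional[str]:
    """Hypothetical helper: find the distribution that declares a module."""
    top_level = module_name.split(".")[0]
    for dist in importlib.metadata.distributions():
        # top_level.txt lists the top-level modules of a distribution;
        # read_text() returns None if the file is missing.
        declared = (dist.read_text("top_level.txt") or "").split()
        if top_level in declared:
            # Stop at the first match instead of scanning everything.
            return dist.metadata["Name"]
    return None
```

In the worst case, when the module is not found, this still visits every distribution, and packages that do not provide `top_level.txt` are missed, so it would need a fallback for those.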