dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0

Use a cache for typesystem.is_instance_of() #266

Closed DavidHuebner closed 1 year ago

DavidHuebner commented 1 year ago

Is your feature request related to a problem? Please describe.

I have a list of XMI CAS objects in which I saved only a few, but rather complex, types. I noticed that loading these types takes a rather long time. When I profiled the loading routine for 1000 CAS XMI files, I noticed that about 40% of the time is spent in typesystem.is_instance_of(). This is because the function is called in CasXmiDeserializer.deserialize() (https://github.com/dkpro/dkpro-cassis/blob/main/cassis/xmi.py#L220) for each feature, and is_instance_of always walks recursively through the complex type hierarchy.

Profiling results image
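
For readers who want to reproduce this kind of measurement, here is a minimal sketch using the standard-library cProfile module. It does not use cassis: the PARENTS dictionary, the recursive is_instance_of function and the load_everything loop are all hypothetical stand-ins for the type hierarchy, the hierarchy check and the deserialization loop described above.

```python
import cProfile
import io
import pstats

# Hypothetical stand-in for a type hierarchy: child type name -> parent type name.
PARENTS = {
    "custom.Entity": "custom.Span",
    "custom.Span": "uima.tcas.Annotation",
    "uima.tcas.Annotation": "uima.cas.AnnotationBase",
    "uima.cas.AnnotationBase": "uima.cas.TOP",
}


def is_instance_of(type_name, parent_name):
    """Walk up the hierarchy recursively until parent_name or the root is reached."""
    if type_name == parent_name:
        return True
    parent = PARENTS.get(type_name)
    if parent is None:
        return False
    return is_instance_of(parent, parent_name)


def load_everything(n_features=10_000):
    """Stand-in for the loading loop: one hierarchy check per feature."""
    hits = 0
    for _ in range(n_features):
        if is_instance_of("custom.Entity", "uima.cas.StringArray"):
            hits += 1
    return hits


profiler = cProfile.Profile()
profiler.enable()
result = load_everything()
profiler.disable()

# Collect the five most expensive entries by cumulative time into a string.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

In the printed report, the row for is_instance_of shows how much of the total time the repeated hierarchy walk accounts for in this toy setup.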

Describe the solution you'd like

We should reduce the number of calls to typesystem.is_instance_of() by caching the results. I have prepared a small implementation. For me, it cuts the loading time by about a third.

Without caching

100%|██████████| 1000/1000 [02:25<00:00, 6.86it/s]
100%|██████████| 1000/1000 [02:26<00:00, 6.83it/s]
100%|██████████| 1000/1000 [02:26<00:00, 6.81it/s]

With cache

100%|██████████| 1000/1000 [01:30<00:00, 11.05it/s]
100%|██████████| 1000/1000 [01:34<00:00, 10.64it/s]
100%|██████████| 1000/1000 [01:38<00:00, 10.12it/s]
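
As a quick check of the "about a third" figure, the elapsed times from the progress bars can be averaged and compared. This is only arithmetic on the six runs shown above; the variable names are made up for the example.

```python
from statistics import mean

# Elapsed times in seconds for 1000 CAS files, read off the progress bars.
without_cache = [2 * 60 + 25, 2 * 60 + 26, 2 * 60 + 26]
with_cache = [1 * 60 + 30, 1 * 60 + 34, 1 * 60 + 38]

mean_without = mean(without_cache)
mean_with = mean(with_cache)

# Fraction of loading time saved, and the corresponding throughput ratio.
time_saved = 1 - mean_with / mean_without
speedup = mean_without / mean_with

print(f"mean without cache: {mean_without:.1f} s")
print(f"mean with cache:    {mean_with:.1f} s")
print(f"time saved:         {time_saved:.1%}")
print(f"throughput speedup: {speedup:.2f}x")
```

With these numbers the time saved comes out at roughly 35%, which matches the reported improvement of about a third.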

Describe alternatives you've considered

I thought about wrapping typesystem.is_instance_of() directly in a cache. This would be a cleaner implementation, but since the typesystem can be changed, the cache would have to be reset every time something changes.
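
To illustrate what that alternative would involve, here is a rough sketch of a cache wrapped around the lookup itself, cleared on every change to the hierarchy. ToyTypeSystem and all of its methods are hypothetical and deliberately simplified; this is not the cassis TypeSystem class, only an illustration of the invalidation problem.

```python
from functools import lru_cache


class ToyTypeSystem:
    """Hypothetical stand-in, not the cassis TypeSystem class."""

    def __init__(self):
        self._parents = {}  # child type name -> parent type name (None for roots)
        # Per-instance cache wrapped around the uncached lookup.
        self._cached_lookup = lru_cache(maxsize=None)(self._lookup)

    def add_type(self, name, parent=None):
        self._parents[name] = parent
        # Any change to the hierarchy may invalidate earlier answers.
        self._cached_lookup.cache_clear()

    def _lookup(self, type_name, parent_name):
        # Walk up the hierarchy until parent_name or the root is reached.
        current = type_name
        while current is not None:
            if current == parent_name:
                return True
            current = self._parents.get(current)
        return False

    def is_instance_of(self, type_name, parent_name):
        return self._cached_lookup(type_name, parent_name)


ts = ToyTypeSystem()
ts.add_type("uima.cas.TOP")
ts.add_type("custom.Span", parent="uima.cas.TOP")

# The type does not exist yet, so the (cached) answer is False.
before = ts.is_instance_of("custom.Entity", "custom.Span")

# Adding the type clears the cache, so the next call sees the new hierarchy.
ts.add_type("custom.Entity", parent="custom.Span")
after = ts.is_instance_of("custom.Entity", "custom.Span")

print(before, after)
```

Without the cache_clear() call in add_type, the second lookup would return the stale cached answer. Every method that mutates the hierarchy would need the same treatment, which is the maintenance cost mentioned above.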

Additional context

If needed, I can provide the data for the loading benchmarks.

DavidHuebner commented 1 year ago

I wanted to prepare a pull request, but I think I am missing the privileges to create a new branch, so I am pasting the relevant changes here.

Changes in xmi.py.

# See https://github.com/dkpro/dkpro-cassis/issues/266
# Checking for each feature whether its type is a StringArray is rather slow, so we cache the result per type name
is_instance_of_string_array_map = {}

# Post-process feature values
for xmi_id, fs in feature_structures.items():
    t = typesystem.get_type(fs.type.name)

    for feature in t.all_features:
        feature_name = feature.name
        value = fs[feature_name]

        if feature_name == "sofa":
            fs[feature_name] = sofas[value]
            continue

        if fs.type.name not in is_instance_of_string_array_map:
            is_instance_of_string_array_map[fs.type.name] = typesystem.is_instance_of(
                fs.type.name, TYPE_NAME_STRING_ARRAY
            )

        if is_instance_of_string_array_map[fs.type.name]:
            # We already parsed string arrays to a Python list of string
            # before, so we do not need to work more on this
            continue
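
The effect of the dictionary above can be shown with a small, self-contained toy example. Nothing here comes from cassis: the PARENTS hierarchy, the is_instance_of function and the feature_structures list are simplified, hypothetical stand-ins. The sketch assumes the dictionary is created once per deserialization call, as the snippet suggests.

```python
from collections import Counter

# Hypothetical toy hierarchy: child type name -> parent type name.
PARENTS = {
    "uima.cas.StringArray": "uima.cas.ArrayBase",
    "uima.cas.ArrayBase": "uima.cas.TOP",
    "custom.Token": "uima.tcas.Annotation",
    "uima.tcas.Annotation": "uima.cas.AnnotationBase",
    "uima.cas.AnnotationBase": "uima.cas.TOP",
}

lookups = Counter()  # counts how often the uncached check runs per type name


def is_instance_of(type_name, parent_name):
    lookups[type_name] += 1
    current = type_name
    while current is not None:
        if current == parent_name:
            return True
        current = PARENTS.get(current)
    return False


# (type name, number of features) pairs standing in for feature structures.
feature_structures = [("custom.Token", 4)] * 500 + [("uima.cas.StringArray", 1)] * 20

cache = {}
skipped = 0
for type_name, n_features in feature_structures:
    for _ in range(n_features):
        # Same pattern as above: one real lookup per distinct type name.
        if type_name not in cache:
            cache[type_name] = is_instance_of(type_name, "uima.cas.StringArray")
        if cache[type_name]:
            skipped += 1
            continue

print(dict(lookups), skipped)
```

In this toy run the hierarchy walk happens once per distinct type name instead of once per feature, so 2020 feature checks need only two real lookups.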

reckart commented 1 year ago

To prepare a PR, fork the repo, push the branch to your fork, and then create the pull request. GitHub will usually even offer to create the PR for you when you visit your branch in the web UI.

DavidHuebner commented 1 year ago

Thanks, I created a pull request here: https://github.com/dkpro/dkpro-cassis/pull/267