Closed. DavidHuebner closed this issue 1 year ago.
I wanted to prepare a pull request, but I think I lack the privileges to create a new branch, so I am pasting the relevant changes here.
Changes in xmi.py.
# See https://github.com/dkpro/dkpro-cassis/issues/266
# The checking for each feature if it is a StringArray is rather slow, hence, we cache the results
is_instance_of_string_array_map = {}

# Post-process feature values
for xmi_id, fs in feature_structures.items():
    t = typesystem.get_type(fs.type.name)

    for feature in t.all_features:
        feature_name = feature.name
        value = fs[feature_name]

        if feature_name == "sofa":
            fs[feature_name] = sofas[value]
            continue

        if fs.type.name not in is_instance_of_string_array_map:
            is_instance_of_string_array_map[fs.type.name] = typesystem.is_instance_of(
                fs.type.name, TYPE_NAME_STRING_ARRAY
            )

        if is_instance_of_string_array_map[fs.type.name]:
            # We already parsed string arrays to a Python list of string
            # before, so we do not need to work more on this
            continue
To prepare a PR, you fork the repo, push the branch to your fork, and then create the pull request. GitHub will usually even offer to create the PR for you when you visit your branch in the web UI.
Thanks, I created a pull request here: https://github.com/dkpro/dkpro-cassis/pull/267
Is your feature request related to a problem? Please describe.
I have a list of XMI CAS objects in which I only saved a few, but rather complex, types. I noticed that loading these types takes a rather long time. When I profiled the loading routine for 1000 CAS XMI files, I noticed that about 40% of the time is spent in typesystem.is_instance_of(). This is because this function is called in CasXmiDeserializer.deserialize() (https://github.com/dkpro/dkpro-cassis/blob/main/cassis/xmi.py#L220) for each feature, and is_instance_of always recursively moves through the complex type hierarchy.
Profiling results
Describe the solution you'd like
We should reduce the number of calls to typesystem.is_instance_of() by caching the results. I have prepared a small implementation. It cuts the loading time by roughly a third for me (about 2:26 without caching versus about 1:34 with the cache, per 1000 files).
Without caching
100%|██████████| 1000/1000 [02:25<00:00, 6.86it/s]
100%|██████████| 1000/1000 [02:26<00:00, 6.83it/s]
100%|██████████| 1000/1000 [02:26<00:00, 6.81it/s]
With cache
100%|██████████| 1000/1000 [01:30<00:00, 11.05it/s]
100%|██████████| 1000/1000 [01:34<00:00, 10.64it/s]
100%|██████████| 1000/1000 [01:38<00:00, 10.12it/s]
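To illustrate why a per-type cache helps, here is a minimal, self-contained sketch. It does not use cassis at all: the PARENTS map, the example.* type names, and both functions are hypothetical stand-ins for a type hierarchy and for the hierarchy walk that is_instance_of performs. It only counts how often the hierarchy is walked with and without a cache keyed by type name.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical child -> parent map standing in for a type hierarchy.
# These names are made up for illustration and are not real UIMA types.
PARENTS: Dict[str, Optional[str]] = {
    "example.Top": None,
    "example.ArrayBase": "example.Top",
    "example.StringArray": "example.ArrayBase",
    "example.Annotation": "example.Top",
    "example.Token": "example.Annotation",
}


def is_instance_of(type_name: str, ancestor_name: str) -> bool:
    """Walk up the parent chain until the ancestor or the root is reached."""
    current: Optional[str] = type_name
    while current is not None:
        if current == ancestor_name:
            return True
        current = PARENTS[current]
    return False


def count_string_arrays(type_names: Iterable[str], use_cache: bool) -> Tuple[int, int]:
    """Return (number of string-array instances, number of hierarchy walks)."""
    cache: Dict[str, bool] = {}
    matches = 0
    walks = 0
    for name in type_names:
        if use_cache and name in cache:
            result = cache[name]
        else:
            result = is_instance_of(name, "example.StringArray")
            walks += 1
            if use_cache:
                cache[name] = result
        matches += int(result)
    return matches, walks


type_names = ["example.Token"] * 1000 + ["example.StringArray"] * 10
print(count_string_arrays(type_names, use_cache=False))  # (10, 1010)
print(count_string_arrays(type_names, use_cache=True))   # (10, 2)
```

With the cache, the hierarchy is walked once per distinct type name instead of once per lookup, which is the effect the proposed change relies on.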
Describe alternatives you've considered
I thought about wrapping the function typesystem.is_instance_of() directly inside a cache. This would be a cleaner implementation, but since we allow typesystem changes, we would need to reset the cache every time something changes.
Additional context
If needed, I can provide the data for the loading benchmarks.
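For reference, the alternative of caching inside is_instance_of itself could look roughly like the sketch below. ToyTypeSystem, its methods, and the example.* names are all hypothetical. This is not the cassis TypeSystem API, only an illustration of the "reset the cache on every change" requirement.

```python
from typing import Dict, Optional, Tuple


class ToyTypeSystem:
    """Hypothetical stand-in for a type system; not the cassis API."""

    def __init__(self) -> None:
        self._parents: Dict[str, Optional[str]] = {}
        self._cache: Dict[Tuple[str, str], bool] = {}
        self.walks = 0  # number of uncached hierarchy walks, for illustration

    def add_type(self, name: str, parent: Optional[str] = None) -> None:
        """Add or re-parent a type; any change invalidates the whole cache."""
        if parent is not None and parent not in self._parents:
            raise KeyError(f"unknown parent type: {parent}")
        self._parents[name] = parent
        self._cache.clear()

    def is_instance_of(self, type_name: str, ancestor_name: str) -> bool:
        """Cached lookup; walks the hierarchy only on a cache miss."""
        key = (type_name, ancestor_name)
        if key not in self._cache:
            self._cache[key] = self._walk(type_name, ancestor_name)
        return self._cache[key]

    def _walk(self, type_name: str, ancestor_name: str) -> bool:
        self.walks += 1
        current: Optional[str] = type_name
        while current is not None:
            if current == ancestor_name:
                return True
            current = self._parents[current]
        return False


ts = ToyTypeSystem()
ts.add_type("example.Top")
ts.add_type("example.Annotation", "example.Top")
ts.add_type("example.Token", "example.Annotation")

before = [ts.is_instance_of("example.Token", "example.Top") for _ in range(1000)]
walks_before = ts.walks  # only the first of the 1000 calls walks the hierarchy

# Re-parenting a type changes the answer, so the cache has to be dropped.
ts.add_type("example.Other")
ts.add_type("example.Token", "example.Other")
after = ts.is_instance_of("example.Token", "example.Top")
print(all(before), walks_before, after, ts.walks)  # True 1 False 2
```

The trade-off is the one described above: call sites stay clean, but every mutating method has to remember to clear the cache, otherwise the last lookup would still return the stale True.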