dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0

Speed up load_cas_from_xmi by improving offset_mapping and sofaString setter #290

Closed. DavidHuebner closed this issue 9 months ago

DavidHuebner commented 9 months ago

Is your feature request related to a problem? Please describe.

I ran a profiler while loading a large number of CAS XMI files of varying size with relatively few annotations, and noticed two bottlenecks:

  1. The create_offset_mapping function accounts for about half of the loading time.
  2. For each call to load_cas_from_xmi, create_offset_mapping appears to be called twice: once when _parse_sofa() is called and once when the sofaString is set for the view.
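For anyone who wants to reproduce this kind of measurement, here is a minimal sketch using the standard library's cProfile. The functions `load_stub` and `create_offset_mapping_stub` are hypothetical stand-ins that only mimic the double call described above; they are not the cassis implementation.

```python
import cProfile
import io
import pstats


def create_offset_mapping_stub(text):
    # Hypothetical stand-in for the expensive mapping step.
    return {i: i for i in range(len(text))}


def load_stub(text):
    # Mimics the reported pattern: the mapping is built twice per load.
    create_offset_mapping_stub(text)  # while parsing the sofa
    create_offset_mapping_stub(text)  # again when the sofa string is set
    return text


def profile_call(func, *args):
    # Run func under the profiler and return the report as a string.
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats()
    return buffer.getvalue()


report = profile_call(load_stub, "x" * 10_000)
print(report)
```

In the printed report, the `ncalls` column for the stub shows how often it ran per load, and the cumulative time column shows its share of the total.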

Describe the solution you'd like

  1. In my tests, an LRU cache speeds up the encoding method dramatically.
  2. We should check if we have a redundant call here and eliminate it, if possible.
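To illustrate the first point, here is a rough sketch of caching a per-character encode with `functools.lru_cache`. It assumes the mapping translates between UTF-16 code unit offsets and Python string offsets; the names `utf16_units` and `build_offset_mapping` are made up for this example and do not reflect the actual cassis code.

```python
from functools import lru_cache


@lru_cache(maxsize=4096)
def utf16_units(char):
    # Number of UTF-16 code units needed for one character.
    # Cached, because texts reuse a small set of characters.
    return len(char.encode("utf-16-le")) // 2


def build_offset_mapping(text):
    # Hypothetical mapping builder: Python offset <-> UTF-16 offset.
    python_to_utf16 = {}
    utf16_to_python = {}
    utf16_pos = 0
    for python_pos, char in enumerate(text):
        python_to_utf16[python_pos] = utf16_pos
        utf16_to_python[utf16_pos] = python_pos
        utf16_pos += utf16_units(char)
    # Also map the end offset so that spans ending at the text end resolve.
    python_to_utf16[len(text)] = utf16_pos
    utf16_to_python[utf16_pos] = len(text)
    return python_to_utf16, utf16_to_python


python_to_utf16, utf16_to_python = build_offset_mapping("a\U0001F600b")
print(python_to_utf16)  # {0: 0, 1: 1, 2: 3, 3: 4}
```

Since the cache key is a single character, the hit rate is high for natural-language text, and `utf16_units.cache_info()` can be used to check it.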

I will prepare a Pull Request.

Additional context

Profiler screenshot:

[Image: load_cas_from_xmi_profiler]

DavidHuebner commented 9 months ago

I created a Pull Request here: https://github.com/dkpro/dkpro-cassis/pull/291

It addresses both issues:

  1. An LRU cache now speeds up the encode function, and I refactored create_offset_mapping.
  2. When reading a CAS XMI, I pass the offset mappings down directly to avoid recomputing them.
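The second point can be sketched roughly as follows. This is only an illustration of the idea: `ViewSketch`, `set_sofa_string`, `compute_offset_mapping`, and `load_view` are hypothetical names, and the actual change in the pull request may be structured differently.

```python
CALLS = {"count": 0}


def compute_offset_mapping(text):
    # Hypothetical stand-in for the expensive mapping step; counts its calls.
    CALLS["count"] += 1
    return list(range(len(text) + 1))


class ViewSketch:
    def __init__(self):
        self._sofa_string = None
        self._offset_mapping = None

    @property
    def sofa_string(self):
        return self._sofa_string

    @sofa_string.setter
    def sofa_string(self, text):
        # Public setter: no mapping is available, so compute one.
        self.set_sofa_string(text)

    def set_sofa_string(self, text, offset_mapping=None):
        # Reuse a mapping handed down by the caller, if there is one.
        self._sofa_string = text
        if offset_mapping is None:
            offset_mapping = compute_offset_mapping(text)
        self._offset_mapping = offset_mapping


def load_view(text):
    # The loader computes the mapping once while parsing the sofa ...
    mapping = compute_offset_mapping(text)
    view = ViewSketch()
    # ... and passes it down instead of triggering a second computation.
    view.set_sofa_string(text, offset_mapping=mapping)
    return view


view = load_view("hello")
print(CALLS["count"])  # 1
```

The plain setter keeps working for users who assign the sofa string themselves; only the loading path skips the redundant computation.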

In my experiments with about 20k CAS XMI files, this reduces the share of runtime spent in create_offset_mapping from about 49% to about 8%, cutting the total time for reading CAS XMI by over one third (55 s before, 32 s after).

Can you please have a look?

[Image: load_cas_from_xmi_profiler_after_optimization]