dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0
85 stars 22 forks source link

Increasing times in typesystem deserialization #124

Closed rcarlini-upf closed 4 years ago

rcarlini-upf commented 4 years ago

Describe the bug When typesystem is deserizalized using TypeSystemDeserializer shows an increasing amount of time, even when called with the same typesystem multiple times.

To Reproduce Steps to reproduce the behavior using ipython:

  1. Import TypeSystemDeserializer from cassis.typesystem import TypeSystemDeserializer
  2. Instantiate it deserializer = TypeSystemDeserializer()
  3. Choose a typesystem to use for the test typesystem_path = "tests/test_files/typesystems/important_dkpro_types.xml"
  4. Run %timeit magic command result = %timeit -n 1 -r 100 -o deserializer.deserialize(typesystem_path)

The output is: The slowest run took 7.17 times longer than the fastest. This could mean that an intermediate result is being cached. 3.88 s ± 1.74 s per loop (mean ± std. dev. of 100 runs, 1 loop each)

Looking at the timings from result.all_runs, it takes less than one second at the beginning and ends up taking close to 7s to load the typesystem (each repeated call increases 0.06 s ± 0.09 s).

Expected behavior A stable time performance, tied to the complexity of the typesysytem.

Error message Not applicable.

Please complete the following information:

Additional context I'm setting up a service to perform some additional analysis (using spacy, for example) so I receive an xmi and a typesystem each time the endpoint is called.

jcklie commented 4 years ago

I can reproduce the issue. This is fundamentally difficult to fix as cassis generates a class for every type in the typesystem and therefore a lot of garbage. I just pushed a fix to master which generates classes only on the first use (https://github.com/dkpro/dkpro-cassis/issues/84). With that fix, I get

In [1]: from cassis.typesystem import TypeSystemDeserializer 
In [2]: deserializer = TypeSystemDeserializer() 
In [3]: typesystem_path = "tests/test_files/typesystems/important_dkpro_types.xml" 
In [4]: result = %timeit -n 1 -r 100 -o deserializer.deserialize(typesystem_path) 
31.8 ms ± 2.96 ms per loop (mean ± std. dev. of 100 runs, 1 loop each)

That looks faster to me. Can you use master and try again? You can install master via pip with

pip install git+https://github.com/dkpro/dkpro-cassis.git

Is your project open source? I would love to see it, as we did something similar with https://github.com/inception-project/inception-external-recommender

rcarlini-upf commented 4 years ago

It is not open source but I'll try to cleanup and open some parts of it. I've seen some ideas that I like in that inception-project and I'll try to add them while cleaning up.

I'll try the fix later but I have a question about it. In the long run, this fix does not avoid the cluttering, right? Because it seems to me that the problem comes from re-creating classes even if the definition is the same.

jcklie commented 4 years ago

For small type systems, I assume that this problem is much smaller. For larger ones I think the garbage collector will take care of the cluttering, we also reduced the garbage as you only pay for types that you use. The fix also sped up creating type systems by a factor of ~100x so cluttering should have a much smaller impact now.

rcarlini-upf commented 4 years ago

Ok, I've rerun the same process and got: 49.2 ms ± 3.01 ms per loop (mean ± std. dev. of 100 runs, 1 loop each) which is quite an improvement.

For larger ones I think the garbage collector will take care of the cluttering, we also reduced the garbage as you only pay for types that you use. The fix also sped up creating type systems by a factor of ~100x so cluttering should have a much smaller impact now.

Makes sense. In the end, we didn't notice a memory impact, so I assume that you are right on the garbage collector point. Thank you for the clarification!

I'll contact you once I have integrated your ideas and opened our project :)

jcklie commented 4 years ago

I will make a release with this soonish.