Closed rcarlini-upf closed 4 years ago
I can reproduce the issue. This is fundamentally difficult to fix, as cassis generates a class for every type in the typesystem and therefore produces a lot of garbage. I just pushed a fix to master that generates classes only on first use (https://github.com/dkpro/dkpro-cassis/issues/84). With that fix, I get
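The idea behind the fix can be illustrated with a minimal sketch (hypothetical names, not cassis's actual internals): instead of eagerly building a Python class for every type declared in the typesystem, each class is created on first access and cached, so unused types never cost anything.

```python
# Minimal sketch of lazy, cached class generation (hypothetical
# names; NOT cassis's actual implementation).
class TypeRegistry:
    def __init__(self, type_definitions):
        # type_definitions: mapping of type name -> list of field names
        self._definitions = type_definitions
        self._classes = {}  # cache: a class is built on first use only

    def get_class(self, name):
        # Build the class lazily and reuse it on later calls,
        # so repeated lookups never re-create (and re-garbage) it.
        if name not in self._classes:
            fields = self._definitions[name]
            self._classes[name] = type(name, (object,), dict.fromkeys(fields))
        return self._classes[name]


registry = TypeRegistry({"Token": ["begin", "end"], "Sentence": ["begin", "end"]})
TokenA = registry.get_class("Token")
TokenB = registry.get_class("Token")
assert TokenA is TokenB          # same cached class, no re-creation
assert "Sentence" not in registry._classes  # never requested, never built
```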
In [1]: from cassis.typesystem import TypeSystemDeserializer
In [2]: deserializer = TypeSystemDeserializer()
In [3]: typesystem_path = "tests/test_files/typesystems/important_dkpro_types.xml"
In [4]: result = %timeit -n 1 -r 100 -o deserializer.deserialize(typesystem_path)
31.8 ms ± 2.96 ms per loop (mean ± std. dev. of 100 runs, 1 loop each)
That looks faster to me. Can you use master and try again? You can install master via pip with
pip install git+https://github.com/dkpro/dkpro-cassis.git
Is your project open source? I would love to see it, as we did something similar with https://github.com/inception-project/inception-external-recommender
It is not open source but I'll try to cleanup and open some parts of it. I've seen some ideas that I like in that inception-project and I'll try to add them while cleaning up.
I'll try the fix later but I have a question about it. In the long run, this fix does not avoid the cluttering, right? Because it seems to me that the problem comes from re-creating classes even if the definition is the same.
For small type systems, I assume that this problem is much smaller. For larger ones I think the garbage collector will take care of the cluttering, we also reduced the garbage as you only pay for types that you use. The fix also sped up creating type systems by a factor of ~100x so cluttering should have a much smaller impact now.
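The garbage-collector point can be checked directly: a dynamically created class with no remaining strong references is reclaimed like any other object. This is a generic CPython demonstration, unrelated to cassis's own code:

```python
import gc
import weakref

# Dynamically create a class, much as a typesystem deserializer might.
cls = type("ThrowawayType", (object,), {})
ref = weakref.ref(cls)

del cls       # drop the only strong reference
gc.collect()  # class objects sit in reference cycles, so force a GC pass

assert ref() is None  # the class object was collected
```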
Ok, I've rerun the same process and got:
49.2 ms ± 3.01 ms per loop (mean ± std. dev. of 100 runs, 1 loop each)
which is quite an improvement.
> For larger ones I think the garbage collector will take care of the cluttering, we also reduced the garbage as you only pay for types that you use. The fix also sped up creating type systems by a factor of ~100x so cluttering should have a much smaller impact now.
Makes sense. In the end, we didn't notice a memory impact, so I assume that you are right on the garbage collector point. Thank you for the clarification!
I'll contact you once I have integrated your ideas and opened our project :)
I will make a release with this soonish.
Describe the bug Deserializing a typesystem with TypeSystemDeserializer takes an increasing amount of time on each call, even when the same typesystem is loaded multiple times.
To Reproduce Steps to reproduce the behavior using ipython:
from cassis.typesystem import TypeSystemDeserializer
deserializer = TypeSystemDeserializer()
typesystem_path = "tests/test_files/typesystems/important_dkpro_types.xml"
result = %timeit -n 1 -r 100 -o deserializer.deserialize(typesystem_path)
The output is:
The slowest run took 7.17 times longer than the fastest. This could mean that an intermediate result is being cached. 3.88 s ± 1.74 s per loop (mean ± std. dev. of 100 runs, 1 loop each)
Looking at the timings from result.all_runs, it takes less than one second at the beginning and ends up taking close to 7 s to load the typesystem (each repeated call adds roughly 0.06 s ± 0.09 s).
Expected behavior Stable timing, tied only to the complexity of the typesystem.
Error message Not applicable.
Additional context I'm setting up a service to perform some additional analysis (using spaCy, for example), so I receive an XMI and a typesystem each time the endpoint is called.
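For a service like that, one way to avoid paying the deserialization cost on every request is to cache the deserialized typesystem keyed by a hash of its raw bytes. A sketch (the `deserialize` callable is a stand-in for whatever loader you use, e.g. cassis's):

```python
import hashlib

class TypesystemCache:
    """Cache deserialized typesystems by content hash (illustrative sketch)."""

    def __init__(self, deserialize):
        self._deserialize = deserialize  # e.g. a cassis loader
        self._cache = {}

    def get(self, typesystem_xml: bytes):
        # Identical payloads hash to the same key, so each distinct
        # typesystem is deserialized exactly once.
        key = hashlib.sha256(typesystem_xml).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._deserialize(typesystem_xml)
        return self._cache[key]


# Usage with a stub deserializer that counts how often it actually runs:
calls = []
cache = TypesystemCache(lambda xml: calls.append(xml) or object())
ts1 = cache.get(b"<typeSystemDescription/>")
ts2 = cache.get(b"<typeSystemDescription/>")
assert ts1 is ts2 and len(calls) == 1  # deserialized only once
```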