danielgtaylor / python-betterproto

Clean, modern, Python 3.6+ code generator & library for Protobuf 3 and async gRPC
MIT License
1.51k stars 214 forks

Question: about performance #314

Open jdvor opened 2 years ago

jdvor commented 2 years ago

First of all, let me say that betterproto is an absolutely beautiful project, with a much better interface and development experience than the corresponding Python library from Google. I have quite a lot of experience with Google's protobuf from other languages (C#, Go, even Java), and recently I had to use their Python implementation and was surprised how bad it was to work with. In fact, it probably could not be any worse: everything is assembled at runtime, so the generated wrappers are illegible both to humans and to IDEs. I was curious why anyone would choose such an alien, anti-developer solution, and the only thing I could think of was performance.

So I ventured to test it here: jdvor/pbwhy. It looks like that is indeed the case.

Gobot1234 commented 2 years ago

It's faster for a few reasons:

  1. It's written in C++ by default. (I'd be interested to see what the performance numbers look like if you force it to use the Python version.)
  2. They have a lot of people working on it.
  3. They do explicitly, at code-gen time, a lot of the work that betterproto does implicitly at runtime.

This is however something we are looking to improve.

jdvor commented 2 years ago

Thanks for the answer.

As for 1.), I would give it a try, but I can't find any mention of such a possibility in the docs. It is probably a purely academic question anyway, because nobody in their right mind would opt into Google's horrible Python protobuf interface unless it were compensated by drastically better performance.

jdvor commented 2 years ago

I've managed to create a protobuf benchmark with the pure-Python backend (it turns out this is not something you choose at runtime, but during installation) and updated the results. Short story: serialization speed is roughly the same as betterproto's, and deserialization is ~2.5x faster. Overall, without the C++ implementation there's no reason to choose Google's library, IMHO.
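For reference, the backend selection mentioned here can also be influenced at import time via the `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION` environment variable (the same variable used in the conclusion below), provided the corresponding backend was built into the installed wheel. A minimal sketch, using protobuf's internal `api_implementation` module to report which backend ended up active:

```python
import os

# The backend must be chosen before google.protobuf is first imported;
# the environment variable is read once at import time. Forcing "python"
# only works if no native backend has already been loaded.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

try:
    # Internal module, but the conventional way to check the active backend:
    # returns "python", "cpp", or (in newer releases) "upb".
    from google.protobuf.internal import api_implementation
    print(api_implementation.Type())
except ImportError:
    print("protobuf is not installed")
```

Note that `api_implementation` is an internal module, so this check may vary between protobuf releases.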

cetanu commented 2 years ago

There is definitely opportunity to do some profiling and see where the time is being taken. I don't think we've reached that point yet; at the moment all the focus has been on getting the package to a state where it works seamlessly alongside other libraries and can be used for gRPC client and server implementations.

I don't have a ton of time to look into it, but if you feel like doing some brief profiling in that pbwhy repo and seeing if you find anything obvious, we could open an issue and try for some optimizations.

Gobot1234 commented 2 years ago

Maybe take a look at #153

jdvor commented 2 years ago

I've added the ability to run the serialization and deserialization benchmarks separately (there's still an option to run both at once), and also the ability to CPU-profile / memory-profile the benchmarks. So it should be reasonably comfortable for performance-optimization work.

$ pipenv run deserialization --cpu_profile
         4010005 function calls (3920005 primitive calls) in 1.731 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.868    1.868 D:\dev\pbwhy\benchmark.py:75(deserialization_cpu)
        1    0.009    0.009    1.868    1.868 D:\dev\pbwhy\benchmark.py:61(deserialization)
    10000    0.010    0.000    1.858    0.000 run.py:21(deserialization)
40000/10000    0.287    0.000    1.554    0.000 D:\dev\.virtualenvs\betterproto-OCaYUMbC\lib\site-packages\betterproto\__init__.py:725(parse)
140000/80000    0.121    0.000    1.012    0.000 D:\dev\.virtualenvs\betterproto-OCaYUMbC\lib\site-packages\betterproto\__init__.py:684(_postprocess_single)
    40000    0.144    0.000    0.489    0.000 D:\dev\.virtualenvs\betterproto-OCaYUMbC\lib\site-packages\betterproto\__init__.py:505(__post_init__)
   180000    0.221    0.000    0.422    0.000 D:\dev\.virtualenvs\betterproto-OCaYUMbC\lib\site-packages\betterproto\__init__.py:408(parse_fields)
   390000    0.253    0.000    0.340    0.000 D:\dev\.virtualenvs\betterproto-OCaYUMbC\lib\site-packages\betterproto\__init__.py:535(__setattr__)
   240000    0.067    0.000    0.296    0.000 {built-in method builtins.setattr}
    80000    0.113    0.000    0.179    0.000 c:\python38\lib\dataclasses.py:1022(fields)
    10000    0.009    0.000    0.109    0.000 <string>:2(__init__)
   290000    0.074    0.000    0.100    0.000 D:\dev\.virtualenvs\betterproto-OCaYUMbC\lib\site-packages\betterproto\__init__.py:554(_betterproto)
   130000    0.056    0.000    0.099    0.000 D:\dev\.virtualenvs\betterproto-OCaYUMbC\lib\site-packages\betterproto\__init__.py:651(_get_field_default)
   260000    0.081    0.000    0.081    0.000 D:\dev\.virtualenvs\betterproto-OCaYUMbC\lib\site-packages\betterproto\__init__.py:382(decode_varint)
   640000    0.064    0.000    0.064    0.000 {built-in method builtins.getattr}
   340000    0.047    0.000    0.047    0.000 c:\python38\lib\dataclasses.py:1037(<genexpr>)
   270000    0.042    0.000    0.042    0.000 D:\dev\.virtualenvs\betterproto-OCaYUMbC\lib\site-packages\betterproto\__init__.py:161(get)
   390000    0.042    0.000    0.042    0.000 {built-in method builtins.hasattr}
    40000    0.021    0.000    0.021    0.000 D:\dev\.virtualenvs\betterproto-OCaYUMbC\lib\site-packages\betterproto\__init__.py:730(<dictcomp>)
   150000    0.018    0.000    0.018    0.000 {built-in method builtins.isinstance}
   180000    0.016    0.000    0.016    0.000 {built-in method builtins.len}
    50000    0.009    0.000    0.009    0.000 {method 'decode' of 'bytes' objects}
    80000    0.009    0.000    0.009    0.000 {method 'values' of 'dict' objects}
    20000    0.009    0.000    0.009    0.000 D:\dev\.virtualenvs\betterproto-OCaYUMbC\lib\site-packages\betterproto\__init__.py:279(_pack_fmt)
    20000    0.004    0.000    0.004    0.000 {built-in method _struct.unpack}
    20000    0.002    0.000    0.002    0.000 {method 'append' of 'list' objects}
        2    0.000    0.000    0.000    0.000 {built-in method time.perf_counter_ns}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

betterproto deserialization [10000]: 1867.57 ms

CPU: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel, 8 cpu, 2.0 GHz
OS: win-amd64
Python: 3.8.10, CPython, 3d8993a
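A listing like the one above can be produced with the stdlib `cProfile`/`pstats` modules. Here is a minimal sketch, with a placeholder workload standing in for the actual betterproto deserialization loop (in the real benchmark the body would be something like `MyMessage().parse(payload)` over the corpus):

```python
import cProfile
import io
import pstats

# Placeholder workload; substitute the real deserialization calls here.
def deserialization(n: int) -> None:
    for _ in range(n):
        bytes(10)

profiler = cProfile.Profile()
profiler.enable()
deserialization(10_000)
profiler.disable()

stream = io.StringIO()
# Sort by cumulative time, as in the listing above, and show the top entries.
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```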

The best candidate for optimization is, I would say, betterproto\__init__.py:684(_postprocess_single). However, I don't see any obvious optimizations there.
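For what it's worth, one generic micro-optimization the profile suggests is caching per-class metadata: `dataclasses.fields()` alone accounts for 80,000 calls above, and its result is immutable per class. A sketch of memoizing it (this is an illustrative idea, not betterproto's actual code; `Example` is a hypothetical stand-in for a generated message class):

```python
from dataclasses import dataclass, fields
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_fields(cls):
    # fields() rebuilds its result from class metadata on every call;
    # per class the result never changes, so it is safe to memoize.
    return fields(cls)

@dataclass
class Example:  # hypothetical stand-in for a generated message class
    value: int = 0

first = cached_fields(Example)
second = cached_fields(Example)
assert first is second  # second lookup is a cache hit, no rebuild
```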

It is up to you now whether you want to close this issue. You have answered my original question.

My conclusion is that betterproto and protobuf (with PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python) have roughly the same performance and are wire-compatible, hence there is no reason to choose the latter. The native-code implementation, protobuf with PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp, is an order of magnitude more performant (~8x for serialization, ~30x for deserialization).
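The timing style behind such ratios (matching the `perf_counter_ns` calls visible in the profile above) can be sketched as a simple wall-clock loop; the payload and workload below are placeholders for the real message calls:

```python
import time

def bench(fn, rounds: int = 10_000) -> float:
    """Time `rounds` repetitions of `fn`, returning milliseconds."""
    start = time.perf_counter_ns()
    for _ in range(rounds):
        fn()
    return (time.perf_counter_ns() - start) / 1e6

# Placeholder workload; substitute msg.SerializeToString() (Google) or
# bytes(msg) (betterproto) to compare the two libraries on real messages.
payload = b"\x08\x96\x01" * 100
elapsed_ms = bench(lambda: payload * 2)
print(f"benchmark [10000]: {elapsed_ms:.2f} ms")
```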

chr1st1ank commented 1 year ago

I found this issue when hunting down the performance bottleneck in a streaming pipeline with relatively large protobuf messages. We had chosen betterproto because of the idiomatic Python classes it creates. Surprisingly, it turns out that the protobuf handling is now our bottleneck, not the algorithms that process the messages.

I can't share the actual messages, but I can share the results of a very simple benchmark setup that compares serialization and deserialization of 1,000 real-life messages from our services. The benchmark shows that betterproto is more than two orders of magnitude slower than the Google implementation:

---------------------------------------------------------------------------------------- benchmark: 4 tests ----------------------------------------------------------------------------------------
Name (time in ms)                   Min                 Max                Mean            StdDev              Median                IQR            Outliers       OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_serialize_google_pb         1.9646 (1.0)        3.0055 (1.0)        2.1197 (1.0)      0.1241 (1.0)        2.0961 (1.0)       0.1175 (2.45)        59;16  471.7597 (1.0)         373           1
test_deserialize_google_pb       2.4722 (1.26)      11.6283 (3.87)       2.6150 (1.23)     0.7406 (5.97)       2.5396 (1.21)      0.0479 (1.0)          2;20  382.4022 (0.81)        289           1
test_serialize_betterpb        539.9312 (274.83)   545.0845 (181.36)   542.5575 (255.96)   1.8385 (14.82)    542.5318 (258.83)    1.7946 (37.48)         2;0    1.8431 (0.00)          5           1
test_deserialize_betterpb      974.3474 (495.95)   994.9041 (331.03)   983.7954 (464.12)   8.9757 (72.36)    980.1845 (467.62)   15.4577 (322.83)        2;0    1.0165 (0.00)          5           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
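The table above comes from pytest-benchmark; the same min/mean comparison can be roughed out with the stdlib `timeit` module. The two callables below are hypothetical stand-ins for the real serialization calls (`msg.SerializeToString()` for Google, `bytes(msg)` for betterproto) over the message corpus:

```python
import timeit

# Hypothetical placeholder workloads; swap in real serialization calls.
def serialize_google_pb():
    return b"x" * 1024

def serialize_betterpb():
    return bytes(bytearray(1024))

results = {}
for name, fn in [("google_pb", serialize_google_pb),
                 ("betterpb", serialize_betterpb)]:
    # 5 rounds of 1000 iterations each, roughly mirroring pytest-benchmark's
    # rounds/iterations split in the table above.
    times = timeit.repeat(fn, repeat=5, number=1000)
    results[name] = (min(times) * 1000, sum(times) / len(times) * 1000)
    print(f"{name}: min={results[name][0]:.4f} ms  mean={results[name][1]:.4f} ms")
```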

A flamegraph shows that the slowdown is mostly due to a very few functions in the core of the library: (flamegraph image)

I'm convinced that this is just due to missing optimization. Understandably, that was less important than the feature-completeness of the library.

If there's interest from the maintainers, I would be happy to perform a more in-depth analysis and come up with optimizations, either in plain Python or in a small Rust extension. Please let me know.