danielgtaylor / python-betterproto

Clean, modern, Python 3.6+ code generator & library for Protobuf 3 and async gRPC
MIT License

Decoding performance in comparison to vanilla protoc #153

Open spectral-ai opened 4 years ago

spectral-ai commented 4 years ago

Hi,

I switched to betterproto from the vanilla Google protoc plugin for Python. So far I am really happy with the generated API and the benefits that come with it.

However, I observed that the decoding performance of binary protobuf messages decreased significantly with the switch to betterproto. In my tests the vanilla protoc plugin was approx. 250-300x faster.

Is this a known problem? I have seen that there has been work on serialize/deserialize performance (e.g. https://github.com/danielgtaylor/python-betterproto/pull/46) and that benchmarks have been added to the betterproto codebase. However, as far as I have seen, these check for bottlenecks introduced by betterproto commits and do not compare betterproto's performance against vanilla protoc.

I would be more than happy if someone could comment on this. I would really like to continue to use betterproto but this performance bottleneck is currently a blocker.

Many thanks in advance!

Jan
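(Editor's note: a comparison like this can be reproduced with a small timeit harness along the lines of the sketch below. The module names bench_bp and bench_pb2, and the TestMessage schema with a repeated int32 field some_ints, are hypothetical stand-ins for the betterproto- and protoc-generated code for the same .proto.)

import timeit

import bench_bp   # betterproto-generated module for the test .proto (assumed name)
import bench_pb2  # protoc-generated (google.protobuf) module for the same .proto (assumed name)

# Serialize once with betterproto; the wire format is identical for both libraries.
payload = bytes(bench_bp.TestMessage(some_ints=list(range(100_000))))

def parse_betterproto():
    bench_bp.TestMessage().parse(payload)

def parse_protoc():
    bench_pb2.TestMessage().ParseFromString(payload)

for name, fn in [("betterproto", parse_betterproto), ("protoc", parse_protoc)]:
    print(name, timeit.timeit(fn, number=10))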

Gobot1234 commented 4 years ago

This kind of issue is being worked on: #144 should help somewhat, and I was also planning on working on Rust acceleration for this.

finete commented 3 years ago

@Gobot1234 asking out of curiosity, why Rust and not cython?

Gobot1234 commented 3 years ago

I wanted an excuse to learn Rust, and I also think it would be faster.

nat-n commented 3 years ago

I think extracting the parsing logic to a cython compiled module would be a good first step. Writing an extension in rust might give better performance but isn't as good a fit for this project.
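(Editor's note: as a rough illustration of that suggestion, not something the project ships, the hot parsing helpers could live in their own module, say src/betterproto/_parse.py, and be compiled with Cython's pure-Python mode via a build script like the sketch below; the module path and package name are hypothetical.)

# setup.py sketch: compile a hypothetical betterproto/_parse.py with Cython.
# In pure-Python mode the same file still runs uncompiled as a fallback.
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="betterproto-parse-accel-sketch",
    ext_modules=cythonize(
        ["src/betterproto/_parse.py"],  # hypothetical module holding decode_varint etc.
        language_level="3",
    ),
)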

spectral-ai commented 3 years ago

Thank you for looking into the matter! Some further experiments seem to indicate that the performance decrease is most severe for messages containing repeated fields. Would one expect that as a result of issue #144?

Gobot1234 commented 3 years ago

I just linked the issue because it is a change that brings performance improvements; it might also speed things up for repeated fields, but I haven't tested that.

wittmstSICKAG commented 3 years ago

> Thank you for looking into the matter! Some further experiments seem to indicate that the performance decrease is most severe for messages containing repeated fields. Would one expect that as a result of issue #144?

I did some further benchmarks and the serialization/deserialization performance decreases significantly with repeated fields. Since #144 was never merged into master, this might be worth a look.

Benchmark code:

import betterproto
from dataclasses import dataclass
from typing import List

@dataclass
class TestMessage(betterproto.Message):
    foo: int = betterproto.uint32_field(0)
    bar: str = betterproto.string_field(1)
    baz: float = betterproto.float_field(2)
    some_ints: List[int] = betterproto.int32_field(3)
    some_floats: List[float] = betterproto.float_field(4)

class BenchMessage:
    """Test creation and usage a proto message."""

    def setup(self):
        self.cls = TestMessage
        self.instance = TestMessage()
        self.instance_filled = TestMessage(0, "test", 0.0, [x for x in range(0, 1024*1024)], [x for x in range(0, 1024*1024)])
        self.instance_filled_serialized = bytes(self.instance_filled)

    def time_overhead(self):
        """Overhead in class definition."""

        @dataclass
        class Message(betterproto.Message):
            foo: int = betterproto.uint32_field(0)
            bar: str = betterproto.string_field(1)
            baz: float = betterproto.float_field(2)
            some_ints: List[int] = betterproto.int32_field(3)
            some_floats: List[float] = betterproto.float_field(4)

    def time_instantiation(self):
        """Time instantiation"""
        self.cls()

    def time_attribute_access(self):
        """Time to access an attribute"""
        self.instance.foo
        self.instance.bar
        self.instance.baz

    def time_init_with_values(self):
        """Time to set an attribute"""
        self.cls(0, "test", 0.0, [x for x in range(0, 1024*1024)], [x for x in range(0, 1024*1024)])

    def time_attribute_setting(self):
        """Time to set attributes"""
        self.instance.foo = 0
        self.instance.bar = "test"
        self.instance.baz = 0.0

    def time_serialize(self):
        """Time serializing a message to wire."""
        bytes(self.instance_filled)

    def time_deserialize(self):
        """Time deserializing a message from wire."""
        message = self.cls()
        message.parse(self.instance_filled_serialized)

Benchmark output:

$ poetry run asv dev
· Discovering benchmarks
· Running 8 total benchmarks (1 commits * 1 environments * 8 benchmarks)
[  0.00%] ·· Benchmarking existing-pyc__python-betterproto_.venv_scripts_python.exe
[  6.25%] ··· benchmarks.BenchMessage.time_attribute_access     19.1±0μs
[ 12.50%] ··· benchmarks.BenchMessage.time_attribute_setting    18.8±0μs
[ 18.75%] ··· benchmarks.BenchMessage.time_deserialize           4.03±0s
[ 25.00%] ··· benchmarks.BenchMessage.time_init_with_values      113±0ms
[ 31.25%] ··· benchmarks.BenchMessage.time_instantiation        34.2±0μs
[ 37.50%] ··· benchmarks.BenchMessage.time_overhead              492±0μs
[ 43.75%] ··· benchmarks.BenchMessage.time_serialize             2.01±0s
[ 50.00%] ··· benchmarks.MemSuite.mem_instance                       208

Line_profiler output:

$ poetry run python -m line_profiler benchmarks.py.lprof
Timer unit: 1e-07 s

Total time: 20.2788 s
File: C:\python-betterproto\src\betterproto\__init__.py
Function: parse at line 827

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   827                                               @profile
   828                                               def parse(self: T, data: bytes) -> T:
   829                                                   """
   830                                                   Parse the binary encoded Protobuf into this message instance. This
   831                                                   returns the instance itself and is therefore assignable and chainable.
   832
   833                                                   Parameters
   834                                                   -----------
   835                                                   data: :class:`bytes`
   836                                                       The data to parse the protobuf from.
   837
   838                                                   Returns
   839                                                   --------
   840                                                   :class:`Message`
   841                                                       The initialized message.
   842                                                   """
   843                                                   # Got some data over the wire
   844         1        216.0    216.0      0.0          self._serialized_on_wire = True
   845         1         31.0     31.0      0.0          proto_meta = self._betterproto
   846         4      54199.0  13549.8      0.0          for parsed in parse_fields(data):
   847         3         69.0     23.0      0.0              field_name = proto_meta.field_name_by_number.get(parsed.number)
   848         3         23.0      7.7      0.0              if not field_name:
   849                                                           self._unknown_fields += parsed.raw
   850                                                           continue
   851
   852         3         28.0      9.3      0.0              meta = proto_meta.meta_by_field_name[field_name]
   853
   854                                                       value: Any
   855         3         73.0     24.3      0.0              if parsed.wire_type == WIRE_LEN_DELIM and meta.proto_type in PACKED_TYPES:
   856                                                           # This is a packed repeated field.
   857         2          9.0      4.5      0.0                  pos = 0
   858         2         11.0      5.5      0.0                  value = []
   859   2097154   12457570.0      5.9      6.1                  while pos < len(parsed.value):
   860   2097152   12891589.0      6.1      6.4                      if meta.proto_type in [TYPE_FLOAT, TYPE_FIXED32, TYPE_SFIXED32]:
   861   1048576    7039734.0      6.7      3.5                          decoded, pos = parsed.value[pos : pos + 4], pos + 4
   862   1048576    4927477.0      4.7      2.4                          wire_type = WIRE_FIXED_32
   863   1048576    6356895.0      6.1      3.1                      elif meta.proto_type in [TYPE_DOUBLE, TYPE_FIXED64, TYPE_SFIXED64]:
   864                                                                   decoded, pos = parsed.value[pos : pos + 8], pos + 8
   865                                                                   wire_type = WIRE_FIXED_64
   866                                                               else:
   867   1048576   45026476.0     42.9     22.2                          decoded, pos = decode_varint(parsed.value, pos)
   868   1048576    5425998.0      5.2      2.7                          wire_type = WIRE_VARINT
   869   2097152   28162473.0     13.4     13.9                      decoded = self._postprocess_single(
   870   2097152   68894603.0     32.9     34.0                          wire_type, meta, field_name, decoded
   871                                                               )
   872   2097152   11550145.0      5.5      5.7                      value.append(decoded)
   873                                                       else:
   874         1         19.0     19.0      0.0                  value = self._postprocess_single(
   875         1         67.0     67.0      0.0                      parsed.wire_type, meta, field_name, parsed.value
   876                                                           )
   877
   878         3        379.0    126.3      0.0              current = getattr(self, field_name)
   879         3         21.0      7.0      0.0              if meta.proto_type == TYPE_MAP:
   880                                                           # Value represents a single key/value pair entry in the map.
   881                                                           current[value.key] = value.value
   882         3         43.0     14.3      0.0              elif isinstance(current, list) and not isinstance(value, list):
   883                                                           current.append(value)
   884                                                       else:
   885         3        278.0     92.7      0.0                  setattr(self, field_name, value)
   886
   887         1          6.0      6.0      0.0          return self

From that it looks as if the bottlenecks are the decode_varint and _postprocess_single functions. Any ideas how to speed this up? 2 s for serialization and 4 s for deserialization is quite long compared to vanilla protobuf.
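(Editor's note: for context, a pure-Python varint decoder has roughly the shape below; this is a generic sketch, not betterproto's exact code. Every element of a packed repeated field pays for this per-byte loop plus the per-value work in _postprocess_single, which matches where the profile above spends its time.)

from typing import Tuple

def decode_varint_sketch(buffer: bytes, pos: int) -> Tuple[int, int]:
    """Generic varint decoding sketch: read 7 bits per byte until the high bit is clear."""
    result = 0
    shift = 0
    while True:
        byte = buffer[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte of the varint
            return result, pos
        shift += 7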

Gobot1234 commented 3 years ago

Yes, I have some further ideas for improving speed. I'll see how it goes.

uditrana commented 12 months ago

Is there any update to this?

Protobufs with repeated fields (we are passing around dataframes) seem to serialize 50x faster with vanilla protoc compared to betterproto.

Gobot1234 commented 12 months ago

See #520

uditrana commented 12 months ago

I see! I got really excited about this package because it allows really intuitive conversions between dataframes and protobufs via dictionaries (i.e. proto_lib.Data().from_dict(df.to_dict('list'))).

But given how slow serialization is, I will have to wait for that to improve and mature before we adopt it.
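(Editor's note: for context, the pattern being described looks roughly like the sketch below. Data is a hypothetical betterproto message with one repeated field per dataframe column; single-word field names are used so the dict keys line up without any case conversion.)

from dataclasses import dataclass
from typing import List

import betterproto
import pandas as pd

@dataclass
class Data(betterproto.Message):
    prices: List[float] = betterproto.double_field(1)
    volumes: List[int] = betterproto.int32_field(2)

df = pd.DataFrame({"prices": [1.0, 2.5], "volumes": [10, 20]})

msg = Data().from_dict(df.to_dict("list"))  # column-oriented dict -> message
payload = bytes(msg)                        # serialize to the protobuf wire format
df_roundtrip = pd.DataFrame(Data().parse(payload).to_dict())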

Blackclaws commented 8 months ago

Just to clarify, #545 was merged which means we can use betterproto-rust-codec to accelerate deserialization?

I'm not feeling particularly great about installing another package that is hosted by a different person entirely as an optional dependency. Not to be too picky, but this isn't the wild npm west where you just pull in hundreds of dependencies. Is there any future plan where this functionality is integrated into the main project?

Gobot1234 commented 8 months ago

> Just to clarify, #545 was merged which means we can use betterproto-rust-codec to accelerate deserialization?
>
> I'm not feeling particularly great about installing another package that is hosted by a different person entirely as an optional dependency. Not to be too picky, but this isn't the wild npm west where you just pull in hundreds of dependencies. Is there any future plan where this functionality is integrated into the main project?

Yes, that is the aim at some point, but until the tooling gets better there was no way to have it in the repo as an optional dependency or something along those lines. Have a look through the Slack if you want more info about it.
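(Editor's note: assuming the accelerator is picked up automatically once installed, and assuming the importable module name matches the PyPI package betterproto-rust-codec, a quick availability check could look like this sketch.)

import importlib.util

# Module name assumed to match the PyPI package betterproto-rust-codec.
HAS_RUST_CODEC = importlib.util.find_spec("betterproto_rust_codec") is not None
print("betterproto-rust-codec available:", HAS_RUST_CODEC)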

heimalne commented 2 months ago

Tackling the performance issues would be of great use for real-world usage. Here are some running times that I can share:

A 14 MB JSON file was loaded into a betterproto object and made round trips into bytes and JSON on an M1 MacBook Pro:

package                                 serializing_to_bytes   parsing_from_bytes   serializing_json_string   parsing_from_json_string
betterproto 2.2.0.0b6                   10.3s                  3.7s                 5s                        3.4s
betterproto-git 5fdd0bb                 4s                     7s                   13.6s                     4.7s
betterproto-rust (only affects bytes)   1.1s                   4.9s                 13.3s                     4.8s
protoc-python                           0.01s                  0.01s                0.59s                     0.86s

While the rust-codec is much faster, especially for serializing to bytes, parsing is still slower than betterproto was at the last preview release. And somehow the JSON string serialization speed got worse during the last year.
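(Editor's note: for anyone wanting to reproduce numbers in this shape, a minimal round-trip harness could look like the sketch below; MyMessage and data.json are placeholders for the real schema and payload.)

import time

from my_proto import MyMessage  # hypothetical betterproto-generated message class

def timed(label, fn):
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

raw_json = open("data.json").read()
msg = timed("parsing_from_json_string", lambda: MyMessage().from_json(raw_json))
payload = timed("serializing_to_bytes", lambda: bytes(msg))
timed("parsing_from_bytes", lambda: MyMessage().parse(payload))
timed("serializing_json_string", lambda: msg.to_json())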

Gobot1234 commented 1 month ago

Hmm, it might be useful to bisect these; not quite sure how it could have got so much slower.