apache / pulsar-client-python

Apache Pulsar Python client library
https://pulsar.apache.org/
Apache License 2.0
50 stars 42 forks source link

[Investigation] Fully-native Python client #85

Open erichare opened 1 year ago

erichare commented 1 year ago

I'm rather new to this project but, I thought it might be worthwhile to gauge the community's thoughts on implementing a fully-native Python client. Those of you who have a more intimate understanding of the internals, my questions would be:

Also, if there is a better venue for starting this conversation (i.e., the main pulsar repo, the C++ client repo, or the dev mailing list) let me know and i'll close this and start it there. I just had the thought that maybe those who have worked with the Python client specifically might have a clearer understanding of the pros and cons of going in that direction.

merlimat commented 1 year ago
  • Is it feasible that a native Python client be developed, or are there any potential show-stopping limitations (i.e., the C++ client can handle some task that Python isn't well-suited for)

Everything is "feasible", though not necessarily the best option.

  • Given that it is feasible, is the investment of effort worthwhile? Or is wrapping functionality from the C++ client a better approach at least in the short-medium term?
  • Any other thoughts along these lines?

The main question to ask is: which problem would that be solving? What kind of improvement would that be compared to the current client?

That is what would have to be weighted against.

erichare commented 1 year ago

Definitely agree with your point here @merlimat . I don’t have a clear enough answer to that yet, but off hand my feeling would be that it would open up development of the client to a universe of python developers that may have a particularly vested interest in improvements being made.

But yes, the trade off between whether it would lead to any tangible improvements relative to the time investment is definitely questionable, which was sort of why I thought it would be good to begin a dialogue about it.

Appreciate the feedback!

BewareMyPower commented 1 year ago

It could be more friendly to Python developers because C++ is a complicated language for most developers that do not have C++ experiences. It's welcome to write another pure Python implemented client, like the native Go client.

However, from my perspective, it's a little different with Golang. Golang compiles its native binaries, while Python interpreter (I mean CPython here) actually works by loading Python scripts or C dynamic libraries . Writing fully-native Python client might bring some unnecessary performance overhead. In addition, a fully-native Python client is only more friendly for Python developers, not Python users.

Samreay commented 1 year ago

To echo the above comment, I just raised https://github.com/apache/pulsar-client-python/issues/115 and was going to see if I could open a PR for it myself. Without a native python client, the barrier to entry on contributing is quite high.

On top of this, there's also the value lost for developers when using the library itself. Things like type hints and the internal options are obfuscated.

For a concrete example of this, all the imports from _pulsar like ConsumerType have no autocomplete and no python documentation from the code itself. Even the API doc (https://pulsar.apache.org/api/python/2.4.0/) in the subscribe section simply has a default value of _pulsar.ConsumerType.Exclusive but no details on what other ConsumerTypes there are.

In fact, to figure out what was available, I ended up searching the C++ client repo for "ConsumerType" and ending up at https://github.com/apache/pulsar-client-cpp/blob/4a864f20530e230ecee7e4ab09430bd65fb9789c/include/pulsar/ConsumerType.h#L28

It has the information I wanted, but in terms of a smooth developer experience, having to manually search C++ code to figure out what I can pass in to my python signature is less than ideal. Happy to be part of the solution here though, and if we migrate to Python I'm very happy to contribute as much as possible.

BewareMyPower commented 1 year ago

in the subscribe section simply has a default value of _pulsar.ConsumerType.Exclusive but no details on what other ConsumerTypes there are.

It's an issue with the Python API docs generation for the _pulsar module, which is a C++ wrapper written based on the PyBind11. (BTW, 2.4.0 is too old) It seems currently the Python docs generation tool does not generate the docs for the _pulsar module. /cc @tisonkun

On the other hand, PyBind11 support adding the Python API docs via the C++ code. Currently the docs were not good. For example, I typed help(_pulsar.ConsumerType) and see:

 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |
 |  Exclusive = <ConsumerType.Exclusive: 0>
 |
 |  Failover = <ConsumerType.Failover: 2>
 |
 |  KeyShared = <ConsumerType.KeyShared: 3>
 |
 |  Shared = <ConsumerType.Shared: 1>
 |
 |  ----------------------------------------------------------------------
 |  Static methods inherited from pybind11_builtins.pybind11_object:
 |
 |  __new__(*args, **kwargs) from pybind11_builtins.pybind11_type
 |      Create and return a new object.  See help(type) for accurate signature.

We can enhance the API docs later.

tisonkun commented 1 year ago

@BewareMyPower This can be a known issue tracked at https://github.com/twisted/pydoctor/issues/658 which I don't have time to investigate at that time.

tisonkun commented 1 year ago
image

I workaround it with comment in https://github.com/twisted/pydoctor/issues/658. Although, it may require a preinstalled _pulsar module. Perhaps we can find a better way in API docs automatic generation.

zbentley commented 1 year ago

a fully-native Python client is only more friendly for Python developers, not Python users.

I disagree. A lot of issues with python binary extensions in general boil down to them being implemented in non-Python languages. Those issues fall into three major categories:

BewareMyPower commented 1 year ago

@zbentley Your points generally make sense to me. I have some more additional points.

  1. Yeah it's hard for Python users to test a wheel for latest master. Though we can trigger the GitHub action to generate wheels by pushing a tag in your own fork, I must admit that the workflows are not optimized. We have to wait until all workflows are done, which could take 3+ hours, while a single workflow only takes 10+ minutes. (It can be reduced if we have a pre-built image with all dependencies built)
  2. These issues are nearly all related to the GIL, which must be taken very carefully when developing the C extension for Python. pybind11 simplifies the APIs but we still need to write GIL-safe C++ code.
  3. The main issue is related to the multi-processes. Because of the implictly called destructors, the C++ classes are not guaranteed to be safely shared among various processes like the trivial C structs. Other issues are still related to GIL.

I think point 2 and 3 are still within the scope of the development. Though it makes it hard for Python users to debug issues.

zbentley commented 1 year ago

@BewareMyPower perhaps, though issues related to 2 and 3 largely occurred for me while trying to use Pulsar from ordinary python code, rather than trying to develop features on the client itself.

I think part of the issue is that GIL-safe code in this repo is very hard to write; it's easy to be GIL-safe when the handoff between C++ and Python is "unidirectional"--that is, Python code calls into C++, which returns when finished back into Python. But it's much harder when things are bidirectional, and the compiled code can later call back into Python, which the Pulsar client does quite frequently.