capi-workgroup / decisions

Discussion and voting on specific issues

gh-115754: Expose constants (ex: Py_None) as public ABI symbols (stable ABI) #16

Closed vstinner closed 8 months ago

vstinner commented 8 months ago

Currently, the C API implements Py_None, Py_True, Py_False, Py_Ellipsis and Py_NotImplemented constants by referring to private symbols at the ABI level:

#define Py_None (&_Py_NoneStruct)
#define Py_False _PyObject_CAST(&_Py_FalseStruct)
#define Py_True _PyObject_CAST(&_Py_TrueStruct)
#define Py_Ellipsis (&_Py_EllipsisObject)
#define Py_NotImplemented (&_Py_NotImplementedStruct)

In the stable ABI, I would like to avoid referring to private symbols and only use public symbols.

The issue https://github.com/python/cpython/issues/115754 discussed two options:

  • (A) Expose the "Py_None" API as a PyObject *Py_None symbol at the ABI level.
  • (B) Implement the "Py_None" API as a function call. Example: #define Py_None Py_GetNone() with PyObject* Py_GetNone(void), where Py_GetNone() cannot fail.
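To make the difference concrete, here is a minimal sketch of how the two options could be spelled at the header level. This is illustrative only, not necessarily how the actual PRs spell it; option (A) is shown the same way the PyExc_* exception objects are already exposed, and option (B) uses the declaration quoted above.

/* Option (A), sketched: Py_None becomes an exported PyObject* data
   symbol, like the PyExc_* exception objects already are. */
PyAPI_DATA(PyObject *) Py_None;

/* Option (B), sketched: Py_None is obtained through a function call that
   cannot fail; the macro keeps the existing spelling working. */
PyAPI_FUNC(PyObject *) Py_GetNone(void);
#define Py_None Py_GetNone()

With option (A) the object's address is baked into the ABI; with option (B) only the function is, which is what gives the implementation room to change later.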

Please read the discussion in https://github.com/python/cpython/issues/115754, especially @zooba's comment https://github.com/python/cpython/issues/115754#issuecomment-1959798283, which lists possible constraints on such an API. Extract:

  • lazy type initialization
  • thread/interpreter-local types
  • synchronized/locked access
  • switch from singleton to non-singleton
  • always return new references
  • run-time library delegation (e.g. load python3.dll and have it forward to whichever python3x.dll it's using)
  • static analysis for reference counts
  • "fake" singletons
  • deprecation/warnings/errors
  • GIL enforcement

Please vote for your preferred option, or add a comment if you want a 3rd option.

I wrote https://github.com/python/cpython/pull/115755 which is a simple change implementing option (A).


I prefer option (A). I agree that option (B) avoids potential/hypothetical future issues, but I don't want to write a perfect API/ABI, only to make pragmatic changes.

Option (B) can have a (significant?) impact on performance, especially in hot code. So far, nobody has run microbenchmarks; no numbers are available.

I would prefer to only consider option (B) once we design the stable ABI in depth. Right now, we are far from being able to support tagged pointers, for example. I would prefer not to pay a performance overhead for a theoretical issue :-(

Note: Subinterpreters share singletons thanks to immortal objects (PEP 683, Python 3.12).


By the way, I already added Py_Is(x, y), Py_IsNone(x), Py_IsTrue(x) and Py_IsFalse(x) for PyPy. In PyPy, you cannot simply compare an object to a constant memory address (if (x == Py_None) ...), since PyPy objects can be moved in memory.
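For example, portable extension code can spell these checks with the helpers instead of raw pointer comparisons. A minimal sketch; check_kind() is a hypothetical helper written for this illustration, not a CPython API:

#include <Python.h>

/* Classify a value without comparing raw pointers, so the checks also
   work on implementations (such as PyPy) where objects can move. */
static int
check_kind(PyObject *x)
{
    if (Py_IsNone(x)) {            /* preferred over: x == Py_None */
        return 0;
    }
    if (Py_IsTrue(x)) {            /* preferred over: x == Py_True */
        return 1;
    }
    if (Py_Is(x, Py_Ellipsis)) {   /* generic identity check */
        return 2;
    }
    return -1;
}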

This issue is specific to CPython. Other Python implementations are free to use a different implementation and a different ABI.

vstinner commented 8 months ago

@gvanrossum:

I don't have time for this, but I'm surprised you haven't asked @ericsnowcurrently yet.

I wrote:

Note: Subinterpreters share singletons thanks to immortal objects (PEP 683, Python 3.12).

In the past, I proposed adding Py_GetNone() and having per-interpreter singletons: https://github.com/python/cpython/issues/83692. Apparently, @ericsnowcurrently didn't like this idea, and it was decided to go with immortal objects instead.

encukou commented 8 months ago

I prefer that the stable ABI be useful for alternative implementations: they nicely show the design space that CPython might want to explore in the future. If we want to hide implementation details, looking at PyPy is a pretty good way to do it. Hence, I voted for option (B).

I would prefer to only consider option (B) once we design the stable ABI in depth.

I'd prefer that we do that now -- for new additions. If we don't want to design new API now, let's stick to what works.

ericsnowcurrently commented 8 months ago
  • (B) Implement "Py_None" API as a function call. Example: #define Py_None Py_GetNone() with PyObject* Py_GetNone(void), where Py_GetNone() cannot fail.

A similar option, which I believe @zooba suggested:

ericsnowcurrently commented 8 months ago
  • (A) Expose "Py_None" API as PyObject *Py_None symbol at the ABI level.

Objects exposed directly by the C-API were one of the few hard problems I had to solve for a per-interpreter GIL. If it hadn't been for immortal objects, I would have had to either do some messy macro trickery or introduce costly fixup code across most of the C-API implementation. In addition to the extra complexity, the macro hacks would probably have been a further pain point for other compilers/implementations, and the fixup-code solution would have been fragile. It also doesn't help that we expose values for some of the objects (e.g. the non-exception builtin types) rather than just pointers. Note that even with object immortality, I still had to add a trick for static builtin types, to accommodate tp_dict and tp_subclasses. Overall, the whole situation was a real pain.

Everything would have been drastically simpler if we had never exposed the objects directly in the first place. With that in mind, I firmly favor option (B). Let's move away from exposing objects directly in the C-API, rather than doing it more.

vstinner commented 8 months ago

Option (B) can have a (significant?) impact on performance, especially in hot code. So far, nobody has run microbenchmarks; no numbers are available.

Hmm, I will try to work on a benchmark to get some numbers. It may help us make a decision.

vstinner commented 8 months ago

I wrote https://github.com/python/cpython/pull/116572 to measure the performance overhead. I modified the constants everywhere for this benchmark, including inside the Python internals, so the measurement is a worst-case scenario.

Results with PGO, LTO and CPU isolation on Linux: https://github.com/python/cpython/pull/116572#issuecomment-1987822227

6 benchmarks are at least 1.05x slower:

+-------------------------------+----------+------------------------+
| Benchmark                     | Before   | After                  |
+-------------------------------+----------+------------------------+
| sqlite_synth                  | 3.73 us  | 3.90 us: 1.05x slower  |
| typing_runtime_protocols      | 172 us   | 180 us: 1.05x slower   |
| json_dumps                    | 15.5 ms  | 16.5 ms: 1.06x slower  |
| pickle                        | 17.6 us  | 19.7 us: 1.12x slower  |
| pickle_dict                   | 43.1 us  | 50.2 us: 1.17x slower  |
| pickle_list                   | 6.64 us  | 7.87 us: 1.19x slower  |
+-------------------------------+----------+------------------------+

3 benchmarks are at least 1.05x faster:

+-------------------------------+----------+------------------------+
| Benchmark                     | Before   | After                  |
+-------------------------------+----------+------------------------+
| coverage                      | 499 ms   | 152 ms: 3.29x faster   |
| dask                          | 885 ms   | 628 ms: 1.41x faster   |
| logging_silent                | 166 ns   | 155 ns: 1.07x faster   |
+-------------------------------+----------+------------------------+

3x faster for coverage looks really strange. Maybe I didn't run the benchmarks properly; I'm not sure.

vstinner commented 8 months ago

Since the majority seems to prefer function calls, I prepared a full PR which documents the change: https://github.com/python/cpython/pull/116572. It's ready to be merged and is waiting for a decision :-)

@iritkatriel: Do you have an opinion on this decision? Tell me if you need more details.

iritkatriel commented 8 months ago

There doesn't seem to be much of a perf impact, so yeah, I think option B makes more sense.

gvanrossum commented 8 months ago

Maybe ask Mike Droettboom to help with benchmarking this?

On Mon, Mar 11, 2024 at 3:25 AM Irit Katriel wrote:

There doesn't seem to be much of a perf impact, so yeah, I think option B makes more sense.

--Guido van Rossum (python.org/~guido)

vstinner commented 8 months ago

Maybe ask Mike Droettboom to help with benchmarking this?

I already ran a microbenchmark and pyperformance. Do you need more benchmarks? I'm not sure that I get your comment.

gvanrossum commented 8 months ago

Maybe ask Mike Droettboom to help with benchmarking this?

I already ran a microbenchmark and pyperformance. Do you need more benchmarks? I'm not sure that I get your comment.

I am used to having the benchmark output in the specific presentation that we use for our team's benchmarking: https://github.com/faster-cpython/benchmarking-public/blob/main/README.md

But I now see that you did run pyperformance and found that it doesn't have a significant effect so I'll accept that.

My vote would currently be (A), since I abhor the idea of a macro that looks like a variable reference but actually expands to a call. Didn't we introduce immortal objects specifically so that we could have a single None object shared between interpreters?
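For illustration, here is a short sketch of what that objection means in practice, using a local demo macro rather than the real Py_None; the demo_* and DEMO_* names are made up for this example:

#include <Python.h>

/* Hypothetical option (B) spelling, reproduced locally for the demo. */
static PyObject *demo_GetNone(void) { return Py_None; }
#define DEMO_None demo_GetNone()

static int
demo(PyObject *x)
{
    /* Reads like a comparison against a global, but now hides a call. */
    if (x == DEMO_None) {
        return 1;
    }
    /* Unlike today's Py_None, "&DEMO_None" would not compile, because
       the address of a function-call expression cannot be taken. */
    return 0;
}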

vstinner commented 8 months ago

I wrote 3 implementations so they can be compared:

So the majority prefers option (B): Implement "Py_None" API as a function call.

Between adding 5 functions for 5 constants and adding 2 functions for 10 constants, I prefer the "Py_GetConstantRef()" approach: it only adds 2 functions. Moreover, it gives the choice between borrowed and strong references, even if in practice both are the same.
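A minimal sketch of what such a two-function API could look like, implemented here on top of the existing singletons (only the five constants from this issue are shown). The names and constant IDs are illustrative, not necessarily the final spelling:

#include <Python.h>

/* Illustrative constant identifiers for this sketch. */
enum {
    SKETCH_CONSTANT_NONE = 0,
    SKETCH_CONSTANT_FALSE = 1,
    SKETCH_CONSTANT_TRUE = 2,
    SKETCH_CONSTANT_ELLIPSIS = 3,
    SKETCH_CONSTANT_NOT_IMPLEMENTED = 4,
};

/* Return a borrowed reference to the requested constant, or NULL for an
   unknown identifier. */
static PyObject *
sketch_GetConstantBorrowed(unsigned int constant_id)
{
    switch (constant_id) {
        case SKETCH_CONSTANT_NONE: return Py_None;
        case SKETCH_CONSTANT_FALSE: return Py_False;
        case SKETCH_CONSTANT_TRUE: return Py_True;
        case SKETCH_CONSTANT_ELLIPSIS: return Py_Ellipsis;
        case SKETCH_CONSTANT_NOT_IMPLEMENTED: return Py_NotImplemented;
        default: return NULL;
    }
}

/* Return a strong reference; since these objects are immortal (PEP 683),
   Py_NewRef() is effectively free, so the two flavors behave the same. */
static PyObject *
sketch_GetConstantRef(unsigned int constant_id)
{
    PyObject *obj = sketch_GetConstantBorrowed(constant_id);
    return obj != NULL ? Py_NewRef(obj) : NULL;
}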

vstinner commented 8 months ago

I'm closing the issue; the decision is to use function calls.

The discussion on the actual implementation can happen directly on the pull request.