faster-cpython / ideas


Tips from Laurie Tratt #600

Open · gvanrossum opened this issue 1 year ago

gvanrossum commented 1 year ago

At the recommendation of a mutual friend, I contacted Laurie Tratt (@ltratt), who wrote back with some useful tips. The rest of this comment is his message verbatim.


[...] I have thought a bit about some of the problems I think you're thinking about. A warning, though: in general, I think we've explored relatively little of the language implementation space; and often I find that people confuse "I did X" with "X is the best/only way of doing things". I'm probably not immune from this either!

My summary of the problem is: in a JIT we want to make assumptions about an object and optimise based on those assumptions; but in a multi-threaded environment another thread may quickly invalidate them. [I'll use the term "JIT", but interpret that liberally: it doesn't have to mean "generate machine code"; it could mean "quickening" etc. etc.]
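As a concrete, if contrived, plain-Python illustration of that problem (the `Box` class, `shared`, and `hot_loop` below are invented for this sketch, and the comments describe what a hypothetical specialising JIT might assume, not anything CPython actually does today):

```python
import threading

class Box:
    def __init__(self, value):
        self.value = value

shared = Box(1)

def hot_loop():
    # A specialising interpreter/JIT watching this loop might assume that
    # shared.value is always an int and compile a fast integer-add path,
    # guarded (or, ideally for speed, unguarded) on that assumption.
    total = 0
    try:
        for _ in range(5_000_000):
            total += shared.value
    except TypeError:
        print("assumption invalidated mid-loop; a JIT would have to deoptimise")
    return total

def other_thread():
    # Any other thread can invalidate the assumption at an arbitrary point.
    shared.value = "no longer an int"

t = threading.Thread(target=other_thread)
t.start()
hot_loop()   # may hit the except branch, depending on thread timing
t.join()
```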

I tend to think that there are two main approaches to mitigating this:

  1. Typically, most objects are created and die within a single thread. A JIT can often prove by "escape analysis" [1,2,3] that an object can't leak to another thread. Then (a) no locking on that object is necessary, and (b) the JIT knows that no other thread can invalidate its assumptions about that object.

    Escape analysis is more useful than is sometimes presumed: often a JIT can prove that an object can't leak up to a certain point in the JITed code. Rather than just assuming the object always has to be treated with multi-threaded caution, the JIT can optimise it as single-threaded up to the escape point, then materialise it as a "full" object and treat it as multi-threaded from then on. (A plain-Python sketch of both cases follows this list.)

  2. Clever/fast locking strategies. Typically, even objects that are accessible in multiple threads are not mutated by those threads simultaneously. If you can acquire/release locks quickly in the (common) uncontended case, you win.

    HotSpot (aka "the JVM") has gone through several iterations to make locks fast (the most famous is "biased locking" but that was removed recently-ish [4]): I think they currently have a two-stage strategy ("lightweight" locks can get upgraded to "heavyweight" locks).

    JavaScriptCore (the JavaScript VM in WebKit/Safari) developed "parking lot" locks [5], which take only two bits and are very good in the uncontended case (better than OS locks!) and tolerable in the contended case. I tend to use this approach these days. (A rough timing illustration of the uncontended vs. contended cost also follows this list.)
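To make point 1 concrete, here is a minimal Python sketch of the two situations described above. The names (`norm_squared`, `make_and_publish`, `published`) are made up, and the comments describe what an escape analysis could conclude, not anything CPython currently does:

```python
def norm_squared(x, y):
    # 'v' never escapes this function: it is not returned, stored in a
    # longer-lived object, or passed to code the compiler cannot see.
    # A JIT that proves this can skip allocating the tuple (and any
    # locking on it) entirely, keeping x and y in registers instead.
    v = (x, y)
    return v[0] * v[0] + v[1] * v[1]

published = []

def make_and_publish(x, y):
    v = (x, y)
    s = v[0] * v[0] + v[1] * v[1]  # up to here, 'v' is provably thread-local
    published.append(v)            # escape point: from here on, 'v' may be
                                   # seen by other threads and must be
                                   # materialised and treated as shared
    return s
```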
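And to ground point 2, a rough microbenchmark of CPython's existing `threading.Lock`, showing how much cheaper the uncontended case is than the contended one. The iteration counts are arbitrary, the numbers will vary by machine, and on a with-GIL build the "contention" comes from preemption rather than true parallelism:

```python
import threading
import time

lock = threading.Lock()
TOTAL = 400_000
THREADS = 4

def hammer(n):
    # Repeatedly acquire and release the same lock.
    for _ in range(n):
        with lock:
            pass

# Uncontended: one thread performs all TOTAL acquisitions on its own.
t0 = time.perf_counter()
hammer(TOTAL)
uncontended = time.perf_counter() - t0

# Contended: the same TOTAL acquisitions, split across threads that all
# fight over the same lock.
threads = [threading.Thread(target=hammer, args=(TOTAL // THREADS,))
           for _ in range(THREADS)]
t0 = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
contended = time.perf_counter() - t0

print(f"uncontended: {uncontended:.3f}s, "
      f"contended ({THREADS} threads): {contended:.3f}s")
```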

I don't pretend to know the precise details of the system you're working on, and thus whether these approaches are definitely what you want. But I think there is a decent chance that if the no-gil work goes ahead, these sorts of factors could prove at least interesting to you.

If any of this sounds interesting, I'd be happy to discuss further by email or video. In terms of West Coast friendly times, I've got availability tomorrow or Monday up to 20:00 BST (midday PDT), for example, if you need to discuss things sooner rather than later.

[Not directly related to this email, but a disclaimer: my group is doing research on automatically creating tracing JITs from C interpreters, so that we can JIT into C extensions. CPython is, unsurprisingly, one that we're very interested in! It's mid-stage work, at the moment, and we might have a demo out later this year. This would complement, rather than replace, other JIT work, so I don't think we're stepping on your toes or vice versa!]

Laurie

[1] https://en.wikipedia.org/wiki/Escape_analysis
[2] https://www.pypy.org/posts/2010/09/escape-analysis-in-pypys-jit-1780048403046080197.html
[3] https://kipp.ly/blog/escape-analysis/
[4] https://openjdk.org/jeps/374
[5] https://webkit.org/blog/6161/locking-in-webkit/