Open patiences opened 6 years ago
@patiences This is a really solid first draft - great work!
A couple of suggestions:
It's great that you've included benchmarking in week 7; however, benchmarking is most useful when it's an ongoing measurement process. I'd suggest getting the benchmark situation sorted out first, so you can run the benchmark each week (or after each change) as a measure of ongoing progress.
Interning strings is a good idea, but I'm not sure if the approach you've proposed will work (at least, not as I've understood what you've described). The constant table in a class file needs to be Java primitives and references; Python strings are instances of a Python string class. So, it won't be possible to intern the Python string instances in the constant table; you'll need a different approach.
Of course, another approach would be to modify VOC to use Java strings in place of a separate string class - which require much less messing around with type conversion, but would mess up the internals a lot. However... I suspect it would also be a huge performance bump. Given that you've found conflicting opinions about whether StackMapFrames are worth the effort, our time might be better served looking to see if modifying VOC internals to use Java primitives rather than classes for Python primitives might be worthwhile. Even a speculative experiment that doesn't end up with production code might be a better use of GSoC time than implementing a bytecode structure that doesn't actually benefit performance.
The constant table in a class file needs to be Java primitives and references; Python strings are instances of a Python string class. So, it won't be possible to intern the Python string instances in the constant table; you'll need a different approach.
Ah, right!
another approach would be to modify VOC to use Java strings in place of a separate string class
Do you mean that python methods that operate on strings (Dict#getitem, for example?) should take java.lang.String strings instead?
Given that you've found conflicting opinions about whether StackMapFrames are worth the effort
Yeah, unfortunately I've seen multiple times that the performance gain was more of a theoretical advantage and has not been realized. :-( But wouldn't implementing StackMapFrames be a good idea since they are no longer optional in Java 8 -- is that a good enough reason?
Regarding strings: I was thinking that the potential optimisation is to completely remove any concept of an org.python.types.String, and replace them with java.lang.String (and similar with Integer, Boolean, and so on). Essentially, we'd be making a special case of Python primitive types, making them fall back to Java primitives.
The catch is that we'd need to provide all the utility methods that Python provides on it's primitive types (e.g., str.format
) as a static method somewhere, and if you ever try to invoke a method on a primitive, it would be redirected to the static method.
It wouldn't be a trivial change, though - at the moment, we can assume every object is an org.python.Object; if we start using Java primitives, that distinction would be lost.
As for whether SMF's are worth adding... that's a good question. It's possibly a good move to still add them; at the moment, we can only generate Java 6 bytecode, and while Java VMs are completely backwards compatible, and J6 bytecode will continue to run, it means that if Java ever adds anything interesting (and they already have in J7 and J8), we're might start to hit problems if our Java source code (for the Python standard library) starts to use those features.
Does this mean that users that write Java extensions for their Python code also have to use Java 6 (I don't know how often this happens)? That seems a bit outdated, I personally can't imagine writing java code without J8 lambdas. If this is the case StackMapFrames would probably be a very good idea.
On the other hand, I think the special-casing Python primitive types would be more interesting work for me. :-)
Combining Java6 and Java8 classes in a single executable shouldn't be a problem... but those are always famous last words :-) And, we don't know what is going to be added in Java9 that might cause compatibility problems with Java6. Pinning ourselves to a known-deprecated format isn't a good idea in general.
Honestly, either StackMapFrames or trying to make better use of primitives would be a viable project; StackMapFrames is better defined, but more detailed; primitive use is a lot more exploratory.
@freakboy3742 Thanks for all the insights! :-)
I've updated my proposal to address your comment about getting the benchmarking down in the first week. I've also replaced the work on implementing StackMapFrames with the primitives types optimizations we've chatted about in the comments here, as it is more interesting work for me and will likely have a huge performance impact. I've also added an extra week to the variable lookup work period -- I've been thinking about how to update the jump addresses and I'm still not sure so I've factored in some more time for that.
Let me know what you think!
This is looking pretty good - certainly good enough to submit as a final proposal.
This project was selected for the 2018 GSoC.
At the end of week 1, here's what's been happening...
Started working on small integer preallocation. After meeting with Russ on Wednesday afternoon and getting answers to my questions, had enough information to come up with a WIP commit (https://github.com/patiences/voc/commit/22f506e1be073bdb00948a734364c66c71f34f69) but unit tests currently with a cryptic Java bytecode-related error.
Started working on writing an initial performance test suite, on which I can expand throughout GSoC. Looked at PyBench, PyStone and the performance and perf (https://perf.readthedocs.io/en/latest/) modules, which I'd like to try to use. Would love some feedback generally on whether or not it's a good idea to use the perf library, or to build an internal benchmarking/timing mechanism. WIP commit here: https://github.com/patiences/voc/commit/e385557954dc1381ae83b514022cc911e31b28c1. Weirdly enough running this test also gives me a stack map related error, which I need to investigate.
Having trouble running ASM, which I need to resolve ASAP to move forward. :) Edit: Thanks Russ!
Week 2 finished! I've been working on:
Small integer and boolean pre-allocation. Code here and here. I am currently investigating the effect on in-place operations, once that is fixed, should be ready for PR.
Performance benchmarking test suite, code here. Should be ready for PR soon as well.
No blockers, and plenty of work to do coming into week 3! :-)
End of week 3 update:
Made progress on small integer and boolean preallocation by further restricting Integer and Boolean construction in the codebase on the Java side rather than just the transpilation side. This should greatly improve performance of Integers and Booleans that are created as a result of executing the code (range
, for example), and speaking of which:
I've been working a lot on how to demonstrate the performance gain of small integer and boolean preallocation. I realised that I wasn't properly measuring the execution time which was causing a lot of instability in the tests but I now have a system. :-) Will have results to share early next week.
Hoping to wrap up these 3 PRs this coming week, and starting work optimizing global loads!
End of week 4 update:
Boolean optimization is merged and small integer optimization and the benchmarking test are close. Hooray! These optimizations actually have created considerable performance improvement as can be seen by the performance tests results which is super exciting.
At the end of the week I started working on optimizing global loads (https://github.com/pybee/voc/pull/833), and will be continuing to work on that this week.
Thanks @freakboy3742 for spending extra time with me this week, even finding time to pair program! :-)
End of week 5 update (sorry late!):
Small integer optimization and the benchmarking test are merged! Yay! The integer optimization definitely gave us a performance boost (though not as drastic as the boolean optimization), and the test suite provides a way to detect changes that affect performance.
This week has been all about working on optimizing global loads. I opened a number of pull requests experimenting with different types of caching, and after meeting with Russ, decided to focus on trying to cache individual modules. Unfortunately I didn't manage to get pystone working (trying to run it through voc
results in a org.objectweb.asm.tree.analysis.AnalyzerException: Error at instruction 3882: Incompatible stack heights
), I may look at that in a future week. This week my goal is to get one optimization working.
End of week 6!
Next week I'll look at getting pystone working as well, as it would be super beneficial, and perhaps doing some profiling. On another note, I've written a blog post about my first month working on VOC, check it out if interested :-)
End of week 7 update:
This week I spent a lot of time on https://github.com/pybee/voc/pull/839, trying to optimize GeneratorFunctions and I think I've found out why it's not working/why it's not a good idea. I'm going to spend just a little more time this week wrapping this up, and then it'll be ready for review.
I've been working on writing a performance test for https://github.com/pybee/voc/pull/845 to show the performance improvement, but I haven't been able to write such a test. I think this is because while there are savings any time we can create fewer objects, the extensive null checks I added to support passing null as an argument on the java side perhaps cancels out any improvement. So it might be the case that there is no actual performance improvement. I'll check with Russ next week to see what we want to do with this PR.
This week I was also able to get pystone running which revealed a bug (fix here). These PRs should be merged soon, and I'll be taking a closer look at the pystone tests to see if we can identify some more candidates for performance optimization there.
End of week 8 update (late because I was away Jul 5-13):
Hooked up pystone PR! The difference on my machine between the beginning of May (before I started working on my GSoC project) and on master now is about 2.8x faster (from ~500 pystones/sec to ~1400 pystones/sec) which is crazy!
Merged https://github.com/pybee/voc/pull/852 -- a bug fix (mentioned above).
Merged https://github.com/pybee/voc/pull/869 -- an optimization that comes from not
creating __code__
attributes on functions. An issue to track __code__
inspection in the future is here.
Almost there with https://github.com/pybee/voc/pull/864 -- this one has been dragging because I've had so much trouble showing performance improvement by benchmarking. It turns out that this is because setting up for a function invocation (the site of this optimization) is nothing compared to how much time the function invocation actually takes. I've verified that the generated bytecode is smaller, so there is a little more cleaning up to do and then it will be ready for review.
I've written another blog post about the module optimization from this past month (https://github.com/pybee/voc/pull/839), if anyone wants to know more about that. :-)
End of week 9:
Merged https://github.com/pybee/voc/pull/875. This is an optimization that targets comparison operations (like >
, ==
, etc) due to org.python.types.Method
s (unneccessarily) being created for each operation. I'm happy to say that it's resulted in an huge performance improvement on pystone!
There's a bit more work to do on https://github.com/pybee/voc/pull/877, a few small bugs that came out of working on #875 above. I'll wrap this up early next week.
Next week, I'll continue working through some potential optimizations based on results I've gotten from JProfile, so there's plenty more work to do!
End of week 10 update:
Merged https://github.com/pybee/voc/pull/881. This is an optimization targeting nested loops, by reusing the same instance of StopIteration
s. This does however introduce some behavioral differences between VOC and CPython, documented here: https://voc.readthedocs.io/en/latest/reference/differences.html
This past week I've been profiling small pieces of code that have exposed a few small bugs, including https://github.com/pybee/voc/issues/882, https://github.com/pybee/voc/issues/885, https://github.com/pybee/voc/issues/886, https://github.com/pybee/voc/issues/888, https://github.com/pybee/voc/issues/890.
Open PRs include 1) https://github.com/pybee/voc/pull/877, which is some bug fixes and refactoring around __contains__
/__not_contains__
. 2) An optimization targeting __dict__
creation which needs a few kinks worked out, and 3) the beginning of some work on reducing the number of org/python/types/Str
objects created (https://github.com/pybee/voc/pull/892).
End of week 11 update:
Merged https://github.com/pybee/voc/pull/893. This was supposed to be just a cleanup of org/python/types/Str
, but it turned out to be a decent optimization especially for the print
function. (See the PR for benchmarking results)
Merged https://github.com/pybee/voc/pull/899. This is an optimization for org/python/types/Dict
operations, which are ubiquitous in Python and also in the Pystone benchmark (we saw an amazing performance improvement on Pystone with this one, details in the PR).
Decided not to continue pursuing https://github.com/pybee/voc/pull/880, because it wasn't showing performance improvement and I was a bit mistaken about the way that Java initializes HashMaps.
Logged a few more bugs (https://github.com/pybee/voc/pull/889, https://github.com/pybee/voc/pull/904) and a quick refactoring (https://github.com/pybee/voc/pull/894).
This coming Tuesday is the official cutoff for GSoC, but I will continue working for about a week after. I'm currently working on a blog post for the website, and wrapping up a few more PRs this coming week.
End of week 12 update:
Merged https://github.com/pybee/voc/pull/910. This is the same optimization as https://github.com/pybee/voc/pull/899, for Dict.__setitem__()
.
Still working on https://github.com/pybee/voc/pull/902, which still needs some cleaning up. The optimization is about creating standard org/python/Object functions at demand, not at module initialization time, which speeds up our initialization time.
Started working on https://github.com/pybee/voc/pull/909 , which is about method caching. Method calls are currently very expensive in VOC because each time we call a method, we create it anew. This optimization yields a significant performance boost, but we are still figuring out how to handle __dict__
inspection if method objects are cached there.
Started working on https://github.com/pybee/voc/pull/905 which is about adding a python base object
test suite. I'm having a lot of trouble with getting the tests passing, it looks like there are a lot of bugs around this (some of which, if not too hairy, I'm also addressing in the PR). Still more work to be done for this.
https://github.com/pybee/voc/pull/911 is ready for review -- it's a clean up of (what seems to be) legacy code that creates unnecessary objects.
Started working on https://github.com/pybee/voc/pull/913, which is part of a bug fix for https://github.com/pybee/voc/issues/907 and https://github.com/pybee/voc/issues/912.
I'm also currently working on hooking up a few more benchmarks from http://speed.pypy.org/comparison/ (which is where I'm finding all these new bugs!). I will try to wrap these up early this week if I can as I am planning on working as part of GSoC until Wednesday (but will be around to finish up anything that needs a little more time of course).
This proposal intends to improve the performance of VOC-generated bytecode by adding piecemeal optimizations.
Table of Contents
Motivation
The current implementation of Python AST to Java bytecode transpilation in many cases takes a naive approach, resulting in redundant bytecode instructions being produced and class files being longer than necessary. Not only does this make the code run slower than it should, this causes problems in some cases because the JVM enforces a size limit on class files, in particular on method sizes: each method must be less than 64KB.
This proposal explores optimizations that cut down the number of generated bytecode instructions.
Goals
This is a list of criteria to use as an evaluation of the success of this project.
Overview of Changes
Reducing Instances of Variable Creation
VOC-generated bytecode contains a lot of Object creation and initialization because everything in Python is an
Object
. Being frugal when creating string and integer objects, in particular, will result in huge savings.Primitive Types
The Python
Object
abstraction is preserved in BeeWare's java implementation of the built-in types. However, this means that every time one of these objects is used instructions forObject
creation must be generated. Working directly with the java primitives rather than wrapping them inObject
would give a significant performance boost. The standard python typesint
,float
,complex
,Str
,bool
andNone
are considered here as the first candidates for refactoring.Integers
CPython keeps an array of integer objects for all integers between -5 and 256. When an integer is created in that range a reference to the existing object is returned. It should be possible for VOC to do the same, that is, instead of generating bytecode that creates a new
Integer
object, any integer in [-5, 256] will refer to the preallocated integer.Improving Variable Lookup
When a variable is read, instructions for loading global or local variables are generated. However, if the variable is looked up again, VOC has to generate the same loading instructions again. This results in a lot of duplicated bytecode. VOC should cache values after the first lookup of global variables, modules, and functions whenever possible.
Considerations
Proposed Timeline
Preparation
Until April 23 (Student Proposal Acceptance)
April 23 - May 14 (Community Bonding Period)
Integer Pre-allocation
Week 1: May 14 (Official Coding Start) - May 20
small_ints
.small_ints
array instead of creating integer objects.Variable Lookup Optimization
Week 2: May 21 - May 27
Week 3: May 28 - June 3
Week 4: June 4 - June 10
Week 5: June 11 (Phase 1 Evaluation due 15th) - June 17
Week 6: June 18 - June 24
Week 7: June 25 - July 1
Primitive Types Optimization
Week 8: July 2 - July 8
Week 9: July 9 (Phase 2 Evaluation due 13th) - July 15
Week 10: July 16 - July 22
Week 11: July 23 - July 29
Week 12: July 30 - August 5
Note: The official coding period is May 14 - August 14, which is a 13 week period. I've accounted for 12 weeks of work here.
Risk Analysis
There are many gaps in my technical knowledge:
I will try to mitigate this by keeping my mentor updated of my progress, updating this timeline as needed, and asking for help early.