[GSoC] Optimizing the VOC Compiler

patiences commented 6 years ago

This proposal intends to improve the performance of VOC-generated bytecode by adding piecemeal optimizations.

Motivation

The current implementation of Python AST to Java bytecode transpilation in many cases takes a naive approach, resulting in redundant bytecode instructions being produced and class files being longer than necessary. Not only does this make the code run slower than it should, this causes problems in some cases because the JVM enforces a size limit on class files, in particular on method sizes: each method must be less than 64KB.

This proposal explores optimizations that cut down the number of generated bytecode instructions.

Goals

This is a list of criteria to use as an evaluation of the success of this project.

All changes outlined in this proposal are implemented;
All changes are tested with a high level of code coverage;
Any problematic areas or potential future optimizations are documented;
Weekly progress updates are given orally/in writing so mentors, community and other GSoC students know what I'm working on.

Overview of Changes

Reducing Instances of Variable Creation

VOC-generated bytecode contains a lot of Object creation and initialization because everything in Python is an Object. Being frugal when creating string and integer objects, in particular, will result in huge savings.

Primitive Types

The Python Object abstraction is preserved in BeeWare's java implementation of the built-in types. However, this means that every time one of these objects is used instructions for Object creation must be generated. Working directly with the java primitives rather than wrapping them in Object would give a significant performance boost. The standard python types int, float, complex, Str, bool and None are considered here as the first candidates for refactoring.

Integers

CPython keeps an array of integer objects for all integers between -5 and 256. When an integer is created in that range a reference to the existing object is returned. It should be possible for VOC to do the same, that is, instead of generating bytecode that creates a new Integer object, any integer in [-5, 256] will refer to the preallocated integer.

Improving Variable Lookup

When a variable is read, instructions for loading global or local variables are generated. However, if the variable is looked up again, VOC has to generate the same loading instructions again. This results in a lot of duplicated bytecode. VOC should cache values after the first lookup of global variables, modules, and functions whenever possible.

Considerations

Converting to SSA form. Static Single Assignment form is a property of bytecode that requires each variable to be assigned exactly once and can have huge benefits in simplifying some types of optimizations, such as redundancy elimination (common sub-expression elimination, constant propagation). However, converting bytecode to SSA form is nontrivial and it's unclear whether VOC actually needs these types of optimizations. If in the future this becomes more valuable to implement the work here should still be relevant.

Proposed Timeline

Preparation

Until April 23 (Student Proposal Acceptance)
- Get to know codebase as suggested changes require a deep understanding of VOC
- Take on small coding tasks to learn the development process and about working with BeeWare contributors
- Respond to any further questions or comments from BeeWare about my proposal, if any
April 23 - May 14 (Community Bonding Period)
- Same as above
- Get more comfortable reading and writing bytecode.

Integer Pre-allocation

Week 1: May 14 (Official Coding Start) - May 20
- Familiarize myself with performance benchmarking tools so I can evaluate the effectiveness of the optimizations in the coming weeks.
- Implement preallocation of a small integer array small_ints.
- Add logic for generating a lookup in the small_ints array instead of creating integer objects.

Variable Lookup Optimization

Week 2: May 21 - May 27
- Write a method that makes a pass over the generated bytecode. This method allows identification of variables that are loaded multiple times, and a mechanism to mark those locations, so that on the next pass the instructions can be changed in the right places.
Week 3: May 28 - June 3
- Write a method that inserts instructions to cache values of global variables after the first load, and remove subsequent reloads with a local load
Week 4: June 4 - June 10
- Write a method that updates the jump addresses that change as a result of the above.
Week 5: June 11 (Phase 1 Evaluation due 15th) - June 17
- Write a method that inserts instructions to cache values of modules that are loaded multiple times, and remove subsequent reloads with a local load
- Write a method that handles updating the jump addresses
Week 6: June 18 - June 24
- Write a method that inserts instructions to cache the value of functions after the first load, and remove subsequent reloads with a local load (this one might be a bit trickier, we'd have to make sure that other methods don't modify what was loaded)
- Write a method that handles updating the jump addresses
Week 7: June 25 - July 1
- Integration testing. I'm a little unsure about how these optimizations will combine and would like to write larger scale tests.

Primitive Types Optimization

Week 8: July 2 - July 8
- Discuss with mentor regarding where to put the static methods and how their use changes, as well as how to make these changes incrementally since the standard types are ubiquitous in Python code and in the codebase.
Week 9: July 9 (Phase 2 Evaluation due 13th) - July 15
- Refactor Str.java and provide all utility methods statically.
- Replace uses of Str methods in the codebase with calls to the static methods.
Week 10: July 16 - July 22
- Refactor Int.java and provide all utility methods statically
- Replace uses of Int methods in the codebase with calls to the static methods.
Week 11: July 23 - July 29
- Refactor Float.java and Complex.java and provide all utility methods statically
- Replace uses of Float and Complex methods with calls to the static methods.
Week 12: July 30 - August 5
- Refactor Bool.java and NoneType.java and provide all utility methods statically
- Replace uses of Bool and NoneType methods with calls to the static methods.

Note: The official coding period is May 14 - August 14, which is a 13 week period. I've accounted for 12 weeks of work here.

Risk Analysis

There are many gaps in my technical knowledge:

Reading bytecode
Debugging bytecode (there's a tool in ASM, but how useful is it?)
JVM specifics
VOC transpilation specifics
VOC's testing system
... and others I'm not aware of right now

I will try to mitigate this by keeping my mentor updated of my progress, updating this timeline as needed, and asking for help early.

freakboy3742 commented 6 years ago

@patiences This is a really solid first draft - great work!

A couple of suggestions:

It's great that you've included benchmarking in week 7; however, benchmarking is most useful when it's an ongoing measurement process. I'd suggest getting the benchmark situation sorted out first, so you can run the benchmark each week (or after each change) as a measure of ongoing progress.
Interning strings is a good idea, but I'm not sure if the approach you've proposed will work (at least, not as I've understood what you've described). The constant table in a class file needs to be Java primitives and references; Python strings are instances of a Python string class. So, it won't be possible to intern the Python string instances in the constant table; you'll need a different approach.

Of course, another approach would be to modify VOC to use Java strings in place of a separate string class - which require much less messing around with type conversion, but would mess up the internals a lot. However... I suspect it would also be a huge performance bump. Given that you've found conflicting opinions about whether StackMapFrames are worth the effort, our time might be better served looking to see if modifying VOC internals to use Java primitives rather than classes for Python primitives might be worthwhile. Even a speculative experiment that doesn't end up with production code might be a better use of GSoC time than implementing a bytecode structure that doesn't actually benefit performance.

patiences commented 6 years ago

The constant table in a class file needs to be Java primitives and references; Python strings are instances of a Python string class. So, it won't be possible to intern the Python string instances in the constant table; you'll need a different approach.

Ah, right!

another approach would be to modify VOC to use Java strings in place of a separate string class

Do you mean that python methods that operate on strings (Dict#getitem, for example?) should take java.lang.String strings instead?

Given that you've found conflicting opinions about whether StackMapFrames are worth the effort

Yeah, unfortunately I've seen multiple times that the performance gain was more of a theoretical advantage and has not been realized. :-( But wouldn't implementing StackMapFrames be a good idea since they are no longer optional in Java 8 -- is that a good enough reason?

freakboy3742 commented 6 years ago

Regarding strings: I was thinking that the potential optimisation is to completely remove any concept of an org.python.types.String, and replace them with java.lang.String (and similar with Integer, Boolean, and so on). Essentially, we'd be making a special case of Python primitive types, making them fall back to Java primitives.

The catch is that we'd need to provide all the utility methods that Python provides on it's primitive types (e.g., str.format) as a static method somewhere, and if you ever try to invoke a method on a primitive, it would be redirected to the static method.

It wouldn't be a trivial change, though - at the moment, we can assume every object is an org.python.Object; if we start using Java primitives, that distinction would be lost.

As for whether SMF's are worth adding... that's a good question. It's possibly a good move to still add them; at the moment, we can only generate Java 6 bytecode, and while Java VMs are completely backwards compatible, and J6 bytecode will continue to run, it means that if Java ever adds anything interesting (and they already have in J7 and J8), we're might start to hit problems if our Java source code (for the Python standard library) starts to use those features.

patiences commented 6 years ago

Does this mean that users that write Java extensions for their Python code also have to use Java 6 (I don't know how often this happens)? That seems a bit outdated, I personally can't imagine writing java code without J8 lambdas. If this is the case StackMapFrames would probably be a very good idea.

On the other hand, I think the special-casing Python primitive types would be more interesting work for me. :-)

freakboy3742 commented 6 years ago

Combining Java6 and Java8 classes in a single executable shouldn't be a problem... but those are always famous last words :-) And, we don't know what is going to be added in Java9 that might cause compatibility problems with Java6. Pinning ourselves to a known-deprecated format isn't a good idea in general.

Honestly, either StackMapFrames or trying to make better use of primitives would be a viable project; StackMapFrames is better defined, but more detailed; primitive use is a lot more exploratory.

patiences commented 6 years ago

@freakboy3742 Thanks for all the insights! :-)

I've updated my proposal to address your comment about getting the benchmarking down in the first week. I've also replaced the work on implementing StackMapFrames with the primitives types optimizations we've chatted about in the comments here, as it is more interesting work for me and will likely have a huge performance impact. I've also added an extra week to the variable lookup work period -- I've been thinking about how to update the jump addresses and I'm still not sure so I've factored in some more time for that.

Let me know what you think!

freakboy3742 commented 6 years ago

This is looking pretty good - certainly good enough to submit as a final proposal.

freakboy3742 commented 6 years ago

This project was selected for the 2018 GSoC.

patiences commented 6 years ago

At the end of week 1, here's what's been happening...

Started working on small integer preallocation. After meeting with Russ on Wednesday afternoon and getting answers to my questions, had enough information to come up with a WIP commit (https://github.com/patiences/voc/commit/22f506e1be073bdb00948a734364c66c71f34f69) but unit tests currently with a cryptic Java bytecode-related error.
Started working on writing an initial performance test suite, on which I can expand throughout GSoC. Looked at PyBench, PyStone and the performance and perf (https://perf.readthedocs.io/en/latest/) modules, which I'd like to try to use. Would love some feedback generally on whether or not it's a good idea to use the perf library, or to build an internal benchmarking/timing mechanism. WIP commit here: https://github.com/patiences/voc/commit/e385557954dc1381ae83b514022cc911e31b28c1. Weirdly enough running this test also gives me a stack map related error, which I need to investigate.
Having trouble running ASM, which I need to resolve ASAP to move forward. :) Edit: Thanks Russ!

patiences commented 6 years ago

Week 2 finished! I've been working on:

Small integer and boolean pre-allocation. Code here and here. I am currently investigating the effect on in-place operations, once that is fixed, should be ready for PR.
Performance benchmarking test suite, code here. Should be ready for PR soon as well.

No blockers, and plenty of work to do coming into week 3! :-)

patiences commented 6 years ago

End of week 3 update:

Made progress on small integer and boolean preallocation by further restricting Integer and Boolean construction in the codebase on the Java side rather than just the transpilation side. This should greatly improve performance of Integers and Booleans that are created as a result of executing the code (range, for example), and speaking of which:
I've been working a lot on how to demonstrate the performance gain of small integer and boolean preallocation. I realised that I wasn't properly measuring the execution time which was causing a lot of instability in the tests but I now have a system. :-) Will have results to share early next week.

Hoping to wrap up these 3 PRs this coming week, and starting work optimizing global loads!

patiences commented 6 years ago

End of week 4 update:

Boolean optimization is merged and small integer optimization and the benchmarking test are close. Hooray! These optimizations actually have created considerable performance improvement as can be seen by the performance tests results which is super exciting.
At the end of the week I started working on optimizing global loads (https://github.com/pybee/voc/pull/833), and will be continuing to work on that this week.

Thanks @freakboy3742 for spending extra time with me this week, even finding time to pair program! :-)

patiences commented 6 years ago

End of week 5 update (sorry late!):

Small integer optimization and the benchmarking test are merged! Yay! The integer optimization definitely gave us a performance boost (though not as drastic as the boolean optimization), and the test suite provides a way to detect changes that affect performance.
This week has been all about working on optimizing global loads. I opened a number of pull requests experimenting with different types of caching, and after meeting with Russ, decided to focus on trying to cache individual modules. Unfortunately I didn't manage to get pystone working (trying to run it through voc results in a org.objectweb.asm.tree.analysis.AnalyzerException: Error at instruction 3882: Incompatible stack heights), I may look at that in a future week. This week my goal is to get one optimization working.

patiences commented 6 years ago

End of week 6!

I've spent the past week on https://github.com/pybee/voc/pull/839, an optimization for global loads that caches Module objects since those get looked up a lot. There are some pretty neat performance improvements (in the PR) already. The last piece is getting it working for generator functions, and I'm hoping to close this next week.
I've also started working on https://github.com/pybee/voc/pull/845, which is an optimization for function calls that take no arguments. The code is all there, but I'm having trouble writing a performance test for this one, so I'll be working on this early next week as well.

Next week I'll look at getting pystone working as well, as it would be super beneficial, and perhaps doing some profiling. On another note, I've written a blog post about my first month working on VOC, check it out if interested :-)

patiences commented 6 years ago

End of week 7 update:

This week I spent a lot of time on https://github.com/pybee/voc/pull/839, trying to optimize GeneratorFunctions and I think I've found out why it's not working/why it's not a good idea. I'm going to spend just a little more time this week wrapping this up, and then it'll be ready for review.
I've been working on writing a performance test for https://github.com/pybee/voc/pull/845 to show the performance improvement, but I haven't been able to write such a test. I think this is because while there are savings any time we can create fewer objects, the extensive null checks I added to support passing null as an argument on the java side perhaps cancels out any improvement. So it might be the case that there is no actual performance improvement. I'll check with Russ next week to see what we want to do with this PR.
This week I was also able to get pystone running which revealed a bug (fix here). These PRs should be merged soon, and I'll be taking a closer look at the pystone tests to see if we can identify some more candidates for performance optimization there.

patiences commented 6 years ago

End of week 8 update (late because I was away Jul 5-13):

Hooked up pystone PR! The difference on my machine between the beginning of May (before I started working on my GSoC project) and on master now is about 2.8x faster (from ~500 pystones/sec to ~1400 pystones/sec) which is crazy!
Merged https://github.com/pybee/voc/pull/852 -- a bug fix (mentioned above).
Merged https://github.com/pybee/voc/pull/869 -- an optimization that comes from not creating __code__ attributes on functions. An issue to track __code__ inspection in the future is here.
Almost there with https://github.com/pybee/voc/pull/864 -- this one has been dragging because I've had so much trouble showing performance improvement by benchmarking. It turns out that this is because setting up for a function invocation (the site of this optimization) is nothing compared to how much time the function invocation actually takes. I've verified that the generated bytecode is smaller, so there is a little more cleaning up to do and then it will be ready for review.

I've written another blog post about the module optimization from this past month (https://github.com/pybee/voc/pull/839), if anyone wants to know more about that. :-)

patiences commented 6 years ago

End of week 9:

Merged https://github.com/pybee/voc/pull/875. This is an optimization that targets comparison operations (like >, ==, etc) due to org.python.types.Methods (unneccessarily) being created for each operation. I'm happy to say that it's resulted in an huge performance improvement on pystone!
There's a bit more work to do on https://github.com/pybee/voc/pull/877, a few small bugs that came out of working on #875 above. I'll wrap this up early next week.

Next week, I'll continue working through some potential optimizations based on results I've gotten from JProfile, so there's plenty more work to do!

patiences commented 6 years ago

End of week 10 update:

Merged https://github.com/pybee/voc/pull/881. This is an optimization targeting nested loops, by reusing the same instance of StopIterations. This does however introduce some behavioral differences between VOC and CPython, documented here: https://voc.readthedocs.io/en/latest/reference/differences.html
This past week I've been profiling small pieces of code that have exposed a few small bugs, including https://github.com/pybee/voc/issues/882, https://github.com/pybee/voc/issues/885, https://github.com/pybee/voc/issues/886, https://github.com/pybee/voc/issues/888, https://github.com/pybee/voc/issues/890.
Open PRs include 1) https://github.com/pybee/voc/pull/877, which is some bug fixes and refactoring around __contains__/__not_contains__. 2) An optimization targeting __dict__ creation which needs a few kinks worked out, and 3) the beginning of some work on reducing the number of org/python/types/Str objects created (https://github.com/pybee/voc/pull/892).

patiences commented 6 years ago

End of week 11 update:

Merged https://github.com/pybee/voc/pull/893. This was supposed to be just a cleanup of org/python/types/Str, but it turned out to be a decent optimization especially for the print function. (See the PR for benchmarking results)
Merged https://github.com/pybee/voc/pull/899. This is an optimization for org/python/types/Dict operations, which are ubiquitous in Python and also in the Pystone benchmark (we saw an amazing performance improvement on Pystone with this one, details in the PR).
Decided not to continue pursuing https://github.com/pybee/voc/pull/880, because it wasn't showing performance improvement and I was a bit mistaken about the way that Java initializes HashMaps.
Logged a few more bugs (https://github.com/pybee/voc/pull/889, https://github.com/pybee/voc/pull/904) and a quick refactoring (https://github.com/pybee/voc/pull/894).

This coming Tuesday is the official cutoff for GSoC, but I will continue working for about a week after. I'm currently working on a blog post for the website, and wrapping up a few more PRs this coming week.

patiences commented 6 years ago

End of week 12 update:

Merged https://github.com/pybee/voc/pull/910. This is the same optimization as https://github.com/pybee/voc/pull/899, for Dict.__setitem__().
Still working on https://github.com/pybee/voc/pull/902, which still needs some cleaning up. The optimization is about creating standard org/python/Object functions at demand, not at module initialization time, which speeds up our initialization time.
Started working on https://github.com/pybee/voc/pull/909 , which is about method caching. Method calls are currently very expensive in VOC because each time we call a method, we create it anew. This optimization yields a significant performance boost, but we are still figuring out how to handle __dict__ inspection if method objects are cached there.
Started working on https://github.com/pybee/voc/pull/905 which is about adding a python base object test suite. I'm having a lot of trouble with getting the tests passing, it looks like there are a lot of bugs around this (some of which, if not too hairy, I'm also addressing in the PR). Still more work to be done for this.
https://github.com/pybee/voc/pull/911 is ready for review -- it's a clean up of (what seems to be) legacy code that creates unnecessary objects.
Started working on https://github.com/pybee/voc/pull/913, which is part of a bug fix for https://github.com/pybee/voc/issues/907 and https://github.com/pybee/voc/issues/912.

I'm also currently working on hooking up a few more benchmarks from http://speed.pypy.org/comparison/ (which is where I'm finding all these new bugs!). I will try to wrap these up early this week if I can as I am planning on working as part of GSoC until Wednesday (but will be around to finish up anything that needs a little more time of course).

beeware / voc