jjvanhorssen opened this issue 3 years ago
Thanks for the benchmark. I don't see anything obvious here that ParparVM is known to be slow with. I would need to profile this in Xcode to see if there are any heavy hitters. I don't have an ETA as my plate is pretty full at the moment.
FYI, I also tried with native implementations of Long.bitCount() and Long.numberOfTrailingZeros() for JavaSE, iOS/Objective C (e.g. __builtin_popcount) and Android/Java. My benchmark gets 15% faster on the MacBook/Simulator, 7% slower on the Samsung tablet and 30% slower on the iPhone X. So it seems there is no support (yet) on device hardware for popcount instructions etc. and there is native call overhead. Too bad but it was worth a shot.
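For reference, the native implementations mentioned above would look roughly like the following sketch (function names are illustrative, not the actual native-interface signatures). On compilers that support them, `__builtin_popcountll` and `__builtin_ctzll` map to a single instruction where the target CPU provides one, and to a library fallback otherwise, which matches the observation that the native call overhead can outweigh the gain:

```c
#include <stdint.h>

/* Hypothetical native replacement for Long.bitCount(). */
static int64_t native_bit_count(int64_t i) {
    return __builtin_popcountll((uint64_t) i);
}

/* Hypothetical native replacement for Long.numberOfTrailingZeros();
   __builtin_ctzll is undefined for 0, so guard that case explicitly,
   matching the Java contract which returns 64 for zero. */
static int64_t native_trailing_zeros(int64_t i) {
    return i == 0 ? 64 : __builtin_ctzll((uint64_t) i);
}
```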
To add some historic information: at the time of the introduction of my app in 2013 (32-bit), execution on iOS was clearly faster than on Android (iPhone 5 & iPad 4 vs Google Nexus 7 tablet).
A couple of points: You can implement performance-critical methods in native code (see the native keyword). Many things have changed since 2013 on iOS, Android, and Codename One. I can't draw any conclusions from your observation that your iPad 4 was faster than the Nexus 7. That is to be expected; the Nexus 7 was quite slow.
If you want to look for performance regressions in Codename One, try building your app now and test it on iPhone 5, iPad 4, and Nexus 7, and see if you get the same results as in 2013.
Okay thank you for the information, I will look into it some more.
I ran some additional benchmarks and also I profiled my app on device with Xcode. It is clear now where the performance bottleneck is. The ParparVM code generation is very general and powerful but not suitable for (say) a high-performance chess engine, as there is a lot of overhead such as code tracing. My engine consists of >6000 lines of code of which large parts are time-critical, with hundreds of (small) methods, many meant to be inlined by an optimizing compiler.
Now I am considering translating the complete engine to Objective-C using J2ObjC and calling it somehow from a native method. So I want to combine my CN1 Java GUI with the Objective-C engine. Is this possible with Codename One, and do you think it is a good path to follow?
Interesting approach. It might work. Personally, I think it would be easier to just identify the bottleneck methods and reimplement those in C. Adding J2ObjC introduces more complexity, and it isn't clear that it will provide better performance.
ParparVM already does optimization for many common cases, both with native C implementations of important methods, and with compiler optimizations at the AST level. It is likely that your code is just using some particular methods that are expensive/not optimized on ParparVM.
Another option, if the bottleneck is the "small methods" that need to be inlined, is to preprocess your library using Proguard, as it has configuration settings to perform aggressive inlining like this.
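As an illustration of the ProGuard approach, a configuration along these lines could be a starting point (option names are from the ProGuard manual; the package name is a placeholder, and the effect would need to be measured):

```
# Hypothetical ProGuard configuration sketch: optimize without
# obfuscating, enable method inlining, run several passes, and keep
# the library's public API intact.
-dontobfuscate
-optimizations method/inlining/*
-optimizationpasses 4
-keep public class com.xx.yy.shared.** { public *; }
```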
Using the Xcode profiler I identified 15 small methods that were most time-consuming and re-implemented those in C. That is, I replaced the generated code with the C code in the .m file, so there is no calling mechanism overhead. Example:
Java code:
public static long Long_lowestOneBit(long i) {
return i & -i;
}
Generated code:
JAVA_LONG com_xx_yy_shared_Util_Long_lowestOneBit___long_R_long(CODENAME_ONE_THREAD_STATE, JAVA_LONG __cn1Arg1) {
volatile JAVA_LONG llocals_0_ = 0; /* i */
__STATIC_INITIALIZER_com_xx_yy_shared_Util(threadStateData);
DEFINE_METHOD_STACK(4, 2, 0, 10560, 10590);
llocals_0_ = __cn1Arg1;
__CN1_DEBUG_INFO(261);
BC_LLOAD(0);
BC_LLOAD(0);
SP[-1].data.l *= -1; /* LNEG */
SP--; SP[-1].data.l = SP[-1].data.l & (*SP).data.l; /* LAND */
releaseForReturn(threadStateData, cn1LocalsBeginInThread);
return POP_LONG();
}
Native C code:
JAVA_LONG com_xx_yy_shared_Util_Long_lowestOneBit___long_R_long(CODENAME_ONE_THREAD_STATE, JAVA_LONG i) {
return i & -i;
}
or even
JAVA_LONG com_xx_yy_shared_Util_Long_lowestOneBit___long_R_long(JAVA_LONG i) {
return i & -i;
}
and remove 'CODENAME_ONE_THREAD_STATE,' from the definition and 'threadStateData,' from the calls.
Doing this for 15 methods of 1-10 lines each, and disabling __CN1_DEBUG_INFO (undef/define) in a few classes, made my app run more than twice as fast. That is a good start, but more should be possible.
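The undef/define trick for disabling __CN1_DEBUG_INFO mentioned above could look roughly like this (a sketch, placed near the top of a generated .m file; the function below is illustrative, not actual ParparVM output):

```c
/* Hypothetical per-file override: redefine the debug macro to a no-op
   so the per-statement line tracking compiles away for this file. */
#undef __CN1_DEBUG_INFO
#define __CN1_DEBUG_INFO(line) /* no-op */

/* With the macro disabled, the debug statement costs nothing: */
long long lowest_one_bit(long long i) {
    __CN1_DEBUG_INFO(261);  /* expands to nothing now */
    return i & -i;
}
```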
Profiling the new version, the most time-consuming method is now initMethodStack (as a result of every method having DEFINE_METHOD_STACK(...)), which consumes 24.3% (and releaseForReturn consumes 2.5%). I put in a counter and found that in my new benchmark (in the complete app), which runs in about 80 seconds on an iPhone X, initMethodStack is called about 1.5 billion times (roughly 13 ns per call)! That is a lot of overhead (together with releaseForReturn and __CN1_DEBUG_INFO).
The next most time-consuming method is my main search function at (only) 12.1%, followed by my move generator functions: many methods at only a few percent each. Rewriting all of that in C is a lot of work, so it would be helpful if there were a mechanism to avoid the (ParparVM) overhead. For instance, to be able to write in Java:
@IOSNative
public static long Long_lowestOneBit(long i) {
return i & -i;
}
(For Android, @IOSNative is ignored.)
If this is not possible or too drastic, it would also be helpful if it were possible to override the iOS code generated for a function by providing the C version yourself (see above) in e.g. /native/ios/com_xx_yy_shared_Util_native.m. If a method com_xx_yy_shared_Util_Long_lowestOneBit___long_R_long is present there it will be used, otherwise the normal code is generated.
The native methods can use the CN1 macros, e.g. to access data (arrays). Here is another example:
JAVA_INT com_xx_yy_shared_Util_long2index___long_R_int(JAVA_LONG b) {
JAVA_LONG b1 = (b ^ (b - 1));
JAVA_INT folded = (((JAVA_INT)b1) ^ ((JAVA_INT)BC_LUSHR_EXPR(b1, 32)));
return CN1_ARRAY_ELEMENT_INT(get_static_com_xx_yy_shared_Util_foldedTable(), BC_IUSHR_EXPR((folded * 2015959759), 26));
}
Using code substitution like this is runtime efficient, and I don't see how I can match that using the currently available mechanisms. It is also tedious to do manually for each app update.
Of course I am open to feedback and suggestions. If there were a way to drastically improve performance for code like this (basically C programming using CPU & RAM only, that needs to run as fast as possible) without having to rewrite hundreds or thousands of lines of code that would be awesome.
Since most of the overhead is from very short method calls (initializing and releasing the method stack) there may be significant gains by preprocessing the class files with proguard. It provides many optimizations, including inlining.
The easiest way to test this out would be to package your library as a cn1lib. Then perform surgery on it as follows:
Using a cn1lib and ProGuard to preprocess the class files gave an out-of-the-box improvement of about 8% compared to the baseline version (no optimizations). By changing the code a bit (making many short critical functions private to the library instead of public) and allowing more optimization passes (4; I don't know if this does much), this improved to a 20% time reduction on average. I don't see how this can be improved significantly.
The 20% time reduction is modest compared to the time reduction achieved by substituting generated code with native code for 15 functions (see above): about 57% on average. And more is possible by providing more native code. As I said, this is a lot of work to do manually, and it creates a maintenance issue.
Is Codename One prepared to consider my suggestions (see above) to make for more efficient compilation of low-level code or to facilitate native code substitution?
There are "simple" things we can do, like removing __CN1_DEBUG_INFO. This might be doable with an annotation so you don't have to do it globally; you just won't have class information for that specific file.
I'm not sure if removing CODENAME_ONE_THREAD_STATE is practical in terms of the code, or how much of an impact that will carry, though. There's a lot of code in the optimizer section that's smart enough to detect some trivial methods like getters/setters and effectively generate a direct getter/setter. It might be adaptable to some of the methods you have.
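As a sketch of the kind of reduction described here (the types and names are illustrative, not actual ParparVM output), a recognized getter could collapse from the full stack-machine form down to a direct field read:

```c
/* Illustrative only. The full generated form of a getter would set up
   a method stack, push the receiver, read the field, and release the
   stack on return. A getter recognized by the optimizer can instead
   become a direct field access with no stack bookkeeping at all. */
typedef struct {
    long long score;
} Position;

/* Reduced form: no DEFINE_METHOD_STACK, no debug info, no release. */
long long Position_getScore(Position *self) {
    return self->score;
}
```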
Whether we'll actually do this, I don't know; we may not have the resources.
I'm not sure if the J2ObjC route will give you a much better result since it will generate Objective-C code with slower method calls and ARC overhead. I'm not sure if that code will even work with ours since we don't support ARC.
It would be nice if it were possible to remove all or most of the VM overhead for low-level functions, although I understand the difficulties (you probably need at least the stack and SP). Ideally, I would like to see removed: CODENAME_ONE_THREAD_STATE, use of the volatile keyword, double initialization of llocalsi, the __STATIC_INITIALIZER, DEFINE_METHOD_STACK, __CN1_DEBUG_INFO, and releaseForReturn. My estimate is that this overhead slows down my engine by a factor of 3 to 4, which translates to a significant loss of playing strength on iOS devices. So please consider this issue a request for change.
The optimizer code can remove most of this and literally translate a getter or setter to its equivalent. Right now we can't change the method signature, since it's invoked from everywhere, so this overhead is fixed. But since a simple method won't generate garbage and won't throw an exception, it doesn't need all of those structures and is an ideal candidate for optimizations of that type.
Notice that this optimizer code is written 100% in Java and while it's a bit convoluted, it's still not too hard if you understand the basics of bytecode.
Thanks. I had a look at the translator code, but delving into this is a rather big detour for me. Good to see that the issue is marked as a 'good first issue'!
The CPU performance of Codename One apps on iOS devices is suboptimal. See also https://stackoverflow.com/questions/66537486/codename-one-ios-64-bit-performance.
Attached is a performance test project, consisting of the standard performance test (called Perft) as used by chess engines, or in this case a 10x10 draughts engine. The project is complete, it just needs signing for iOS builds. Just press the Go! button and wait for an email to pop up with the log file contents, consisting of the performance test results. Multiple runs are possible.
See Perft.zip
Below are my test results.
MacBook CN1 Simulator: 3.61 sec
Samsung Tab A 10.1 Android debug build: 26.49 sec
Samsung Tab A 10.1 Android release build: 15.64 sec
iPhone X iOS debug build: 56.72 sec
The performance on the Samsung is satisfactory, the performance on the iPhone X is below par. The question is whether the performance can be improved for iOS devices.