flightaware / Tcl-bounties

Bounty program for improvements to Tcl and certain Tcl packages

Tcl runtime performance improvements #21

Closed sebres closed 7 years ago

sebres commented 7 years ago

I can try to back-port several pieces of functionality from my tclSE to the current trunk/tcl8.6, once I get the tcl-clock speed-up production (or acceptance) ready. So I think that https://github.com/flightaware/Tcl-bounties#tcl-runtime-performance-improvements would be reasonably feasible, at least to 2x. But 10x (at least partially) is also imaginable.

For example, I have there another (faster) lexer, a very fast byte-code engine, a faster TclOO engine, etc.

A couple of questions in between:

sebres commented 7 years ago

BTW, apropos "TBD" and "We need help with the benchmark"...

The timerate command, back-ported from my tclSE (see https://github.com/sebres/tcl/pull/3), can be used together with diff to create performance-measurement scripts.

For an example of how this can be done, see https://github.com/sebres/tcl/pull/2 (script test-performance.tcl).

Additionally, I can imagine that many performance test cases could be extracted (even semi-automatically) from Tcl's own test suite (the tests folder).
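
A minimal sketch of such a measurement script (assuming a build where timerate is available, e.g. trunk or the back-ported version from the PR above; the two bodies compared are just examples):

```tcl
# Calibrate once so the measurement overhead itself is subtracted
# from all following results:
timerate -calibrate {}

set s [string repeat A 50000]

# Compare the ensemble form against the direct low-level command;
# each result line has the form "µs/# count #/sec nett-ms":
puts [timerate {string first A $s 2000}]
puts [timerate {::tcl::string::first A $s 2000}]
```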

sebres commented 7 years ago

I found an old ticket of mine that explains how easily a large performance improvement can be achieved... The current ensemble handling has a very large overhead (8.6 through trunk) in comparison to pure command execution. For example, in current trunk the execution of an ensemble takes 70% more time than the command itself (i5-6500, 3.2 GHz, 6 MB cache, 34.1 GB/s memory bandwidth):

% set s [string repeat A 50000]; puts ok
ok
%
% # calibrate to minimize influence of measurement overhead (and byte code execution overhead):
% timerate -calibrate {}
0.04537374883248807 µs/#-overhead, 0.045506 µs/#, 42895062 #, 21974929 #/sec
%
% # test compiled execution:
% timerate {string first A $s 2000}
0.321560 µs/#, 2725287 #, 3109837 #/sec
% timerate {::tcl::string::first A $s 2000}
0.191541 µs/#, 4220926 #, 5220810 #/sec
%
% # calibrate again (slower, non-compiled direct execution via TclEvalObjEx):
% timerate -direct -calibrate {}
0.07950384713110215 µs/#-overhead, 0.080109 µs/#, 24366669 #, 12482924 #/sec
%
% timerate -direct {string first A $s 2000}
0.313794 µs/#, 2542604 #, 3186803 #/sec
% timerate -direct {::tcl::string::first A $s 2000}
0.190985 µs/#, 3697004 #, 5236000 #/sec
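
As an aside, the measured gap suggests a (hypothetical) workaround for hot paths: an interp alias bound directly to the implementation command skips the per-call ensemble subcommand dispatch (sfirst is an invented name):

```tcl
# Bind an alias straight to the command that implements the
# "string first" subcommand, bypassing ensemble dispatch:
interp alias {} sfirst {} ::tcl::string::first

set s [string repeat A 50000]
puts [sfirst A $s 2000]   ;# same result as [string first A $s 2000]
```
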
sebres commented 7 years ago

The above-mentioned results were from an i5-6500.

On an i5-2400 it looks even worse (i5-2400, 3.1 GHz, 6 MB cache, 21 GB/s memory bandwidth):

% set s [string repeat A 50000]; puts ok
ok
%
% # calibrate to minimize influence of measurement overhead (and byte code execution overhead):
% timerate -calibrate {}
0.059596535033614534 µs/#-overhead 0.059622 µs/# 32739719 # 16772397 #/sec
%
% # test compiled execution:
% timerate {string first A $s 2000}
0.421260 µs/# 2079582 # 2373830 #/sec 876.045 nett-ms
% timerate {::tcl::string::first A $s 2000}
0.226579 µs/# 3494358 # 4413466 #/sec 791.749 nett-ms
%
% # calibrate again (slower, non-compiled direct execution via TclEvalObjEx):
% timerate -direct -calibrate {}
0.1062353003542841 µs/#-overhead 0.106250 µs/# 18371716 # 9411739 #/sec
%
% timerate -direct {string first A $s 2000}
0.405472 µs/# 1954244 # 2466262 #/sec 792.391 nett-ms
% timerate -direct {::tcl::string::first A $s 2000}
0.213387 µs/# 3128694 # 4686318 #/sec 667.623 nett-ms

Because the two systems differ almost only in memory bandwidth, the ensemble overhead seems to be memory-related as well (very strange in this constellation).

sebres commented 7 years ago

Apropos the "issue" from above (ensemble overhead) in tcl8.5 and my SE fork: below you'll find results compared with current trunk on an i5-6500, 3.2 GHz.

tcl8.5:

% info patchlevel
- 8.7a0
+ 8.5.19
% timerate -calibrate {}
- 0.04537374883248807 µs/#-overhead, 0.045506 µs/#, 42895062 #, 21974929 #/sec
+ 0.030185025513108005 µs/#-overhead 0.030270 µs/# 64487338 # 33036546 #/sec
%
% timerate {string first A $s 2000}
- 0.321560 µs/#, 2725287 #, 3109837 #/sec
+ 0.256630 µs/# 3192446 # 3896667 #/sec 819.276 nett-ms
% timerate {::tcl::string::first A $s 2000}
- 0.191541 µs/#, 4220926 #, 5220810 #/sec
+ 0.133809 µs/# 5251583 # 7473339 #/sec 702.709 nett-ms

tclSE:

% info patchlevel
- 8.7a0
+ 9.0-SE.18
% timerate -calibrate {}
- 0.04537374883248807 µs/#-overhead, 0.045506 µs/#, 42895062 #, 21974929 #/sec
+ 0.030022483025454204 µs/#-overhead 0.030022 µs/# 65017940 # 33308370 #/sec
%
% timerate {string first A $s 2000}
- 0.321560 µs/#, 2725287 #, 3109837 #/sec
+ 0.122230 µs/# 6568019 # 8181266 #/sec 802.812 nett-ms
% timerate {::tcl::string::first A $s 2000}
- 0.191541 µs/#, 4220926 #, 5220810 #/sec
+ 0.117670 µs/# 6770841 # 8498362 #/sec 796.723 nett-ms
sebres commented 7 years ago

I've found the difference between my tclSE and the Tcl core as regards the ensemble overhead.

Because clock is also an ensemble command, it is now additionally faster (up to 2x).

Below you'll find the new measurement results for the clock command: clock-perf-test-new-without-ensemble-overhead.txt; just compare with the previous clock-test-perf-new.txt.

Of course this concerns all ensemble commands in Tcl. Thus the 2x runtime performance improvement is (at least partially) reached.

I'll wait for feedback... because so far there has been no reaction at all here (neither have my questions been answered, nor have any statements been made about the performance test scripts, etc.). If there is still interest, I'll rebase it as a new performance branch in fossil.

sebres commented 7 years ago

Closed as not interested

resuna commented 7 years ago

Sorry for not getting back to you on this; I've just started taking over from Karl as the front guy on the bounties.

Let me start with these questions:

Benchmark program to be determined. So it doesn't exist at this point.

We're kind of expecting a 10x speedup to require JIT native code generation. Both speedups would be expected to be more apparent in heavily algorithmic and numeric code, rather than code that spends most of its time called out to libraries and extensions.

sebres commented 7 years ago

Hi Peter, sorry for the silence... (too many work-related things to be done).

Anyway, I'm back now and will try to rebase all of this as a "sebres-performance-branch" in the tcl-core fossil repository.

BTW, have you had a chance in the meantime to take a look at my clock solution?

sebres commented 7 years ago

Gah, I almost forgot!

Recently I took a closer look at the ticket "Exploding runtime with growing list of after idle events" and found a very large difference from my fork branch.

And it is not only the mentioned issue, but a fundamental problem with timer and idle events, so it currently has a serious impact on event-driven Tcl programs. I always develop event-driven, so all my fundamental classes (database, sockets, etc.) are asynchronous, i.e. their handling is event-driven. That's why in my fork I've totally rewritten the tclTimer module, i.e. the after command, etc.

If you also develop event-driven code, this may be very helpful and can provide a very large performance increase (up to several hundred thousand times, depending on the size of the event queue), especially multi-threaded, because the original handling is very CPU-hungry and cache-expensive.
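
A minimal sketch of the kind of load being measured (core Tcl only; the queue size n is arbitrary): fill the queue with idle events, then time how long the event loop needs to drain it:

```tcl
# Schedule n idle events, then drain them through the event loop;
# in the original tclTimer implementation the per-event cost grows
# with the queue length instead of staying constant.
set n 10000
set done 0
for {set i 0} {$i < $n} {incr i} {
    after idle {incr ::done}
}
set start [clock microseconds]
while {$done < $n} {vwait done}
puts "drained $n events in [expr {([clock microseconds] - $start) / 1000.0}] ms"
```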

I've created several performance test-cases, whose result you can find below:

Diff: event-perf-test.diff.txt Org-tcl8.7: event-perf-test-org.txt Mod-tcl8.7: event-perf-test-new.txt

If interested also, I can rebase it into my new performance-branch in fossil together with ensembles-improvement.

The overhead costs in the original tcl-core are not linear (sometimes they grow exponentially); please find below a small excerpt of the total summary for 10000 and 60000 events in the queue:

10000 events in queue:

-Total 8 cases in 0.00 sec. (3.84 nett-sec.):
+Total 8 cases in 0.00 sec. (0.02 nett-sec.):
-3840255.000000 µs/# 8 # 2.083 #/sec 3840.255 nett-ms
+23607.000000 µs/# 8 # 338.883 #/sec 23.607 nett-ms
 Average:
-480031.875000 µs/# 1 # 2 #/sec 480.032 nett-ms
+2950.875000 µs/# 1 # 339 #/sec 2.951 nett-ms
 Max:
-1489702 µs/# 1 # 0.671 #/sec 1489.702 nett-ms
+4178.00 µs/# 1 # 239.35 #/sec 4.178 nett-ms

60000 events in queue:

-Total 8 cases in 0.00 sec. (352.27 nett-sec.):
+Total 8 cases in 0.00 sec. (0.16 nett-sec.):
-352267646.000000 µs/# 8 # 0.023 #/sec 352267.646 nett-ms
+157668.000000 µs/# 8 # 50.740 #/sec 157.668 nett-ms
 Average:
-44033455.750000 µs/# 1 # 0 #/sec 44033.456 nett-ms
+19708.500000 µs/# 1 # 51 #/sec 19.709 nett-ms
 Max:
-169016982 µs/# 1 # 0.006 #/sec 169016.982 nett-ms
+27429.0 µs/# 1 # 36.458 #/sec 27.429 nett-ms
resuna commented 7 years ago

This looks promising. We also do a lot of event-driven code, and I've had to play games moving events from one queue to another to avoid starvation but haven't dug deeper because the behavior I saw was easily explained simply by the lower priority of timer events.

A 2x speedup for ensemble commands by itself isn't likely to get a 2x speedup overall, since only a fairly small number of commands are ensembles. [string subcommand] and [array subcommand] seem the most likely candidates to have an impact, and they're only a few percent (a quick check of a sample of my code found them in about 5% of non-comment non-empty lines).

Combining all these together might get there.

sebres commented 7 years ago

By the way: I've found another screw that could be turned to improve the whole of Tcl's performance (byte-code execution) up to 2x and higher (I have even measured 6x-10x on some complex cases, but I'm not sure all of it is portable to the core without an extremely large effort).

Just as reference values:

% info patchlevel
-8.6.7
+8.6-SE.10b3
% timerate -calibrate {}
-0.06132125572047973 µs/#-overhead 0.061321 µs/# 31832339 # 16307559 #/sec
+0.03671978085047272 µs/#-overhead 0.036918 µs/# 52874530 # 27087375 #/sec
% timerate {set a 1}
-0.067728 µs/# 7749007 # 14765019 #/sec 524.822 nett-ms
+0.010424 µs/# 13938133 # 95928567 #/sec 145.297 nett-ms
% timerate {set a}
-0.033416 µs/# 10555488 # 29925545 #/sec 352.725 nett-ms
+0.005117 µs/# 15051566 # 195424123 #/sec 77.020 nett-ms

Please note that even the execution of compiled empty code is about two times faster (see the calibration overhead: ~0.0367 µs vs. 0.061 µs).

In addition to my latest above-mentioned event-performance branch (and fix-ensemble-overhead), the combination of all of these should exceed a 2x speed increase on almost all tasks and has very considerable potential to reach the 10x bound. (10x was already reached on several event tasks; see the summaries under the spoiler in https://github.com/sebres/tcl/pull/4#issuecomment-312252977.)

The problems I currently have, in addition to the general complexity of back-porting this "screw" to the standard tcl-core:

Thus I'm really not sure I want to take this step and open up yet another construction site without prospects of sustained success. @resuna, perhaps we should rather close this and not mislead others; let other people try their luck with the bounties.

sebres commented 7 years ago

Thus closing; sorry for the noise.

sebres commented 6 years ago

Because this is about performance, just as a reference, BTW: https://github.com/sebres/tcl/pull/5

I tried it with an "old" (but big) application of mine, which uses regexp in the wild (directly and indirectly), sometimes very intensively (and with bindings to sqlite as functions): after switching init.tcl to pcre with the one-line mod interp regexp {} pcre (and fixing exactly 2 incompatible regexps), the whole app runs 20x, sometimes 30x, faster. The internal test cases also ran ca. 30x faster.

Rewriting some loops that search via regexp for multiple matches from Tcl's NFA to a DFA (with exactly one match search over multiple alternatives and a single loop over the result list of that match, instead of multiple incremental searches inside the loop) increases the performance of this functionality up to 1000x!
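
The rewrite can be sketched in plain Tcl (a hypothetical example; the patterns and input are invented): instead of restarting an incremental search per alternative, run exactly one -all match over the combined alternation and loop over its result list:

```tcl
set text "foo bar baz foo qux bar"

# Before: one incremental search per alternative, restarted in a loop.
set hits {}
foreach pat {foo bar baz} {
    set idx 0
    while {[regexp -start $idx -indices -- $pat $text m]} {
        lappend hits [string range $text {*}$m]
        set idx [expr {[lindex $m 1] + 1}]
    }
}

# After: exactly one match search over the alternation, then a single
# loop over its results (matches come back in text order here):
set hits [regexp -all -inline -- {foo|bar|baz} $text]
foreach match $hits { puts $match }
```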

Regards, Sergey.