serious performance breakdown of ksh2020 in comparison to ksh93u+

att / ast

AST - AT&T Software Technology

Eclipse Public License 1.0

558 stars 152 forks source link

serious performance breakdown of ksh2020 in comparison to ksh93u+ #1449

Open jghub opened 4 years ago

jghub commented 4 years ago

Description of problem:

consider a stupid micro benchmark like ` function repeatloop { nameref count=$1 integer i=0 buf=0 for ((i=0; i< $count; i++)); do ((buf++)) : done print $buf }

count=1000000 time repeatloop count `

Ksh version: on my machine (OSX 10.4.6, powerbook)

93u+: 1.01s 2020 release: 2.12s head (42a580c5): 2.98s

How reproducible: always

Steps to reproduce: run above script

Actual results:

100-200% slow down (i.e. by a factor 2-3!) of ksh2020 in comparison to 93u+. although ksh2020 thus remains faster than bash (11s) the "order of magnitude faster than bash" is no more . if this is not fixed one might consider to copy this line from the bash manpage: "It's too big and too slow."

I find a performance breakdown by such a massive factor disconcerting. it is a big step back from ksh93u+.

Expected results:

competitive performance

Additional info:

do people see this behaviour on other platforms, too?

KornShell language is not only a command interpreter for interactive use. it also is a serious programming language. performance matters a lot. ksh always had by far the best performance of the major shells. current development in this ksh2020 project seems to severely damage ksh in this respect. I hope this can be fixed. are there performance benchmarks in the test suite at all?

krader1961 commented 4 years ago

consider a stupid micro benchmark

LOL! Yes, a lot of micro benchmarks are stupid in as much as they are irrelevant with respect to any real world program or shell script. Having said that, I am disturbed by the factor of three change in run time that I am also seeing on my primary platform.

KornShell language is not only a command interpreter for interactive use. it also is a serious programming language.

I disagree, notwithstanding being a fan of ksh as the first UNIX shell I considered usable. The korn shell can not be a "serious programming" language due to the POSIX shell standard. Having said that I agree this slow down is not acceptable.

ghost commented 4 years ago

The korn shell can not be a "serious programming" language due to the POSIX shell standard.

Weird. I use it as a serious programming language all the time.

jghub commented 4 years ago

yes, that is a very strange and very wrong, but no longer surprising, assessment by krader1961. it also makes me wonder where he sees any useful application for ksh: he does not like it for interactive use ("you should use fish"). he does not believe one should use it for substantial programming. so what? 10-50 lines shell scripts for stupid tasks? is that all, really? no. it is "just another useful scripting language". Korn himself at least viewed it on par with perl, Tcl, etc.. while I not fully agree, he sure was nearer to the truth than krader1961...

@krader1961: at least I am happy to see that you agree to label the slowdown correctly as unacceptable. it totally is. see above: people do use ksh for programming. and speed matters a lot. always. do you have any idea what broke performance so massively? 10% is one thing. 300% is something else.

jghub commented 4 years ago

to be on the save side I ran the test with Version ABIJM 93v- 2014-12-24, too. result: 93v- about 10% slower than 93u+. that I can understand: 93v- was in beta state and would have seen further optimization in due course, if development had not stopped then... so in any case the slowdown is not the fault of the code base you started with.

krader1961 commented 4 years ago

@jghub Your script does not depend on any change introduced by ksh93v-, and should not be affected by any change introduced by that experimental code. So I don't understand why you think that 10% slow down is understandable. Especially since even a 10% slow down between adjacent minor releases is huge.

jghub commented 4 years ago

300% is huge. 10% is 10%. 93v would have been a major release relative to 93u, not a minor one, in the view of D. Korn as far as I can tell.

understandable/possible explanations:

not identically optimized compilation of 93u+ release and 93v- beta version?
10% difference is near the margin of reproducibility in "non-scientific" tests like on my labtop (which is also doing assorted other stuff in other processes, even if not kept busy by me wittingly). so maybe it is even less or nothing (but probably effect is true. repeated tests showed similar results)

and maybe we should talk about the remaining 10% only after you have found the other 290% that have crept in relative to 93v-?

krader1961 commented 4 years ago

@jghub I suggest you read https://github.com/att/ast/issues/42. You seem to think I'm actively trying to destroy ksh as a viable shell for writing scripts in the 21st century.

I've done some carefully controlled tests using my Ubuntu 16.04 VM. The /bin/ksh on that distro runs the test script in 1.7 user seconds (not real-time since that is an unreliable metric on a VM). The ksh93v- I built took 2.5 user seconds. So already we're looking at a 47% increase before the current team has made any changes. The source at commit 02ca6c6cca (which introduced support for Meson), and where very few changes had been made since ksh93v-, shows the same execution time.

Bisecting the code shows there are two big bumps (i.e., > 1s). The first is commit bb57b0b012a965a07f4099ccdffab1f182968356 that removed the hard-coded -Os compilation flag. So, obviously, we need to figure out how to coerce Meson to favor code size when compiling or reinstate that hard-coded flag.

are there performance benchmarks in the test suite at all?

No. The AST team did not include any performance regression tests in the test suite we inherited, AFAICT. Such tests should be added; however, doing so is non-trivial to avoid so many false negatives that the tests are pointless.

and maybe we should talk about the remaining 10% only after you have found the other 290% that have crept in relative to 93v-?

@jghub I welcome any issue you open, including this one, which documents an apparent bug, performance regression, or any other problem. However, your concerns would be better received if your comments didn't have so much anger. You seem to think I'm trying to destroy the usability of the Korn shell. I started using ksh88 somewhere around 1990. It was far better than sh, csh, and tcsh when its interactive behavior and scripting capabilities were considered as a whole. That is the sole reason I am contributing to this project.

jghub commented 4 years ago

@jghub I suggest you read #42. You seem to think I'm actively trying to destroy ksh as a viable shell for writing scripts in the 21st century.

I have seen that thread, among others. all confirm that you received the same feedback from several people from the very beginning: have more respect for the code base, proceed with care. more like a neuro surgeon than like a warrior taking no prisoners.

you never listened, let alone changed your attitude and approach. your reactions were showing a consistent inability or lack of interest to appreciate the details of what others said and an uncompromising desire to interfere heavily with the code base to make a "tidy up" no matter what the consequences and without understanding the code structurally and in its entirety. add to this an objectionable loudly voiced disregard for the capabilities of the creator of ksh and adolescent showing off in your bashing of his coding style.

so, no I do not believe you are actively (in the sense of intentionally) trying to destroy ksh. I do believe you have actively (in the sense of being the causal factor) and severely damaged ksh already: and the results after two years of this activity are as they are and speak for themselves for everybody to judge by himself: run ksh93u+ or ksh93v- against ksh2020 and compare: features, user-facing/visible buggy behaviour in interactive and scripting use, performance.

ksh2020 loses that comparison on all fronts AFAICS (even taking into account that, indeed, you fixed some bugs in TAB completion and such). this is a shame and a pity. but all of the above reiterates only what you have been told so many times before. so this really makes no sense.

I still hope for a chance to see in this project or elsewhere a "reset" to make a carefully curated bug-fix release of ksh93u+ (or the original ksh93v-): that would be a much better shell than ksh2020 in my view.

I've done some carefully controlled tests using my Ubuntu 16.04 VM. The /bin/ksh on that distro runs the test script in 1.7 user seconds (not real-time since that is an unreliable metric on a VM). The ksh93v- I built took 2.5 user seconds. So already we're looking at a 47% increase before the current team has made any changes.

I can not confirm this on OSX (not the least important unix-like platform out there...). here are the results for a couple more micro benchmarks

` /bin/ksh ksh93v- ksh2020 braces.ksh 0m6.18s 0m6.37s 0m16.24s gsub.ksh 0m0.05s 0m0.03s 0m0.03s numfor.ksh 0m0.67s 0m0.79s 0m2.35s numloop.ksh 0m0.62s 0m0.65s 0m1.28s parens-for.ksh 0m0.25s 0m0.11s 0m0.24s parens.ksh 0m0.25s 0m0.12s 0m0.25s plus-equals.ksh 0m0.12s 0m0.14s 0m0.18s

93u+ and 93v- are practically equally fast. no way is 93v- 47% slower on OSX than 93u+ (I don't see a good reason why it should be on ubuntu, but cannot test that right now). if at all Korn did make 93v- faster not slower than 93u+, mostly.

"numfor.ksh" is the previously reported behaviour incrementing a counter in a numeric for loop: ksh2020 slower by a factor or 3. as already reported.

"numloop.ksh" does the same but with a while loop. observation: 93u and 93v do both loops nearly equally fast. ksh2020 breaks that and makes, paradoxically, the numeric for loop slower by a factor of two. so: ksh2020 slower by a factor of 2 in this case in comparison to 93u and v.

"braces.ksh" reads `

function f { echo "hello from f" } i=0 max=1000000 while (( i < max )); do out=${ f; } (( ++i )) done

and tests efficiency of doing command substitution via ksh93's important ${ command; }' syntax that executescommand' in the current shell rather than a subprocess.

comparison with "numloop.ksh" (which is the same thing w/o the command substitution) was slower by a factor of two. and doing the `${ command; }' substitution slows it down by a further factor 1.5 nearly: ksh2020 here is overall slower by nearly a factor of three again.

the factor of two worse performance relative to 93v- is there in some other benchmarks as well as you can see. it is quite consistent.

so for my platform I challenge your claim that 93v- was already about 50% slower than 93u+ and I maintain that ksh2020 is "guilty" of the complete slow down, not only part of it.

The source at commit 02ca6c6 (which introduced support for Meson), and where very few changes had been made since ksh93v-, shows the same execution time.

then something happened already before that point it seems.

Bisecting the code shows there are two big bumps (i.e., > 1s). The first is commit bb57b0b that removed the hard-coded -Os compilation flag. So, obviously, we need to figure out how to coerce Meson to favor code size when compiling or reinstate that hard-coded flag.

see above: it seems your starting point is already somehow flawed. but if you can locate later points where damage was done and revert the changes or fix it otherwise it would be a step forward. but not the complete solution to the problem.

No. The AST team did not include any performance regression tests in the test suite we inherited, AFAICT. Such tests should be added; however, doing so is non-trivial to avoid so many false negatives that the tests are pointless.

as with all testing, you test only what you test. but that makes it not produce false positives (that's what you mean, actually, not negatives, but anyway) and not pointless. the above micro-benchmarks test behaviour in typical "hot loops" which are the rate limiting step in many a larger application. what I see in these micro-benchmarks I also see in more serious scripts doing some "heavy lifting" when I compare, for instance bash5 from 2019 vs. ksh93u+ from 2012: ksh93 is faster typically by an order of magnitude, sometimes "only" 3x faster, sometimes 20x.

basically your reaction here could be added to all the other instances. you tend to try to bypass the real issue: in your massive overhaul of the code base it should have been mandatory to run whatever performance benchmark you like to verify continuously that you don't slow it down by your changes. (I still hope it is "only" compiler optimisation flags and not your replacing the memory stuff for the sole reason to make it "standard"...). that you did not notice at all what was happening to performance is alarming.

@jghub I welcome any issue you open, including this one, which documents an apparent bug, performance regression, or any other problem. However, your concerns would be better received if your comments didn't have so much anger. You seem to think I'm trying to destroy the usability of the Korn shell. I started using ksh88 somewhere around 1990. It was far better than sh, csh, and tcsh when its interactive behavior and scripting capabilities were considered as a whole. That is the sole reason I am contributing to this project.

"apparent bug". not "obvious"?

"anger": I don't feel any of that I can assure you. concern: yes -- I just would have wished for a responsible more careful/cautious and more considerate curator of ksh93. if this were only your private playground, nobody would care (me included).

but the misleading claim that this project is the official future of "ksh as a viable shell for writing scripts in the 21st century." -- which is sort of "knighted" by this happening in the ATT/AST repo rather than in your fork of that repo -- makes it a serious problem in my view.

these are my concerns as well as those of quite some people before me (most of them have long ago given up posting anything here for obviously arriving at the insight that it is pointless to argue with you about these things). you don't receive any of this serious and well-intentioned feedback (for how to maintain ksh, that is) well since you confuse the project with your person. I have got that.

You seem to think I'm trying to destroy the usability of the Korn shell. I started using ksh88 somewhere around 1990. It was far better than sh, csh, and tcsh when its interactive behavior and scripting capabilities were considered as a whole. That is the sole reason I am contributing to this project.

see somewhere above: I already answered to that. I firmly believe you are misguided in your approach, not in your intentions. you clearly overestimate your competence (at the very least in comparison to the creator of ksh93) and this causes harm. my prediction is that you will not change your attitude and will proceed with the same self-righteous disregard for the code base and voiced concerns you exhibited from the start.

I would be happy if you could proof me wrong here, of course, and all would be good in due time: ksh2020 really restoring the level of stability and performance ksh93u+ is known for. ksh2020 currently is not a step forward but a step back, unfortunately (to reiterate: despite my appreciation for actually having fixed a couple of minor bugs in TAB completion and output of `typeset -f' -- although breaking the latter completely in the release itself).

I also propose to stop this discussion ("how should ksh93 really be maintained? has harm been done or not?") in this issue. let's go back to just addressing the reported bug:

ksh2020 is slower by a factor of 2-3 in critical areas than ksh93u+ and ksh93v-. there must be reasons for this. it definitely is not the fault of 93v- vs. 93u+. so it is your fault. you should accept that. whether you fix it, is your decision, obviously.

jghub commented 4 years ago

repeated benchmarks under 5.0.0-29-generic #31-Ubuntu SMP Thu Sep 12 13:05:32 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

` name /bin/ksh ksh2020 (head of master) braces.ksh 0m3.77s 0m7.00s gsub.ksh 0m0.13s 0m0.11s numfor.ksh 0m0.60s 0m1.76s numloop.ksh 0m0.51s 0m1.10s parens-for.ksh 0m1.06s 0m0.10s parens.ksh 0m1.15s 0m0.11s plus-equals.ksh 0m0.06s 0m0.11s

/bin/ksh is ksh93u+ here as well. compared to what I see under OSX, the reported performance breakdowns are basically identical. a strange difference (this time in advantage of ksh2020) is seen in the 2nd and 3rd to last benchmarks. this must be an architecture thing I presume. ksh93v- not available on ubuntu for me (so not able to clarify whether 93v- already fixed this or whether this is a real achievment of ksh2020. background: parens.ksh is identical to braces.ksh except for using $(f) in place of ${ f;} and 1/100 of loops. so what this says, is: on ubuntu ksh93u+ is slow with standard command substitution and the "inline" command substitution is much (much!) faster

take home message: in the cases reported previously (benchmarks numfor, numloop, braces) for OSX, the situation is basically the same for linux/ubuntu: ksh2020 slow down by a factor of 2-3 (but slightly less pronounced than under OSX (this of course could be due to apple doing their own minor optimizations to ksh93u+...).

krader1961 commented 4 years ago

ksh2020 is slower by a factor of 2-3 in critical areas than ksh93u+ and ksh93v-. there must be reasons for this. it definitely is not the fault of 93v- vs. 93u+. so it is your fault. you should accept that. whether you fix it, is your decision, obviously.

I realize my comment you were replying to wasn't as clear is it could have been about the main cause of the performance decrease of your micro-benchmark. But I thought it was reasonably clear that not hardcoding the -Os compiler flag causes a significant increase in run-time; i.e., a significant decrease in performance. The intent of the offending change was to not interfere with various "build types" to use a Meson term. I thought we had documented the need to configure the build with meson --buildtype=minsize if you wanted a production binary but can't find any evidence we did so. At this stage I think we should change the default build type from "debug" to "minsize". Then explicitly configure debug builds in the Travis CI and document this for people who might be doing builds they want to be able to debug with gdb, lldb, etc.

The implicit change in build policy from production to debug was reasonable, and useful, during the past two years. But when the 2020.0.0 release occurred we didn't appreciate that the "debug" default build type would be an issue since there are no performance tests. So, as I stated above, we should change the default build type to "minsize" and expect environments where we need more debugging info to explicitly use that build type.

jghub commented 4 years ago

just tested this. looks better but definitely not good enough:

`

name /bin/ksh ksh2020def ksh2020minsize

braces.ksh 0m3.87s 0m9.17s 0m6.11s

numfor.ksh 0m0.70s 0m2.18s 0m1.36s

numloop.ksh 0m0.62s 0m1.24s 0m0.61s

` the last one (incrementing while loop) now is back to normal (good).

the ${ ...;} command substitution (first benchmark, using a while loop for iteration, so the numbers now -- since "while" back to normal -- are purely related to the command substitution call) still needs about 160% more time. an unacceptable "regression" in my view. (I reiterate that 93v- is not slower than ksh93u+: these are ksh2020 issues)

the 2nd one (incrementing for loop) needs 200% more time. an unacceptable "regression" in my view.

so the situation seems improved by changing compiler flags although this is only responsible for about "one half" of the observed performance breakdown.

regarding "your micro benchmark": note that there are different ones. at least two different problems (numeric for and ${... ;} command substitution) have surfaced.

I also now have benchmarked a real world ksh script/program (800 LOC) doing a mixture of data i/o, numerics and string processing: slow down by 140% with the "minsize" compiled variant, 175% with the default compile. that special program does neither use numeric for nor much ${ ...;} so the performance reduction is probably a rather generic problem.

but please accept that even the micro benchmarks capture the essence of it: you are still (with minsize) facing a typical performance reduction of about 150%, possibly more.

krader1961 commented 4 years ago

@jghub Where are these scripts published; i.e., available for download:

braces.ksh 0m6.18s 0m6.37s 0m16.24s
gsub.ksh 0m0.05s 0m0.03s 0m0.03s
numfor.ksh 0m0.67s 0m0.79s 0m2.35s
numloop.ksh 0m0.62s 0m0.65s 0m1.28s
parens-for.ksh 0m0.25s 0m0.11s 0m0.24s
parens.ksh 0m0.25s 0m0.12s 0m0.25s
plus-equals.ksh 0m0.12s 0m0.14s 0m0.18s

I can't figure out how the few code snippets you have included in comments relate to those script names or run times. I'd bet that some of the discrepancy in performance is due to a deliberate decision to not enable posix_spawn() by default. See issue #468. Using that API is enabled by meson -Duse-spawn=true when configuring the build. It was enabled by default in ksh93v- and ksh93u+ but is currently disabled by default.

P.S., Enabling posix_spawn() with the --build_type=minsize config flag actually decreases the performance of the micro benchmark in your opening comment.

jghub commented 4 years ago

as explained, these are really very simple benchmarks, each testing essentially a single property:

#!/bin/ksh

cat >numfor.ksh<<'EOF'
function repeatloop {
  typeset -i count=$1
  typeset -i i=0 buf=0
  for ((i = 0; i < count; i++)); do
     ((buf++))
     :
  done
  [[ $buf == $i && $buf == $count ]]  # assertion
}
count=1000000
repeatloop $count

EOF
cat >plus-equals.ksh<<'EOF'
i=0 s=
while ((i < 30000)); do
  ((++i))
  s+=$i
done

# assertion
[[ $s    == 12345678910*9029991299922999329994299952999629997299982999930000 && 
   ${#s} == 138894 ]]
EOF
cat >numloop.ksh<<'EOF'
function whileloop {
  typeset count=$1
  typeset -i i=0 buf=0
  while (( i < count)); do
     ((++i))
     ((++buf))
  done
}
count=1000000
whileloop $count
EOF
cat >gsub.ksh<<'EOF'
# note that jot and rs are unlikely to be present on linux

if [[ -n $BASH_VERSION && ${BASH_VERSINFO[0]} < 4 ]]; then
  echo "! skipping pathologically slow benchmark in bash $BASH_VERSION" 1>&2
  exit 1
fi

x=$(jot -r -c 80000 a z | rs -g0 0 80)
y=${x//a/A}

[[ ${#y} == 80999 && $y == *A* && $y != *a* ]]
EOF
cat >braces.ksh<<'EOF'
function f {
  echo "hello from f"
}

i=0 max=10000
while (( i < max )); do
  out=${ f; }
  (( ++i ))
done

# assertion
[[ $i == $max ]]
EOF
cat >parens-for.ksh<<'EOF'
function f {
  echo "hello from f"
}

max=10000
for (( i = 0; i < max; i++ )); do
  out=$(f)
done

# assertion
[[ $i == $max ]]
EOF
cat >parens.ksh<<'EOF'
function f {
  echo "hello from f"
}

i=0 max=10000
while (( i < max )); do
  out=$(f)
  (( ++i ))
done

# assertion
[[ $i == $max ]]
EOF

save the above script to the directory where you want the benchmark files. save it as, say, "benchmarks". than do ksh benchmarks in that directory which will generate the set of benchmark files using the indicated names in the HERE documents.

time them as "time ksh2020 numfor.ksh" etc. or write a small wrapper to run them in turn.

adjust the loop counts to sensible values (the above are too low, probably) so that the run time is not too small (order of 1 sec for each benchmark, say).

on linux you might need to additionally install "jot" and "rs" since they are not available by default (OSX is fine, though) and are used to generate input data in one benchmark.

jghub commented 4 years ago

update: as of commit g5abcbd06 (current head) performance of ksh2020 remains down by a factor of 2-3 compared to ksh93u+ and ksh93v- in several critical features like incrementing numeric for and while loops (but other things like braced command substitution or string concatenation also see a slow down by a factor of 1.5-1.7). remarkably, this is now is true again for incrementing while loops as well which intermittently seemed to have profited from changed default compiler flags (buildtype "minsize").

in any case, this rather generalized substantial performance breakdown is not cured by the "minsize" flag (my understanding is, that it is now on by default, right?). so there is something severely broken, probably quite early in this project. why is this not bisected more thoroughly than what krader did a couple of weeks ago (and being done, it seems, with changing the buildtype...)?

@siteshwar: when will this issue be addressed? it is paramount to maintain ksh93's performance. it is one of the most obvious and most relevant advantages of ksh over other shells. much easier to understand for potential new users than advantages of the additional features of KornShell language.

performance should be restored fully before going on in my view. it will not become easier further down the road.

dannyweldon commented 4 years ago

To pipe in here, I do remember there being several changes to what looked to me like critical tight loops to simplify the code with I assume the expectation that the compiler would optimise it. Perhaps these could be investigated and tested with these benchmarks. And is any compiler optimisation actually happening currently?

jghub commented 4 years ago

@dannyweldon, @siteshwar :

FYI information, I did some further benchmarking, also adding one for recursive function calls. result:

---------------------------------------------------------------------
name             /bin/ksh  ksh93v-  ksh2020nms  ksh2020ms  ksh2020rel
---------------------------------------------------------------------
braces.ksh       0m0.26s   0m0.29s  0m0.68s     0m0.37s    0m0.40s
fibonacci.ksh    0m0.38s   0m0.53s  0m1.06s     0m0.61s    0m0.69s
gsub.ksh         0m0.21s   0m0.11s  0m0.12s     0m0.11s    0m0.12s
numfor.ksh       0m0.69s   0m0.83s  0m2.25s     0m1.42s    0m1.53s
numloop.ksh      0m0.65s   0m0.67s  0m1.29s     0m0.65s    0m0.76s
parens.ksh       0m0.27s   0m0.13s  0m0.24s     0m0.15s    0m0.17s
plus-equals.ksh  0m0.12s   0m0.14s  0m0.21s     0m0.13s    0m0.14s
---------------------------------------------------------------------

explanation:

run on OSX
/bin/ksh is ksh93u+
ksh2020nms is today's tip of master compiled w/o --buildtype=minsize, ksh2020ms is compiled w/ that flag
ksh2020rel is the release as provided by Macports for OSX

the following can be concluded

--buildtype=minsize is beneficial (but I have no idea whether and how compiler optimizations are active already - the question of @dannyweldon was not answered so far)
ksh2020ms (and ksh2020rel, seemingly compiled the same way) is notably slower than 93v- at the 20-30% level most of the time but by 85% (a factor of 1.85 slower, that is) for a simple numeric for loop doing exactly the same thing as the following benchmark ("numloop.ksh") which uses a while loop for iteration. this is really cause for special concern. but even sacrificing 30% performance should be avoided by all means.
it is thus a proven fact that krader is erroneously maintaining (most recently here: https://groups.google.com/forum/#!topic/korn-shell/t3yQjw82m1Q) that there is "almost no" slow down of ksh2020 relative to the starting point of this project (93v-). quite to the contrary, the slowdown is notable (at the 20-30% level, partly rather dramatic (85%)). the reasons for this should be investigated.
the situation is even worse (in my view, anyway) if one asks the ultimately more relevant question "how does ksh2020 compare to ksh93u+?" as inspection of the above table shows, there are a few instances where ksh2020 is faster than ksh93u ((because it did not completely undo the speedup achieved by 93v- relative to 93u+ for the respective benchmarks...), specifically this concerns global string substitution and $(...) command substitution/subshell spawning. but overall, performance clearly is much inferior to 93u+ and the performance reduction approaches a factor of 2 for recursive function calls ("fibonacci.ksh") and exceeds a factor of 2 for numerical for loops.

apart from everything else that is going wrong in this project (and that's quite a lot, unfortunately): I do hope that at least the existence of an underperformance issue can now be acknowledged by everyone involved and hopefully be addressed by looking for the root cause(s) rather than declaring it "solved" by using --buildtype=minsize.

krader1961 commented 4 years ago

I scaled the "braces" micro-benchmark by 50 so it has a meaningful run time on my system. I then ran it five times at the following commit points and picked the best (fastest) user mode CPU time from each run. This is the result:

HEAD~100  6.917
HEAD~200  6.928
HEAD~300  6.697
HEAD~400  6.553
HEAD~500  7.014
HEAD~600  6.741
HEAD~700  6.818
HEAD~800  6.929
HEAD~900  6.761
HEAD~1000  6.991
HEAD~1100  6.763
HEAD~1200  6.523
HEAD~1300  6.554
HEAD~1400  6.494
HEAD~1500  6.702
HEAD~1600  6.604
HEAD~1700  6.624
HEAD~1800  6.600
HEAD~1900  6.630
HEAD~2000  6.506
HEAD~2100  6.725
HEAD~2200  6.837
HEAD~2300  6.763
HEAD~2400  6.665
HEAD~2500  5.302
HEAD~2600  0.0
HEAD~2700  5.229
HEAD~2800  5.190
HEAD~2900  5.691
HEAD~3000  5.514
HEAD~3100  0.0
HEAD~3200  4.883

Notice the big increase between HEAD\~2400 and HEAD\~2500. It turns out the increase is due to commit 489c672d to "Replace calls to AST Vmalloc with system malloc". See issue #396. We're certainly not going to reinstate AST Vmalloc. There is simply too much value in being able to use tools like Valgrind and ASAN which is only possible if we're using the system malloc. And in theory there shouldn't be any reason why using the system malloc is noticeably slower so this needs further investigation.

For comparison, on the same system and using the same benchmark script the distro ksh93u+ binary clocks in at 4.668 CPU seconds and ksh93v- (which uses AST Vmalloc) at 5.031 CPU seconds. A 8% increase.

P.S., I never "declared it "solved" by using --buildtype=minsize." I said that addressed most of the performance decrease. A debug build runs the benchmark in 11.458 CPU seconds. Which means a debug build accounts for ~4.5 seconds of the increased run time versus ~1.7 seconds due to switching to the system malloc. That is, a debug build causes a slow down 2.6x that of switching to the system malloc.

krader1961 commented 4 years ago

A preliminary comparison of the execution profile of the two commits shows that dtclose() and dtopen() are quite a bit slower using the system malloc subsystem compared to the AST malloc subsystem. A lot more analysis will be needed but I'm willing to bet a small amount of money that changing memoryf() in src/lib/libast/cdt/dtnew.c to use realloc() rather than vmresize(), and the related changes in that module, is a part of the reason for the slow down. This is based on a review of commit 489c672 in light of what the run-time profiling reveals. Specifically, the old code relied on the use of "vmalloc regions" to optimize the layout of CDT data structures. Presumably to maximize CPU cache utilization. Whether equivalent performance can be achieved without relying on non-standard features of the AST malloc subsystem is an open question.

krader1961 commented 4 years ago

The treeevent() function in src/lib/libast/cdt/dttree.c is an order of magnitude slower using the system malloc because it is freeing a lot of very tiny allocations when called from dtclose(). Even worse, it's doing so via an indirect (*dt->memoryf) call. Which might have been a useful abstraction in the past but is now effectively a constant invocation of dtmemory().

jghub commented 4 years ago

first of all: I welcome any initiative to finally start looking into the issue at a technical level (instead of dodging the issue and acting silly when faced with criticism). if, in the process, (at leat part of) the damage done to the code can be reverted, this would be progress. now to the details (in separate posts to avoid it becoming too unwieldy).

jghub commented 4 years ago

see above...

setting the historical/factual record straight.

your claim above:

I never "declared it "solved" by using --buildtype=minsize." I said that addressed most of the performance decrease.

fact: your statement I referred to when saying that was verbatim:

most of what was said about compatibility and performance slow down is incorrect. Almost all of the slow down is due to not building with --buildtype=minsize. It has almost nothing to do with changes made in the past two years.

this is a paraphrase of "issue solved" for me (since it allegedly has "almost nothing" to do with the code).

looking at the numbers for the whole set of benchmarks (see the table I provided https://github.com/att/ast/issues/1449#issuecomment-579311448) I think the conclusions I have drawn from the table in that post are correct. specifically 20-30% (and up to 85%) slow down of ksh2020ms vs. 93v- is not adequately described by "almost nothing".

jghub commented 4 years ago

memory management: you say

Notice the big increase between HEAD~2400 and HEAD~2500. It turns out the increase is due to commit 489c672 to "Replace calls to AST Vmalloc with system malloc".

this was my suspicion and that of knowledgable colleagues right from the start: "that has to have to do with changing the malloc stuff". you also say

"We're certainly not going to reinstate AST Vmalloc" etc.

that should not be decided by you alone, hopefully. if it does turn out that the malloc stuff is the cause of all (partly massive) slow down, this would need reassessment. there is no justification whatsoever to sacrifice the excellent performance achieved by Korn et al over 20 years or so of work at ksh93.

of course I also would expect that other changes you have made to the code also hurt performance -- but this is a conjecture and it would be necessary to very very thoroughly look into each single benchmark, notably those which behave especially bad, and to restore performance to what is was before you started (at least: then there would still be the task to achieve a performance comparable to that of 93u+, which frequently, but not always, is noticeably faster than 93v-)

jghub commented 4 years ago

since I do care very much for the performance of ksh (as, I am sure, many ksh users will), I wrote a few further benchmarks. a typical comparison (same conditions as here: https://github.com/att/ast/issues/1449#issuecomment-579311448) looks like this (ksh2020 w/o minsize, ksh2020ms w/ minsize, yesterday's tip of master)

---------------------------------------------
name             /bin/ksh  ksh2020ms  ksh2020
---------------------------------------------
braces.ksh       0m0.37s   0m0.60s    0m0.98s
fibonacci.ksh    0m0.36s   0m0.66s    0m1.08s
gsub.ksh         0m0.17s   0m0.11s    0m0.11s
numfor.ksh       0m0.65s   0m1.45s    0m2.14s
numloop.ksh      0m0.57s   0m0.61s    0m1.18s
parens-for.ksh   0m0.19s   0m0.09s    0m0.15s
piln2.ksh        0m1.20s   0m1.36s    0m2.76s
plus-equals.ksh  0m0.09s   0m0.10s    0m0.12s
qsort.ksh        0m1.84s   0m2.05s    0m3.83s
rand.ksh         0m2.36s   0m2.60s    0m4.80s
sample.ksh       0m1.61s   0m2.00s    0m3.34s
shuffle.ksh      0m2.54s   0m3.40s    0m6.23s
---------------------------------------------

everybody can do his own ratios but the "lowlights" are:

braces is 62% slower in ksh2020ms (and ksh2020nms is another 63% slower than ksh220ms which again demonstrates that the minsize flag is only about "half of the slowdown story" )
fibonacci 83% slower,
numfor 123% (or a factor of 223%, if you like) slower,
shuffle 34% slower.

note that these are totally "random" benchmarks not selected/constructed to demonstrate ksh2020 deficiencies. there might be better benchmarks, so proposals would be welcome.

the source code of most of these benchmarks was already listed here: https://github.com/att/ast/issues/1449#issuecomment-559871302

the code for the newer ones follows:

# ===========
# fibonacci.ksh
# ===========
function fib {
   typeset -i n=$1
   if ((n < 2)); then
      echo $n 
   else
      typeset -i f1 f2
      f1=$(fib $((n - 1)))
      f2=$(fib $((n - 2)))
      echo $((f1 + f2))
   fi
}
function fibs {
   typeset -i n=0 res nmax=$1 verbose=$2
   while ((n <= nmax)); do
      res=$(fib $n)
      ((verbose)) && echo $res
      ((++n))
   done
}
fibs 19

# =======
# piln2.ksh
# =======
function piln2 {
   typeset -lE x=0.0 y=0.0 z=0.0 pi ln2
   typeset -i i loopcount=$1 report=$2
   for ((i = 1; i <= loopcount; i ++)); do
      ((x += 1.0/(i * i)))
      ((z += -1.0**i/i))
   done
   pi=$(print -f "%.6f\n" -- $((sqrt(6 * x))))
   ln2=$(print -f "%.6f\n" -- $((-z)))
   if ((report)); then
      print $pi
      print $ln2
   fi
   ((pi == 3.141592 && ln2 == 0.693147)) || return 1
}
piln2 1000000

# =======
# qsort.ksh
# =======
function qs {
   typeset vals=$1 dum
   typeset -i lo=$2 hi=$3 i j pivot 
   typeset -a ara
   i=$lo
   j=$hi
   while ((i < hi)); do
      ara=($vals)
      pivot=${ara[$(((lo + hi)/2))]}
      while ((i <= j)); do 
         while ((ara[i] < pivot)); do ((i++)); done
         while ((ara[j] > pivot)); do ((j--)); done
         if ((i <= j)); then
            dum=${ara[i]}
            ara[i]=${ara[j]}
            ara[j]=$dum
            ((i++))
            ((j--))
         fi
      done
      vals=${ara[@]}
      if ((lo < j)); then
         vals=$(qs "$vals" $lo $j)
      fi
      lo=$i
      j=$hi
   done
   echo $vals
}
function qsort {
   typeset -i n=$1 m=$2 report=$3
   typeset res
   ((m > 0)) || m=$n
   typeset vals=$(jot -r $n 1 $m)
   res=$(qs "$vals" 0 $((n-1)))
   ((report)) && echo $res || return 0
}
qsort 2000 10000

# ======
# rand.ksh
# ======
function rand {
   typeset -i rnum rmin rmax
   typeset -i i count minhit maxhit=0 n=$1 report=$2
   typeset -A buf
   for ((i = 1; i <= n; i++)); do
      ((buf[$RANDOM]++))
   done
   minhit=$n
   for rnum in ${!buf[@]}; do
      count=${buf[$rnum]}
      if ((count < minhit)); then
         minhit=$count
         rmin=$rnum
      fi
      if ((count > maxhit)); then
         maxhit=$count
         rmax=$rnum
      fi
   done
   ((report)) && echo ${#buf[@]}/$n,$rmax:$maxhit,$rmin:$minhit || return 0
}
rand 1000000

# ========
# sample.ksh
# ========
function sample {
  typeset -i m=$1 n=$2 report=$3 i num rnum hit i
  float runif
  typeset nums=""
   for ((i = n - m + 1; i <= n; i++)); do
      runif=$((RANDOM/32767.000000001))
      rnum=$(( 1 + i * runif ))
      hit=0
      for num in $nums; do
         ((rnum == num)) && { hit=1; break; }
      done
      ((hit == 1)) && nums+=$i || nums+=$rnum
      nums+=" "
   done
   ((report)) && print -- $nums || return 0
}
sample 3000 1000000

# =========
# shuffle.ksh
# =========
function shuffle {
   typeset entries=$(seq 1 $1)
   typeset -i index j report=$2
   typeset ara dum
   typeset -E runif
   set -A ara $entries
   typeset n=${#ara[@]}
   for ((i = n - 1; i >= 1; i--)); do
      runif=$((RANDOM/32767.0000000001))
      j=$(( (i + 1) * runif ))
      dum=${ara[j]}
      ara[j]=${ara[i]}
      ara[i]=$dum
   done
   ((report)) && print -- "${ara[*]}" || return 0
}
shuffle 300000

how to use:

as called above, output is suppressed to not interfere with time output.
a trailing argument of 1 in the call, eg. qsort 10 100 1, prints the results to stdout
what the benchmarks do:
- fibs 19: the first 19 entries of the fibonacci series using the (slooow) recursive algorithm (so should test recursive function calls)
- piln2 1000: compute pi and ln(2) from first 1000 elements of infinite series representations. should test mostly floating point arithmetic
- qsort 10 100: draw 10 random samples from [1, 100] and quicksort them. mostly shuffles an indexed array.
- rand 1000: draw 1000 random integers from range [0,32767] using $RANDOM and look for "hash collisions" (same "random" number drawn multiple times). since $RANDOM is unsigned short int, it is a rather small set of distinct values.... should test associative array access.
- sample 10 100: 10 random samples from the range [1,100]. mostly floating point calculations and string concatenation
- shuffle 10: random permutation of the seq 1:10. shuffles elements of an indexed array (more heavily than qsort with the default settings). (note that, both, sample and shuffle used $RANDOM so are flawed as a real solution to the specified tasks (one would need a real random number generator for that) but are OK as benchmarks or for small sample sizes and sets (<<32767) from which the samples are drawn)

I do hope these additional benchmarks demonstrate that the performance reduction of ksh2020 vs. ksh93u+ can be seen in many places (and where it is not seen, it is the merit of ksh93v-, not ksh2020).

my best guess is that what is seen here is a combination of the malloc replacement and "code tidy up" in places that better had been left alone.

the especially strange behaviour of numfor (which is basically an empty numerical for loop just doing the loop incrementing) in comparison to replacing it by a while loop (numloop) might be a good starting point to test this conjecture.

but braces (i.e. braced command substitution) deserves addressing as well of course.

krader1961 commented 4 years ago

Wow! That anyone thinks ksh is the appropriate tool for tasks such as computing the value of pi, other than in a "can this be done" way, blows my mind. I have to wonder how many languages that person has mastered.

krader1961 commented 4 years ago

@jghub You wrote

this was my suspicion and that of knowledgable colleagues right from the start: "that has to have to do with changing the malloc stuff".

Care to provide some proof of that statement? Had that been your "suspicion", and that of your "knowledgable colleagues", why didn't you mention it when you opened this issue? You wrote "consider a stupid micro benchmark" in your original problem statement and I'm inclined to agree.

jghub commented 4 years ago

@krader1961 :

Wow! That anyone thinks ksh is the appropriate tool for tasks such as computing the value of pi, other than in a "can this be done" way, blows my mind. I have to wonder how many languages that person has mastered.

need I explain to you the meaning of the word "benchmark"? do you object against the benchmark computing something where the correct result is known a priori and can be asserted? what is your problem except that ksh2020 is overall much slower than ksh93u+ (although"only" 13% in this special floating point benchmark)? it is quite obvious that you don't want to admit that the slowdown is your fault (you haven't). you would prefer to deny existence of the problem (you have). you also might consider to deny it's relevance: actually, I am a bit disappointed that you have not yet done that ("ksh2020 is much slower than ksh93u+, but nobody wants a fast shell anyway since it is only a command interpreter like fish and ksh is not a serious scripting language etc etc etc pp").

Care to provide some proof of that statement?

care to adjust your tone? care to just read through this very thread? https://github.com/att/ast/issues/1449#issuecomment-559092391. hint: search for the secret word "memory".

also: care to explain to the world, why you managed to plow ahead for two years without ever caring to write your own small scripts for performance monitoring? and then making a "stable" release (build(ing) w/o the minsize flag) that was/is about a factor of 2-3 slower across the board than ksh93u+? care to give this engineering feat a grade from A to F?

so far you maintained (do you, still?) that the problem has "almost nothing to do with [code] changes made in the past two years". that is obviously completely (and wittingly) untrue. all of the remaining issues (assuming minsize as a future default build option) we are now talking about are due to code changes. whether your changes to the memory management or your "code tidy up" and your "optimisations" takes most of the responsibility for the slow down or both are equally the culprit remains to be seen.

you wrote "consider a stupid micro benchmark" in your original problem statement and I'm inclined to agree.

you do agree with what? just let the readers know.

I (and several other people) have received a clear message from you again and again: you don't want real feedback. you don't want being pointed to real problems. you don't understand them. you deny them. you don't like being caught in the act of telling falsehoods to the readers here and on the mailing list. you don't correct your attitude. you don't listen. your reactions reliably are rude and arrogant. you are fighting a fight against the code (and its creators) and against the users of ksh who are noting that something wrong is happening due to your interference and are trying to make you realise that you need to correct your approach. you are short-sighted and rather clueless in all things ksh. that will not change. I know that by now. really.

so please understand that my posts here are just documenting the problems of ksh2020 for the future. whether you acknowledge or deny them or ignore them is of secondary concern. I posted the benchmark tables and the code for people to copy+paste and see for themselves. somewhere down the road this might help to repair the damage you have done.

DavidMorano commented 4 years ago

Just a little note about VMALLOC (performance and otherwise):

I would just like to point out that VMALLOC was not removed for no reason at all. It needed to be removed in order to allow new built-in commands (plugins, whatever) to use multiple threads (among possible other reasons also). For whatever reason, the current VMALLOC implementation had some sort of problem with not being thread safe -- even though there was some kind of compile option (if I remember correctly) to make it thread safe. I was not able to investigate the current VMALLOC deficiency enough to find out what the problem was with it. So rather, a long time ago I experimented with replacing VMALLOC with the system MALLOC implementation instead (which is indeed thread-safe). Yes, it fixed the problem. Further, I am not the only one to be using multi-thread plugins, others have done it also (and they likely may have had to remove VMALLOC also).

Now about any possible performance loss due to replacing VMALLOC w/ regular system MALLOC: If VMALLOC is somehow slower than your (I repeat, your) system MALLOC, then you are free to put VMALLOC back into KSH when you built it for yourself. For myself, I value having multi-thread plugins much more than any possible (likely slight) benefit that VMALLOC might have had over regular MALLOC. Also. most systems come with at least three MALLOC versions (usually more). Usually or often of the versions of MALLOC supplied (through different libraries) there are often these three types: 1) fast but not space efficient, 2) a mix of fast and space efficient (usually the system default library implementation), and 3) slow but much more space efficient. You are TOTALLY welcome to replace your regular system MALLOC with either VMALLOC (from AST team) or with the super fact version as distributed with your distribution. If your distribution does not supply a super fast MALLOC library, then just go and hunt the web and get your own.

The writer of VMALLOC is now at Google (still is as far as I know), but I have never heard of him fixing up VMALLOC to be properly thread-safe (from whatever deficiency it currently suffers from in that regard). So I would prefer that some real system MALLOC (of whatever variety) continue to used as the default for KSH, as long as it is thread-safe (which they normally all will totally be). If somehow VMALLOC does become the default again, then I will take it upon myself to replace it (again) with a regular -- correctly operating and thread-safe -- system MALLOC (to other MALLOC implementation known to work).

But before using VMALLOC, I would suggest exploring the use of a super fast regular MALLOC replacement -- that is out there - first. VMALLOC may have had other bugs which we did not know about also (who knows).

jghub commented 4 years ago

@DavidMorano I appreciate this information. thank you. I understand that there are different use cases and metrices how to define "most suitable". and of course I don't object (why should I?) against your priorities implying that you might be willing to sacrifice performance for thread-safety (if I get that right).

but as you know, very (very!) few people will ever consider to write their own builtins. so for the overwhelming majority it is simply more relevant how fast their ksh scripts execute. the default built of ksh in my view should serve this majority. possibly, built options can be added to select the malloc system to use? I don't know much of these things...

the whole discussion here only revolves about a performance comparison of what /bin/ksh gives me and what ksh2020 provides. it is a description of the state of affairs. nothing more. it is not even a proven fact, that memory management is the main cause (it might be other code changes by krader). conclusions to be drawn might differ, depending on perspective. I for one firmly believe that performance (once achieved...) should not be given up lightly. for my purposes I am much better served by ksh93u+, therefore.

as for "VMALLOC may have had other bugs": sure, but I very rarely, if ever, encounter segfaults (or other misbehaviour that would hint at VMALLOC bugs) of ksh93u+ (and if so, they are of the stack overflow variety so far...). so it seems more of a theoretical than a concrete problem, no?

DavidMorano commented 4 years ago

@jghub

but as you know, very (very!) few people will ever consider to write their own builtins. so for the overwhelming majority it is simply more relevant how fast their ksh scripts execute. the default built of ksh in my view should serve this majority. possibly, built options can be added to select the malloc system to use? I don't know much of these things...

Yes, I agree with you above. The vast majority will not be using built-ins that are multi-threaded, at least not yet. There is -- at least at this stage of the world's usage -- little point in optimizing for that. Although most people will not write their own multi-threaded builtins, at some point pre-written multi-threaded builtins might replace some of the existing common builtins. But I still object to using VMALLOC since it demonstrated one headache already (buggy non-thread-safety). I would still advocate for some other MALLOC -- possibly optimized for speed over space efficiency. Yes, perhaps build options can, at some point, be provided to select possible MALLOC implementations (although a builder should be able to easily hack this themselves also as they desire).

the whole discussion here only revolves about a performance comparison ...

Yes, I have followed the discussion and the relatively dramatic loss in performance is very interesting. Yes, I agree that there should be some investigation into that. And, yes, it may have nothing to do w/ the replacement of VMALLOC. I would be very interested in the results of such an investigation. Also, I agree with you and others that it is a shame that there had not been some performance checks incorporated into the continuous integration environment, so that the particular commit that caused any single performance loss could have been identified. At this stage, there may have been many commits that contributed to performance losses in different ways. We all live and learn though, and will consider things like this going forward. I do feel that contributions from you and others more recently should be welcomed, because I have to say -- at least for myself -- that I had not considered that there would have been significant performance losses in this project until they were pointed out (above)!

... so it seems more of a theoretical than a concrete problem, no?

Yes, you are correct. And in all fairness, although I am suspicious of VMALLOC (because it is something that has not had wide usage in the world except for being in KSH itself), I cannot claim that I know of any other bugs in it, other than it had some sort of thread-safety bug. Yes, I freely admit my bias against VMALLOC, because it caused some hard headaches for me to track down (debug) that it was causing me problems. :-)