Compilation slows down drastically as the number of multi commands increase

kaushalmodi commented 4 years ago

Hello,

I happen to have a really large dispatchMulti: http://ix.io/2k7q/nim

In that snippet, if I do not have the subcommands after "# Commands passed directly to p4", my compilation time is 18 seconds. But with all those lines after that point, the time goes up to 54 seconds.

Each of those function looks like this (calling that same template but with a different subcommand string:

template passThruToP4(subcommand, switchesAndArgs: untyped): untyped =
  if switchesAndArgs.len == 1 and switchesAndArgs[0] in @["-h", "--help", "help"]:
    discard getP4Cmd("help", @[subcommand]).execCmd()
  else:
    discard getP4Cmd(subcommand, switchesAndArgs).execCmd()

## aliases
proc aliases(switchesAndArgs: seq[string]) =
  ## Display the content of the P4ALIASES file.
  passThruToP4("aliases", switchesAndArgs)

The compile time bump due to the subcommands added to dispatchMulti is obvious because if I comment out those lines after "# Commands passed directly to p4", time goes down from 54s to 18s.

Versions

nimterop: 0.9.45
nim: 1.3.1 built from devel today

SOLUTION

It was just that I was always building with -d:release (which happened automatically via nimble install). If I simply do nim c foo.nim, the whole project took 13.37s only.

c-blake commented 4 years ago

I don't think this is a real issue with cligen (or maybe even Nim/the Nim VM running the macro machinery). It looks like you have 96 subcommands in the slow case vs 16 in the fast case or 6x as many subcommands. The slow case is only 54/18 = 3x slower, and 54 seconds amortized over almost 100 commands is only half a second per command. So, scaling is already sublinear with number of commands which I think means "drastically" in the issue title is unwarranted.

It makes sense to be a little sublinear since those passthrough commands have no named options. So, most of the big case-of switch is not present as well as the CPU getting more time to be warmed up on a bigger task. I do not think compile time can be drastically improved without losing features. You could definitely reduce the Nim/C code footprint by unroll dispatchMulti manually to have it "route" to just one leaf dispatcher for 80 commands under different names, but we cannot do that generally or else all those "subcommand tuning" in each [] of dispatchMulti is lost. Only you in this one application know you can elide all that tuning ability.

You can see how much Nim code must get compiled if you compile with nim c -d:printDispatch and re-direct to a file. You can also see how much C code has to be built from that giant pile of Nim code by looking a the Nim cache directory. It is not at all clear to me if the (still sublinear!) slowdown comes more from the Nim side or the C side. Having your default no-danger/release mode nim.cfg/config.nims use gcc -O0 may get you a speed-up, and the amount of that speed-up could inform a bit (or inform more with more sophisticated profiling).

In terms of practical solutions, I suspect with TinyCC aka tcc you could get a big speed-up on the C side in non-optimized-edit-compile-test mode, even over gcc -O0 or clang -O0. I'm not sure if it's something just in my personal nim.cfg/tcc set up, but tcc compilation broke for me in Jan/Feb sometime. See also discussion over at https://github.com/c-blake/cligen/issues/137. I would be very interested to hear if you can get tcc working at all. I used to be able to run my test.sh which compiles and runs over 50 programs in only 2 seconds with tcc vs 20 seconds now with gcc -O1/default. So, end-to-end, Nim+tcc is about 10x faster which is what makes me bias toward a this-is-a-C-compiler performance problem.

Also, nim c -c FullyAutoMulti.nim takes about 1 second (from a cold Nim compile ~/.cache) while nim c FullyAutoMulti.nim takes 3.6 seconds. So, the C compiler is 2.6/3.6=72% of the problem..maybe much more for your 100 subcommands case.

I'm not sure there is much else I can say/do except express solidarity/hope that we can get tcc working again (we may need to bug Araq). Nim does do parallel C compiles. It might be possible to lower wall clock by having dispatchMulti put each generated dispatch in a different Nim translation unit. BUT...A) I'm not sure that would even speed things up, and B) it may well introduce scoping complexity that breaks other things or creates really hairy workarounds and C) definitely would introduce compile-time generation of new .nim files which is kind of a mess.

kaushalmodi commented 4 years ago

I believe there a room for some optimization somewhere because the whole Nim compiler compiles in lesser time than this application for me. 0.5 second for one subcommand seems a lot.

I'll ask in Nim forum if someone has any tips on how to profile compilation times i.e. find out what piece of code took how much time to compile.
I'll also experiment with different combinations of Nim and cligen versions.

I don't have tcc installed but let me see if I can find the root cause of this slowdown. It's really surprising that the same gcc compiler takes a minute to compile this little CLI app, which compiles whole GNU Emacs in 4-5 minutes.

It doesn't even seem to be a warm-up issue.. I edit one character in the code and recompile, and it still takes 1 minute.

My plan B is that I comment out most of the commands from the dispatchMulti list as I am developing new stuff in this application, but it's of course not very practical.

I'll keep you posted.

c-blake commented 4 years ago

If you just guard all those pass through commands with one big when defined(foo) then it's just nim c -d:foo which isn't so impractical. You could even flip the logical sense if you tend to forget -d:foo so you have to -d:pass to get the pass throughs.

18s is still a while, though. I'd bet with tcc that could be under/about 2s which is much more fluid in human attention terms.

Also, I think you misunderstood my warm up comment. That was to explain the sub-linear scaling - why 6X more Nim code takes merely 3X more wall time. Your alter-Nim-recompile-test is more about Nim ~/.cache efficacy. BTW - you should probably ensure that directory is on a very fast IO device, e.g. /dev/shm, via a symlink or via the --nimcache:DIR Nim compiler option.

c-blake commented 4 years ago

And you should definitely try nim c -c and at least gcc -O0 and clang -O0 to track down Nim vs C compiler vs C optimizer time.

kaushalmodi commented 4 years ago

--nimcache:DIR Nim compiler option.

That's a great tip. Let me try that today. By default the cache is in /tmp. Let me see if making that local fixes this.

c-blake commented 4 years ago

Well, I just took your ix.io link and text-editor massaged it into a bunch of proc foo() = discard with the non-empty param lists for the 10 procs with non-trivial param lists. I get just 1.3 second nim c -c compiile time from a warm Nim compile ~/.cache. On my machine it takes 20 seconds to get to a binary executable from gcc -g3 -Og but only 10 seconds with gcc -Og and then only 5 seconds with gcc -O0. There is a 91,000 lines of code C file for the main module. I think tcc could compile that easily sub-second even on a laptop and even with the implied .h include files. I've seen it go at over 4 million lines of code/sec. In short, I am pretty sure this is a C compiler/backend issue and not a Nim/cligen issue, and I'm also pretty sure that if you got tcc working you would be sub-2second edit-compile-test even without the when guards.

c-blake commented 4 years ago

Of course, it may take multiple minutes for gcc -O3 to chew on it, but that is a release/benchmark-time-only operation which is hopefully rare.

And running clang -I/usr/lib/nim/lib -w -O0 *.c -o junk inside ~/.cache/nim/c/prog took only 3.3 seconds (So, +1.3 =~ 4.6 sec total.)

@pb-cdunn had suggested Zig as a backend. I kinda doubt it'll be very competitive with tcc, but it might be a little faster than clang -O0.

I've used tcc since 2015, though, and someone should really get that working again (unless it's only broken "for me" in which case I should get my local installation working again).

kaushalmodi commented 4 years ago

Thanks, I was defaulting to run with -d:release every time. Now doing that and using -O0 should certainly be better :)

I've used tcc since 2015, though, and someone should really get that working again (unless it's only broken "for me" in which case I should get my local installation working again).

Have you filed a big report for that?

Thanks.

c-blake commented 4 years ago

Yeah. I bet with -O0 you get back down to 5 second territory or 10x boost with no change to cligen. :-)

I have not filed a bug report as I'm not certain if it's just "my" tcc (which I get from the "mob" branch at https://repo.or.cz/tinycc.git) or some more official version stamped one or all that. I strongly suspect it's a Nim issue, but it is definitely work for me to prove that it is. I also unsure what version of tcc is "officially supported" on x86_64. Also, when it first happened the error message took me straight to some XXX/TODO item. So, I had assumed whoever broke it knew, but that may be mistaken.

If compile-times are getting you down, though, then you should definitely follow up on this. Nothing I have ever seen comes close to tcc. It's always been at least 10x faster than various -O0 modes. It's "one pass" with an ELF linker built right in.

Why, to get the C error messages "bolded", I patch the program as piping the output to awk or any other colorizer slows it down by a factor of several (and it does not write a lot of output - much slowdown is just from execvp of another program!). So, in point of fact, "my tcc" is in some strict unique to me, although I would be stunned if my colorized diagnostics impact any of these failures.

c-blake commented 4 years ago

Prompted by all this discussion I looked into the error more closely and the Nim compiler emits C that does not compile on any backend, but only when cc=="tcc":

typedef signed long int int64_t;
typedef int64_t NI64;
typedef NI64 NI;
__builtin_saddll_overflow(NI a, NI b, (long long int*)NI* res) { }

In case it doesn't pop out at you, there is some kind of C type cast syntax for res inside the parameter list. This is probably specific enough to file an issue for. nim c shouldn't generate illegal C like that for any of its backend-compiler branches.

kaushalmodi commented 4 years ago

I am certainly lost when I read that code because I have only learnt a little bit more C after I picked up Nim.

A quick update: not using -d:release certainly helped even without removing those passthrough procs.

Below times are after deleting the .cache dirs for this project.

`time nim c src/p4x.nim`

13.418u 6.403s 0:13.37 148.1% 0+0k 6240+104120io 0pf+0w

`time nim -d:release c src/p4x.nim`

60.441u 6.477s 0:51.63 129.5% 0+0k 4152+116104io 0pf+0w

I cloned and installed tcc from its latest master (_tcc version 0.9.27 (x8664 Linux)), but it failed as you have seen.

> time nim --cc:tcc c src/p4x.nim   

CC: stdlib_assertions.nim
CC: stdlib_dollars.nim
CC: stdlib_formatfloat.nim
CC: stdlib_io.nim
CC: stdlib_system.nim
CC: stdlib_parseutils.nim
CC: stdlib_math.nim
CC: stdlib_algorithm.nim
/tmp/kmodi/.cache/nim/p4x_d/stdlib_parseutils.nim.c:222: error: invalid type
/tmp/kmodi/.cache/nim/p4x_d/stdlib_io.nim.c:256: error: invalid type
CC: stdlib_unicode.nim
/tmp/kmodi/.cache/nim/p4x_d/stdlib_math.nim.c:48: error: invalid type
/tmp/kmodi/.cache/nim/p4x_d/stdlib_system.nim.c:644: error: invalid type
CC: stdlib_strutils.nim
/tmp/kmodi/.cache/nim/p4x_d/stdlib_algorithm.nim.c:142: error: invalid type
CC: stdlib_pathnorm.nim
/tmp/kmodi/.cache/nim/p4x_d/stdlib_unicode.nim.c:98: error: invalid type
/tmp/kmodi/.cache/nim/p4x_d/stdlib_strutils.nim.c:228: error: invalid type
Error: execution of an external compiler program 'tcc -c -w   -I/home/kmodi/usr_local/apps/6/nim/devel/lib -I/home/kmodi/sandbox/nim/p4x/src -o /tmp/kmodi/.cache/nim/p4x_d/stdlib_io.nim.c.o /tmp/kmodi/.cache/nim/p4x_d/stdlib_io.nim.c' failed with exit code: 1

/tmp/kmodi/.cache/nim/p4x_d/stdlib_pathnorm.nim.c:68: error: invalid type

Here is that stdlib_pathnorm.nim.c.

c-blake commented 4 years ago

Actually, the above code is expanded (and trimmed) tcc -E output. The C source code looks like:

static N_INLINE(NIM_BOOL, nimAddInt)(NI a, NI b, NI* res);

So, now I suspect it's likely a bug in the tcc C preprocessor, not a bug in Nim.

c-blake commented 4 years ago

Yeah. It fails on a Nim file with just echo 0 as source. :-) If you want a simpler reproduction that only has 8,000 lines of C code.

kaushalmodi commented 4 years ago

Oh well, not using -d:release during development certainly works for me. Thanks for that tip.

kaushalmodi commented 4 years ago

Yeah. It fails on a Nim file with just echo 0 as source.

You should definitely file a bug report so that this is tracked for nim+tcc users. That would also serve as something to point to to the tcc devs.

c-blake commented 4 years ago

But THANK YOU for cloning tcc/jumping in. With my echo 0 test the failure is only at line 53 of the stdlib_io.nim.c. Stepping through tcc in gdb might even work to track down why tcc -E wedges a cast in there erroneously. If we get this working we can get your now 13.4 seconds below 2 seconds.

c-blake commented 4 years ago

tcc is just about the simplest/cleanest compiler source code I've ever read. If you ever wanted to write your own compiler all the way to machine code, it would not be a bad starting point. I may do better becoming "part of the angry mob" of tcc hackers with a proposed fix than with just a bug report. ;-) Anyway, uncertainty if the issue was with Nim or tcc also held me back from filing a Nim github issue, and in this case the uncertainty has been resolved in favor of my laziness. ;-)

c-blake commented 4 years ago

In fact, just taking that snippet I posted up in https://github.com/c-blake/cligen/issues/139#issuecomment-621813940 and feeding it to tcc -E reproduces the erroneous (long long int*)NI* res. So, that's a pretty minimal reproducing bug!

kaushalmodi commented 4 years ago

Feel free to ping here.. I'd be happy to try out the tcc patch that fixes this.

c-blake commented 4 years ago

Well, I misspoke about the tcc bug reproducer (I accidentally cut & paste the bug itself -- oops!), but

#define NIM_INTBITS 64
#include "nimbase.h"
static N_INLINE(NIM_BOOL, nimAddInt)(NI a, NI b, NI* res);

does reproduce the tcc -E problem with only like 300 lines of output. Investigating further.

c-blake commented 4 years ago

Aha...Maybe you had already noticed, but the function name is also wrong. It should be nimAddInt not __builtin_saddll_overflow. That latter thing is coming from the very bottom of nimbase.h. Only the preprocessor is mangling together the value of the last #define and the current expansion together. tcc does not have __builtin_*_overflow anyway and I suspect their lack of definition to be the root cause.

Anyway, while we should still file a tcc bug report, I got Nim+tcc compiles working again by in my ~/.config/nim/nim.cfg setting:

tcc.options.always = "-O -DNIM_EmulateOverflowChecks"

There probably is also a Nim issue to file to ensure that macro is defined in tcc-backend-mode. That let me compile test/FullyAutoMulti.nim in 741 ms with nim c -c and then another 65 ms for the C part. For a bigger test I tried that "hacked together 96 subcommand" test program I mentioned and got 1.41 seconds for the Nim part and 160 ms for the C part or 1.57 seconds total from a cold Nim compile cache. I bet the same would work for you.

To speed you along, after you install tcc in your $PATH somewhere, what I do in my ~/.config/nim/nim.cfg is default cc to tcc and then have a section that makes nim c foo.nim super fast but nim c -d:r foo.nim do maximum run-time fast release mode:

cc = tcc     # Near the top of file; make devel mode spritely
.
.
.
#After all the `hint[Xyz]` stuff
@if r:           #Allow short alias -d:r to activate fast release mode
  define:release
  define:danger
  checks:off
  panics:on
  cc = gcc
@end

c-blake commented 4 years ago

Also, at least with recent Nim-devel builds, I am getting run-time errors with some of my test/ programs that go away if I add --gc:arc. I am unsure if these are Nim bugs or tcc bugs, but I suspect Nim more on that one as I have been noticing with gcc backends that non-arc modes seem to be having more trouble. Anyway, I thought I should mention that you might want to try --gc:arc before giving up on fast tcc compilations.

kaushalmodi commented 4 years ago

I am in the config.nims camp, so I put this in ~/.config/nim/config.nims :)

# https://github.com/c-blake/cligen/issues/139#issuecomment-621883694
switch("tcc.options.always", "-O -DNIM_EmulateOverflowChecks")

.. and now ..

time nim c --cc:tcc src/p4x.nim

4.288u 1.553s 0:06.36 91.6% 0+0k 0+60344io 0pf+0w

The compilation speed is now 2x!

Though, I am seeing hundreds of these warnings:

/tmp/kmodi/.cache/nim/p4x_d/@mp4x.nim.c:129596: warning: assignment discards qualifiers from pointer target type

And then of course the binary is broken too:

Traceback (most recent call last)
/home/kmodi/.nimble/pkgs/cligen-0.9.45/cligen.nim(937) p4x
/home/kmodi/.nimble/pkgs/cligen-0.9.45/cligen/parseopt3.nim(106) lengthen
/home/kmodi/.nimble/pkgs/cligen-0.9.45/cligen/parseopt3.nim(82) optionNormalize
/home/kmodi/usr_local/apps/6/nim/devel/lib/system/gc.nim(439) newObj
/home/kmodi/usr_local/apps/6/nim/devel/lib/system/alloc.nim(786) rawAlloc
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

I am getting run-time errors with some of my test/ programs that go away if I add --gc:arc

I am betting you are seeing the same errors as above. So I also tried the --gc:arc, and while I still get those warnings, I do not get that run time crash any more.. A functional binary compiled in 6 seconds! 🎉

c-blake commented 4 years ago

Oh, and I did get a pretty small reproducer for the tcc preprocessor bug:

#define N_INLINE(rettype, name) inline rettype name
#define NIM_BOOL _Bool
typedef __INT64_TYPE__ NI64;
typedef NI64 NI;
#define nimAddInt(a, b, res) __builtin_saddll_overflow(a, b, (long long int*)res)
static N_INLINE(NIM_BOOL, nimAddInt)(NI a, NI b, NI* res);

which preprocesses to the buggy thing with (long long int*)NI* res under both gcc and tcc. So, maybe it's not a C compiler bug at all, exactly. Maybe it's a dark corner of macro expansion semantics.

kaushalmodi commented 4 years ago

tcc.options.always = "-O -DNIM_EmulateOverflowChecks"

Is this something that should be baked into the default nim.cfg that ships with Nim?

Or is that related to the tcc bug?

c-blake commented 4 years ago

You should silence C compiler warnings with -w. Araq always does that and pushes back against people who say otherwise. And those are similar to, but distinct from the run-time errors I saw. (Inside the Nim gc, illegal access, etc.). When Nim grows new GCs, bug jello gets squished. ;-)

c-blake commented 4 years ago

Well, as per my last small reproducer, I'm no longer sure it's a tcc bug as much as just an unfortunate/weird C macro thing.

Either way, Nim should not have tcc trying to use undefined __builtin_*_overflow.

Hard-wiring -DNIM_EmulateOverflowChecks into tcc options is just a workaround. Nim core should make that happen for anyone using tcc not just you & me.

c-blake commented 4 years ago

( E.g., that one line at the end of nimbase.h just needs changing to have an || defined(__TINYC__) in it..similar to the _MSC_VER handling. )

c-blake commented 4 years ago

Did you want to file a Nim issue for that last bit or do you want me to? The workaround via tcc.options.always should probably always work and not conflict with most things they might do in Nim-core-land.

The non-gc-arc runtime Nim bug could use some minimizing before reporting and may even be more than one bug. Like I said, I haven't been testing with tcc for a couple of months and multiple GCs are in active development.

kaushalmodi commented 4 years ago

You should silence C compiler warnings with -w

I am using stock tcc settings that ship with nim and the extra one that you suggested above. What other switches ~do I need to~ should I set for the tcc compiler/linker? Also how? Can you share your nim.cfg related to tcc tweaks?

kaushalmodi commented 4 years ago

From https://nim-lang.org/blog/2020/04/03/version-120-released.html , I see this:

The Nim compiler now implements a faster way to detect overflows based on GCC’s builtin_sadd_overflow family of functions. (Clang also supports these). Some versions of GCC lack this feature and unfortunately we cannot detect this case reliably. So if you get compilation errors like “undefined reference to builtin_saddll_overflow” compile your programs with -d:nimEmulateOverflowChecks.

So may be this should be the default for tcc in nim.cfg?

kaushalmodi commented 4 years ago

I'll file an issue for tcc + emulateoverflowchecks

c-blake commented 4 years ago

I would say just -w and maybe -O. tcc does not have (nor will it ever have) real optimization or __builtin_*_overflow functions. -O just does a little quick peephole optimization.

In light of tcc never growing support for those gcc/clang __builtin_*, just a simple, static || defined(__TINYC__) in nimbase.h should do the trick. Why, you could even just do a pull request instead of an issue filing. Fixing nimbase.h is definitely better than making everyone define the macro in their tcc.options.

c-blake commented 4 years ago

BTW, if you can really minimize that default-gc-bug the recent gc:orc and gc:arc hacking has introduced then I am very sure Nim core would care and try to fix it. Chances are good that the bug exists on all backends but that it's only manifesting on the tcc backend due to being "optimized away" (or something) on others. Some day a year or two from now it may be just gc:arc and deprecate everything else, but that time is not yet here.

kaushalmodi commented 4 years ago

BTW, if you can really minimize that default-gc-bug the recent gc:orc and gc:arc hacking has introduced then I am very sure Nim core would care and try to fix it.

Can you please take that up?

This was a fun rabbit hole, but need to get out of it and get back to the real world :)

Also, this was the very first time I used --gc:arc

c-blake commented 4 years ago

Sure. No problem.

pb-cdunn commented 4 years ago

I have exactly the same problem. I measure roughly 1s per dispatchable sub-command.

When I look at the top .c, I see that each dispatcher is about 700 lines of code, but it's simple code. I suspect the real problem is that each dispatcher also gets its own closure, which is another 1200 lines of code that might be rough on the optimizer.

To me, the solution is to avoid using closures.

Using tcc is not an option for me where I work. (Not my choice, and we need the excellent optimization of gcc.) I already do debug builds for my own coding, but my teammates are not quite so savvy. (I feel fortunate that they are willing to use Nim at all.) Also, I don't have control over how the unit-tests are built for pull-requests.

My work-around is to use -O0 only for the top-level module, with all the cligen code. Unfortunately, that new feature does not quite work as I think it should yet:

https://github.com/nim-lang/Nim/issues/14194

c-blake commented 4 years ago

This is really more on the C side than the Nim side. It is two levels of code generation from cligen -> Nim -> C and then the last to assembly/machine that's really costly. I'm not actually sure if the issue is closures or what. procs which has only a mere 3 subcommands takes a very long time to do an optimized compile for - 1.43 seconds for the Nim compile but 19 seconds total. There are a few closures elsewhere in procs.nim besides cligen, though not that many. Mostly there are a slew of symbols. There could be several reasons that Nim explodes into a mountain of C that makes the C optimizer grind forever. :-( I think that in general Nim is not really optimized for "backend compile time"..and optimize C compiles are definitely not.

Anyway, my solution has always been to just use tcc for devel even in this case since we figured out to -DNIM_EmulateOverflowChecks. Just raw line count can be a bit misleading as to compiler workload since a lot can be #include'd along the way.

As for your solution, well, in the past I have seen gcc -O0 be "only" 10-20x slower than tcc. So, 20x 0.168 seconds is still only about 3 seconds for my simplified version of @kaushalmodi 's code . So, if you get your localPassC working that might well be "fast enough". If you can't then you may have to set up all the default compiles in nim.cfg to be -O0 and maybe do a few localpasscs to turn on higher levels and have -d:danger (or my above-mentioned -d:r) mode turn on full C optimization in release mode. One way or the other, I bet there's a way to get you to under/around 5s which isn't so bad.

c-blake commented 4 years ago

By the way, mostly for @kaushalmodi who may have independently discovered this, but using gc = markAndSweep seems to work fine with tcc. So, if you couple the default setting with that GC you should be able to get quick development cycle compiles.

c-blake / cligen