Closed: kaushalmodi closed this issue 4 years ago.
I don't think this is a real issue with `cligen` (or maybe even Nim/the Nim VM running the macro machinery). It looks like you have 96 subcommands in the slow case vs 16 in the fast case, i.e. 6x as many subcommands. The slow case is only 54/18 = 3x slower, and 54 seconds amortized over almost 100 commands is only about half a second per command. So, scaling is already sublinear in the number of commands, which I think means "drastically" in the issue title is unwarranted.
It makes sense to be a little sublinear, since those passthrough commands have no named options: most of the big case-of switch is not present, and the CPU also gets more time to warm up on a bigger task. I do not think compile time can be drastically improved without losing features. You could definitely reduce the Nim/C code footprint by unrolling `dispatchMulti` manually to have it "route" to just one leaf dispatcher for the 80 commands under different names, but we cannot do that generally, or else all the "subcommand tuning" in each `[]` of `dispatchMulti` is lost. Only you, in this one application, know you can elide all that tuning ability.
You can see how much Nim code must get compiled if you compile with `nim c -d:printDispatch` and redirect to a file. You can also see how much C code has to be built from that giant pile of Nim code by looking at the Nim cache directory. It is not at all clear to me whether the (still sublinear!) slowdown comes more from the Nim side or the C side. Having your default no-danger/release mode `nim.cfg`/`config.nims` use `gcc -O0` may get you a speed-up, and the amount of that speed-up could inform a bit (or inform more with more sophisticated profiling).
In terms of practical solutions, I suspect that with TinyCC aka `tcc` you could get a big speed-up on the C side in non-optimized edit-compile-test mode, even over `gcc -O0` or `clang -O0`. I'm not sure if it's something just in my personal `nim.cfg`/`tcc` setup, but `tcc` compilation broke for me sometime in Jan/Feb. See also the discussion over at https://github.com/c-blake/cligen/issues/137. I would be very interested to hear if you can get `tcc` working at all. I used to be able to run my `test.sh`, which compiles and runs over 50 programs, in only 2 seconds with `tcc` vs 20 seconds now with `gcc -O1`/default. So, end-to-end, Nim+tcc is about 10x faster, which is what biases me toward this being a C-compiler performance problem.
Also, `nim c -c FullyAutoMulti.nim` takes about 1 second (from a cold Nim compile `~/.cache`) while `nim c FullyAutoMulti.nim` takes 3.6 seconds. So, the C compiler is 2.6/3.6 = 72% of the problem... maybe much more for your 100-subcommand case.
I'm not sure there is much else I can say/do except express solidarity/hope that we can get tcc working again (we may need to bug Araq). Nim does do parallel C compiles. It might be possible to lower wall-clock time by having `dispatchMulti` put each generated dispatcher in a different Nim translation unit. BUT... A) I'm not sure that would even speed things up, B) it may well introduce scoping complexity that breaks other things or creates really hairy workarounds, and C) it would definitely introduce compile-time generation of new `.nim` files, which is kind of a mess.
I believe there is room for some optimization somewhere, because the whole Nim compiler compiles in less time than this application does for me. 0.5 seconds per subcommand seems like a lot.
I don't have tcc installed, but let me see if I can find the root cause of this slowdown. It's really surprising that the same gcc compiler that compiles the whole of GNU Emacs in 4-5 minutes takes a minute to compile this little CLI app.
It doesn't even seem to be a warm-up issue... I edit one character in the code and recompile, and it still takes 1 minute.
My plan B is to comment out most of the commands from the `dispatchMulti` list while I am developing new stuff in this application, but of course that's not very practical.
I'll keep you posted.
If you just guard all those passthrough commands with one big `when defined(foo)`, then it's just `nim c -d:foo`, which isn't so impractical. You could even flip the logical sense if you tend to forget `-d:foo`, so that you have to `-d:pass` to get the passthroughs.
18s is still a while, though. I'd bet with `tcc` that could be under/about 2s, which is much more fluid in human-attention terms.
Also, I think you misunderstood my warm-up comment. That was to explain the sublinear scaling: why 6x more Nim code takes merely 3x more wall time. Your alter-Nim-recompile test is more about Nim `~/.cache` efficacy. BTW, you should probably ensure that directory is on a very fast IO device, e.g. `/dev/shm`, via a symlink or via the `--nimcache:DIR` Nim compiler option.
And you should definitely try `nim c -c` and at least `gcc -O0` and `clang -O0` to track down Nim vs C compiler vs C optimizer time.
> `--nimcache:DIR` Nim compiler option.
That's a great tip. Let me try that today. By default the cache is in /tmp. Let me see if making that local fixes this.
Well, I just took your ix.io link and text-editor-massaged it into a bunch of `proc foo() = discard`, keeping the non-empty param lists for the 10 procs with non-trivial param lists. I get just a 1.3 second `nim c -c` compile time from a warm Nim compile `~/.cache`. On my machine it takes 20 seconds to get to a binary executable with `gcc -g3 -Og`, but only 10 seconds with `gcc -Og`, and then only 5 seconds with `gcc -O0`. There is a 91,000-line C file for the main module. I think tcc could compile that easily sub-second, even on a laptop and even with the implied `.h` include files. I've seen it go at over 4 million lines of code/sec. In short, I am pretty sure this is a C compiler/backend issue and not a Nim/`cligen` issue, and I'm also pretty sure that if you got `tcc` working you would be sub-2-second edit-compile-test even without the `when` guards.
Of course, it may take multiple minutes for gcc -O3 to chew on it, but that is a release/benchmark-time-only operation which is hopefully rare.
And running `clang -I/usr/lib/nim/lib -w -O0 *.c -o junk` inside `~/.cache/nim/c/prog` took only 3.3 seconds (so, +1.3 =~ 4.6 sec total).
@pb-cdunn had suggested Zig as a backend. I kinda doubt it'll be very competitive with `tcc`, but it might be a little faster than `clang -O0`.
I've used `tcc` since 2015, though, and someone should really get that working again (unless it's only broken "for me", in which case I should get my local installation working again).
Thanks, I was defaulting to building with -d:release every time. Now skipping that and using -O0 should certainly be better :)
> I've used tcc since 2015, though, and someone should really get that working again (unless it's only broken "for me" in which case I should get my local installation working again).
Have you filed a bug report for that?
Thanks.
Yeah. I bet with -O0 you get back down to 5-second territory, a 10x boost with no change to `cligen`. :-)
I have not filed a bug report, as I'm not certain if it's just "my" `tcc` (which I get from the "mob" branch at https://repo.or.cz/tinycc.git) or some more official, version-stamped one, or all of them. I strongly suspect it's a Nim issue, but it is definitely work for me to prove that. I am also unsure what version of `tcc` is "officially supported" on x86_64. Also, when it first happened, the error message took me straight to some XXX/TODO item. So, I had assumed whoever broke it knew, but that may be mistaken.
If compile times are getting you down, though, then you should definitely follow up on this. Nothing I have ever seen comes close to `tcc`. It's always been at least 10x faster than various -O0 modes. It's "one pass", with an ELF linker built right in.
Why, to get the C error messages "bolded", I patch the program itself, as piping the output to `awk` or any other colorizer slows it down by a factor of several (and it does not write a lot of output -- much of the slowdown is just from the `execvp` of another program!). So, in point of fact, "my tcc" is in some strict sense unique to me, although I would be stunned if my colorized diagnostics impact any of these failures.
Prompted by all this discussion, I looked into the error more closely: the Nim compiler emits C that does not compile on any backend, but only when `cc == "tcc"`:
```c
typedef signed long int int64_t;
typedef int64_t NI64;
typedef NI64 NI;
__builtin_saddll_overflow(NI a, NI b, (long long int*)NI* res) { }
```
In case it doesn't pop out at you: there is some kind of C type-cast syntax for `res` inside the parameter list. This is probably specific enough to file an issue for. `nim c` shouldn't generate illegal C like that for any of its backend-compiler branches.
I am certainly lost when I read that code, because I have only learnt a little bit of C, and that after I picked up Nim.
A quick update: not using `-d:release` certainly helped, even without removing those passthrough procs. The times below are after deleting the `.cache` dirs for this project.
```
time nim c src/p4x.nim
13.418u 6.403s 0:13.37 148.1% 0+0k 6240+104120io 0pf+0w

time nim -d:release c src/p4x.nim
60.441u 6.477s 0:51.63 129.5% 0+0k 4152+116104io 0pf+0w
```
I cloned and installed tcc from its latest master (tcc version 0.9.27 (x86_64 Linux)), but it failed, as you have seen.
```
time nim --cc:tcc c src/p4x.nim
CC: stdlib_assertions.nim
CC: stdlib_dollars.nim
CC: stdlib_formatfloat.nim
CC: stdlib_io.nim
CC: stdlib_system.nim
CC: stdlib_parseutils.nim
CC: stdlib_math.nim
CC: stdlib_algorithm.nim
/tmp/kmodi/.cache/nim/p4x_d/stdlib_parseutils.nim.c:222: error: invalid type
/tmp/kmodi/.cache/nim/p4x_d/stdlib_io.nim.c:256: error: invalid type
CC: stdlib_unicode.nim
/tmp/kmodi/.cache/nim/p4x_d/stdlib_math.nim.c:48: error: invalid type
/tmp/kmodi/.cache/nim/p4x_d/stdlib_system.nim.c:644: error: invalid type
CC: stdlib_strutils.nim
/tmp/kmodi/.cache/nim/p4x_d/stdlib_algorithm.nim.c:142: error: invalid type
CC: stdlib_pathnorm.nim
/tmp/kmodi/.cache/nim/p4x_d/stdlib_unicode.nim.c:98: error: invalid type
/tmp/kmodi/.cache/nim/p4x_d/stdlib_strutils.nim.c:228: error: invalid type
Error: execution of an external compiler program 'tcc -c -w -I/home/kmodi/usr_local/apps/6/nim/devel/lib -I/home/kmodi/sandbox/nim/p4x/src -o /tmp/kmodi/.cache/nim/p4x_d/stdlib_io.nim.c.o /tmp/kmodi/.cache/nim/p4x_d/stdlib_io.nim.c' failed with exit code: 1
/tmp/kmodi/.cache/nim/p4x_d/stdlib_pathnorm.nim.c:68: error: invalid type
```
Here is that `stdlib_pathnorm.nim.c`.
Actually, the above code is expanded (and trimmed) `tcc -E` output. The C source code looks like:
```c
static N_INLINE(NIM_BOOL, nimAddInt)(NI a, NI b, NI* res);
```
So, now I suspect it's likely a bug in the `tcc` C preprocessor, not a bug in Nim.
Yeah. It fails on a Nim file with just `echo 0` as the source. :-) That gives you a simpler reproduction, one with only 8,000 lines of C code.
Oh well, not using `-d:release` during development certainly works for me. Thanks for that tip.
> Yeah. It fails on a Nim file with just echo 0 as source.
You should definitely file a bug report so that this is tracked for Nim+tcc users. That would also serve as something to point the tcc devs to.
But THANK YOU for cloning tcc/jumping in. With my `echo 0` test, the failure is only at line 53 of `stdlib_io.nim.c`. Stepping through `tcc` in `gdb` might even work to track down why `tcc -E` wedges a cast in there erroneously. If we get this working, we can get your current 13.4 seconds below 2 seconds.
`tcc` is just about the simplest/cleanest compiler source code I've ever read. If you ever wanted to write your own compiler all the way to machine code, it would not be a bad starting point. I may do better becoming "part of the angry mob" of `tcc` hackers with a proposed fix than with just a bug report. ;-) Anyway, uncertainty over whether the issue was with Nim or tcc also held me back from filing a Nim github issue, and in this case the uncertainty has been resolved in favor of my laziness. ;-)
In fact, just taking that snippet I posted up in https://github.com/c-blake/cligen/issues/139#issuecomment-621813940 and feeding it to `tcc -E` reproduces the erroneous `(long long int*)NI* res`. So, that's a pretty minimal reproducing bug!
Feel free to ping here... I'd be happy to try out the tcc patch that fixes this.
Well, I misspoke about the `tcc` bug reproducer (I accidentally cut & pasted the bug itself -- oops!), but

```c
#define NIM_INTBITS 64
#include "nimbase.h"
static N_INLINE(NIM_BOOL, nimAddInt)(NI a, NI b, NI* res);
```

does reproduce the `tcc -E` problem with only about 300 lines of output. Investigating further.
Aha... Maybe you had already noticed, but the function name is also wrong. It should be `nimAddInt`, not `__builtin_saddll_overflow`. That latter thing is coming from the very bottom of `nimbase.h`: the preprocessor is mangling together the value of the last `#define` and the current expansion. `tcc` does not have `__builtin_*_overflow` anyway, and I suspect their lack of definition to be the root cause.
Anyway, while we should still file a `tcc` bug report, I got Nim+tcc compiles working again by setting, in my `~/.config/nim/nim.cfg`:

```
tcc.options.always = "-O -DNIM_EmulateOverflowChecks"
```
There is probably also a Nim issue to file, to ensure that macro gets defined in tcc-backend mode. That setting let me compile test/FullyAutoMulti.nim in 741 ms with `nim c -c`, plus another 65 ms for the C part. For a bigger test I tried that "hacked together 96 subcommand" test program I mentioned and got 1.41 seconds for the Nim part and 160 ms for the C part, or 1.57 seconds total from a cold Nim compile cache. I bet the same would work for you.
To speed you along: after you install `tcc` in your `$PATH` somewhere, what I do in my `~/.config/nim/nim.cfg` is default `cc` to `tcc` and then have a section that makes `nim c foo.nim` super fast but `nim c -d:r foo.nim` do maximum run-time-fast release mode:
```
cc = tcc          # Near the top of file; make devel mode spritely
.
.
.
# After all the `hint[Xyz]` stuff
@if r:            # Allow short alias -d:r to activate fast release mode
  define:release
  define:danger
  checks:off
  panics:on
  cc = gcc
@end
```
Also, at least with recent Nim-devel builds, I am getting run-time errors with some of my `test/` programs that go away if I add `--gc:arc`. I am unsure if these are Nim bugs or `tcc` bugs, but I suspect Nim more on that one, as I have been noticing with `gcc` backends that non-arc modes seem to be having more trouble. Anyway, I thought I should mention that you might want to try `--gc:arc` before giving up on fast `tcc` compilations.
I am in the `config.nims` camp, so I put this in `~/.config/nim/config.nims` :)
```nim
# https://github.com/c-blake/cligen/issues/139#issuecomment-621883694
switch("tcc.options.always", "-O -DNIM_EmulateOverflowChecks")
```
.. and now ..
```
time nim c --cc:tcc src/p4x.nim
4.288u 1.553s 0:06.36 91.6% 0+0k 0+60344io 0pf+0w
```
The compilation speed is now 2x!
Though, I am seeing hundreds of these warnings:
```
/tmp/kmodi/.cache/nim/p4x_d/@mp4x.nim.c:129596: warning: assignment discards qualifiers from pointer target type
```
And then of course the binary is broken too:
```
Traceback (most recent call last)
/home/kmodi/.nimble/pkgs/cligen-0.9.45/cligen.nim(937) p4x
/home/kmodi/.nimble/pkgs/cligen-0.9.45/cligen/parseopt3.nim(106) lengthen
/home/kmodi/.nimble/pkgs/cligen-0.9.45/cligen/parseopt3.nim(82) optionNormalize
/home/kmodi/usr_local/apps/6/nim/devel/lib/system/gc.nim(439) newObj
/home/kmodi/usr_local/apps/6/nim/devel/lib/system/alloc.nim(786) rawAlloc
SIGSEGV: Illegal storage access. (Attempt to read from nil?)
```
> I am getting run-time errors with some of my test/ programs that go away if I add --gc:arc
I am betting you are seeing the same errors as above. So I also tried `--gc:arc`, and while I still get those warnings, I no longer get that run-time crash. A functional binary compiled in 6 seconds! 🎉
Oh, and I did get a pretty small reproducer for the `tcc` preprocessor bug:
```c
#define N_INLINE(rettype, name) inline rettype name
#define NIM_BOOL _Bool
typedef __INT64_TYPE__ NI64;
typedef NI64 NI;
#define nimAddInt(a, b, res) __builtin_saddll_overflow(a, b, (long long int*)res)
static N_INLINE(NIM_BOOL, nimAddInt)(NI a, NI b, NI* res);
```
which preprocesses to the buggy thing with `(long long int*)NI* res` under both `gcc` and `tcc`. So, maybe it's not a C compiler bug at all, exactly. Maybe it's a dark corner of macro-expansion semantics.
> tcc.options.always = "-O -DNIM_EmulateOverflowChecks"
Is this something that should be baked into the default nim.cfg that ships with Nim?
Or is that related to the tcc bug?
You should silence C compiler warnings with `-w`. Araq always does that and pushes back against people who say otherwise. And those warnings are similar to, but distinct from, the run-time errors I saw (inside the Nim GC, illegal access, etc.). When Nim grows new GCs, bug jello gets squished. ;-)
Well, as per my last small reproducer, I'm no longer sure it's a tcc bug as much as just an unfortunate/weird C macro thing.
Either way, Nim should not have `tcc` trying to use undefined `__builtin_*_overflow`.
Hard-wiring `-DNIM_EmulateOverflowChecks` into `tcc` options is just a workaround. Nim core should make that happen for anyone using `tcc`, not just you & me.
(E.g., that one line at the end of `nimbase.h` just needs changing to have an `|| defined(__TINYC__)` in it... similar to the `_MSC_VER` handling.)
Did you want to file a Nim issue for that last bit, or do you want me to? The workaround via `tcc.options.always` should probably always work and not conflict with most things they might do in Nim-core-land.
The non-gc-arc runtime Nim bug could use some minimizing before reporting, and may even be more than one bug. Like I said, I haven't been testing with `tcc` for a couple of months, and multiple GCs are in active development.
> You should silence C compiler warnings with -w
I am using the stock tcc settings that ship with Nim plus the extra one that you suggested above. What other switches ~~do I need to~~ should I set for the tcc compiler/linker? Also, how? Can you share your nim.cfg tweaks related to tcc?
From https://nim-lang.org/blog/2020/04/03/version-120-released.html , I see this:
> The Nim compiler now implements a faster way to detect overflows based on GCC's `__builtin_sadd_overflow` family of functions. (Clang also supports these.) Some versions of GCC lack this feature and unfortunately we cannot detect this case reliably. So if you get compilation errors like "undefined reference to `__builtin_saddll_overflow`", compile your programs with `-d:nimEmulateOverflowChecks`.
So maybe this should be the default for tcc in nim.cfg?
I'll file an issue for tcc + emulateoverflowchecks
I would say just `-w` and maybe `-O`. `tcc` does not have (nor will it ever have) real optimization or the `__builtin_*_overflow` functions. `-O` just does a little quick peephole optimization.
In light of `tcc` never growing support for those gcc/clang `__builtin_*`s, just a simple, static `|| defined(__TINYC__)` in `nimbase.h` should do the trick. Why, you could even just do a pull request instead of filing an issue. Fixing `nimbase.h` is definitely better than making everyone define the macro in their `tcc.options`.
BTW, if you can really minimize that default-gc bug the recent `gc:orc` and `gc:arc` hacking has introduced, then I am very sure Nim core would care and try to fix it. Chances are good that the bug exists on all backends but only manifests on the `tcc` backend, due to being "optimized away" (or something) on the others. Some day, a year or two from now, it may be just `gc:arc` with everything else deprecated, but that time is not yet here.
> BTW, if you can really minimize that default-gc-bug the recent gc:orc and gc:arc hacking has introduced then I am very sure Nim core would care and try to fix it.
Can you please take that up?
This was a fun rabbit hole, but need to get out of it and get back to the real world :)
Also, this was the very first time I used `--gc:arc`.
Sure. No problem.
I have exactly the same problem. I measure roughly 1s per dispatchable sub-command.
When I look at the top-level `.c`, I see that each dispatcher is about 700 lines of code, but it's simple code. I suspect the real problem is that each dispatcher also gets its own closure, which is another 1200 lines of code that might be rough on the optimizer. To me, the solution is to avoid using closures.
Using `tcc` is not an option for me where I work. (Not my choice, and we need the excellent optimization of `gcc`.) I already do debug builds for my own coding, but my teammates are not quite so savvy. (I feel fortunate that they are willing to use Nim at all.) Also, I don't have control over how the unit tests are built for pull requests.
My work-around is to use `-O0` only for the top-level module, with all the cligen code. Unfortunately, that new feature does not quite work as I think it should yet:
This is really more on the C side than the Nim side. There are two levels of code generation, cligen -> Nim -> C, and it's the last step, to assembly/machine code, that's really costly. I'm not actually sure if the issue is closures or what. `procs`, which has only a mere 3 subcommands, takes a very long time to do an optimized compile: 1.43 seconds for the Nim compile but 19 seconds total. There are a few closures elsewhere in `procs.nim` besides `cligen`, though not that many. Mostly there are a slew of symbols. There could be several reasons that Nim explodes into a mountain of C that makes the C optimizer grind forever. :-( I think that in general Nim is not really optimized for "backend compile time"... and optimized C compiles definitely are not.
Anyway, my solution has always been to just use `tcc` for devel, even in this case, since we figured out `-DNIM_EmulateOverflowChecks`. Just raw line count can be a bit misleading as to compiler workload, since a lot can be `#include`'d along the way.
As for your solution: well, in the past I have seen `gcc -O0` be "only" 10-20x slower than `tcc`. So, 20x 0.168 seconds is still only about 3 seconds for my simplified version of @kaushalmodi's code. So, if you get your `localPassC` working, that might well be "fast enough". If you can't, then you may have to set up all the default compiles in nim.cfg to be `-O0`, maybe do a few `localPassC`s to turn on higher levels, and have `-d:danger` (or my above-mentioned `-d:r`) mode turn on full C optimization in release mode. One way or the other, I bet there's a way to get you under/around 5s, which isn't so bad.
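For concreteness, a hypothetical `nim.cfg` fragment of the arrangement just described: unoptimized C everywhere by default, with full optimization only behind the short release alias. The exact option spellings are illustrative, patterned on the `@if r:` block shown earlier in this thread:

```
gcc.options.always = "-w -O0"   # fast edit-compile-test cycle by default
@if r:                          # nim c -d:r foo.nim for optimized builds
  define:danger
  gcc.options.always = "-w -O3"
@end
```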
By the way, mostly for @kaushalmodi, who may have independently discovered this: using `gc = markAndSweep` seems to work fine with `tcc`. So, if you couple the default setting with that GC, you should be able to get quick development-cycle compiles.
Hello,

I happen to have a really large `dispatchMulti`: http://ix.io/2k7q/nim

In that snippet, if I do not have the subcommands after "# Commands passed directly to p4", my compilation time is 18 seconds. But with all those lines after that point, the time goes up to 54 seconds.

Each of those functions looks like this (calling that same template, but with a different `subcommand` string):

The compile-time bump due to the subcommands added to `dispatchMulti` is obvious, because if I comment out those lines after "# Commands passed directly to p4", the time goes down from 54s to 18s.

Versions
SOLUTION

It was just that I was always building with `-d:release` (which happened automatically via `nimble install`). If I simply do `nim c foo.nim`, the whole project takes only 13.37s.