c-blake / cligen

Nim library to infer/generate command-line-interfaces / option / argument parsing; Docs at
https://c-blake.github.io/cligen/
ISC License

Improvements to `strUt.tmplParse` #224

Closed: SirNickolas closed this 12 months ago

SirNickolas commented 12 months ago

tmplParse is brilliant: its syntax is so flexible, yet it can be parsed with a few memchr calls and an ident-skipping loop. Unfortunately, the current implementation is suboptimal and even buggy. I tried to fix/improve it:
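
For context, here is roughly how a caller consumes the iterator today (a minimal sketch; parameter defaults are assumed, and the literal-chunk check is the subject of item 7 below):

    import cligen/strUt        # tmplParse; the yielded slices index into fmt

    let fmt = "text $id ${file}.o ${(verb)arg} $$escaped"
    for id, arg, call in tmplParse(fmt):
      if id.a == 0:            # a literal text chunk
        stdout.write fmt[arg]
      else:                    # a macro call: identifier + optional argument
        stdout.write "[", fmt[id], "|", fmt[arg], "]"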

  1. ${(id)arg} syntax did not work at all because a1 was never assigned in the corresponding branch. If you tried to access fmt[arg], you got a Defect. Fixed.
  2. ${(}a)b} was parsed as a call with id=}a, arg=garbage, call=${(}, followed by text a)b}. That behaviour feels weird to me. I think these three options are more sensible:

    1. id=}a, arg=b, call=${(}a)b}. I.e., the first } is not treated as a closer for the whole call.
    2. id=(, arg=empty, call=${(}; followed by text a)b}. I.e., the first } does close the call, and when the parser fails to find ) inside braces, it treats the whole braced fragment as an identifier.
    3. No macro calls; text ${(}a)b}. I.e., when the parser sees this gibberish, it gives up, emits it as plain text, and moves on to the next $.

    The first option goes against the “fmt writers must pick non-colliding braces” rule. Given that it is enforced everywhere else, I doubt it makes much sense to loosen it here. Furthermore, a simple mistake (a missing )) could render the rest of the fmt invalid, along with all macro calls it might contain.

    The other options have none of the abovementioned flaws. I chose the third since it felt more natural to me; also, this is what the current implementation does when there are no more ) characters in the string (as opposed to the case when there is one, but beyond the braces). Please tell me if you’d like to see option 2 implemented.

  3. Another change in semantics: parsing ${(a} $b used to produce only text chunks. Now the $b macro call is recognized.
  4. ${()} produced a macro call with an empty id, while ${} was parsed as text. (That’s because i0 == c1 in line 437 is always false.) Fixed.
  5. The iterator does not yield empty text chunks anymore. They are hardly useful for users, and sometimes harmful (e.g., when they are pushed to a seq[string]). Not having to check for arg.len != 0 is also more convenient.
  6. tmplParse is an inline iterator, so the loop body passed to it gets inserted at every yield statement. That causes code bloat. The current implementation has 4 yields; I restructured the code so that there are only 2 of them. (Reducing to 1 is not necessary; I’ll explain in a moment.)

    As a side effect, it slightly changed the way meta char’s self-escaping is handled: previously $ was emitted on its own, and now it becomes part of the following text piece (if there is one). I actually think that auto-merging (at least some) text chunks is an improvement: it leaves less work for the renderer.

  7. The current implementation yields either (0..0, c1..<c0, 0..0) or (i0..<i1, a0..<a1, c0..<c1), and renderers are supposed to check for id.a == 0 to see whether they are given a literal text chunk or a macro call. In the former case, it is obvious that (0..0).a == 0, so the compiler is able to do constant propagation and leave only one branch. However, in the latter it fails to prove that i0 != 0 in general. Therefore, it emits a runtime check and is unable to eliminate dead code in the other branch (hence more code bloat).

    This optimization is performed by the backend compiler. I tried communicating i0 != 0 via assert, but it is dropped on the Nim side, so the backend compiler never even sees those assertions. While it is possible to employ __builtin_unreachable()/__assume(cond) for GCC/MSVC, things get harder for the JS backend. So I went with a different approach.

    I introduced iterator tmplParse2* (its name is debatable), which yields 4-element tuples: (lit, id, arg, call). lit is a bool that is true for text chunks and false for macro calls. Now inserting the correct branch at the correct yield position is trivial for the compiler (see the sketch after this list).

    I was considering making it an enum but decided it isn’t better at anything. Both changing bool into an enum and adding a new enum member are breaking changes (the latter because case statements in user code would become non-exhaustive).
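
To make item 7 concrete, dispatch over the proposed API looks roughly like this (a sketch; doRender/doCall are hypothetical caller procs):

    for lit, id, arg, call in tmplParse2(fmt):
      if lit: doRender fmt[arg]         # the branch is statically known...
      else: doCall fmt[id], fmt[arg]    # ...at each of the two yield sites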

I’ve been writing this with a disassembler in my other hand (GCC & Clang) so tiny replacements, such as < 0 vs == -1, are no accident. (< 0 compiles down to test & js or sometimes even sub & js while == -1 is unable to offer anything better than cmp & je.)

In the doc comment, definition lists are (mis)used to indent list items. I did not manage to find a way of making a definition span multiple lines so I had to keep a line longer than 80 characters.

For some reason, nim doc, when generating HTML anchors, ignores proc parameters without an explicit type annotation when they have a default value of a const. Probably a bug in the docgen. That affects tmplParse and tmplParse2 because of their ids=alphaNum parameter. I put #tmplParse2.i,openArray[char],char anchor in the doc (incorrect but currently working). Another option is to explicitly declare ids to be set[char] — the anchor will be #tmplParse2.i,openArray[char],char,set[char], as expected.
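
A minimal illustration of the two spellings (alphaNum here is an assumed stand-in for the real default):

    const alphaNum = {'a'..'z', 'A'..'Z', '0'..'9', '_'}  # assumed stand-in

    proc implicitTy(ids = alphaNum) = discard             # docgen drops the param from the anchor
    proc explicitTy(ids: set[char] = alphaNum) = discard  # anchor gains ,set[char]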

Will be glad to hear your opinion on the proposed changes.

c-blake commented 12 months ago

Just as a quick acknowledgement, I am looking at this. It is a lot to consider/go through. I'm glad you like it, though, and very much appreciate all this work! For personal reasons, it may be another 24 hours or more before I can really decide how to proceed/offer better feedback/etc.

2 yields is definitely better than 4. If you look at the VC history, you will see I originally always just returned a seq. So, those yields were later additions. I do worry a little about over-tuning micro-optimizations that may change with the next version or 3 of gcc/clang, and of course real time and instruction complexity are only vaguely related. So, we may need some benchmark on >1 CPUs to be more confident. Anyway, I'm still digesting it all, but wanted to post an initial reply.

c-blake commented 12 months ago

Oh, one other thing - I have a couple of use cases of this code in bu/tmpls.nim and bu/mk1.nim. A modified bu/tmpls in particular might be an ok start for a benchmark on the rendering side.

For the parsing side, we need monster format strings. An earlier incarnation of this code in C (a response to a friend of mine raving about Text::Template in Perl) was applied to fairly large HTML strings with embedded macros. It's pretty easy to find some large web pages and drop a few dozen randomly placed macro calls into them for that kind of measurement... Both sides might make your ideas more empirical and less theoretical.

On the more theoretical side, it's worth observing (though I suspect you get it already) that this template "language" is kind of "almost a Lisp", but with a "warning character" (an old macro-processor term). All you really need to add is a macro to define others. I find % or $, especially color-syntax highlighted, a bit easier to read, plus you get the "whatever bracket" flexibility & fast parsing. The preprocessed seq[MacroCall] is basically "byte code"; the case statement, an interpreter.
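
A sketch of that view (assuming MacroCall is the (id, arg, call) slice triple; expand is a hypothetical macro-table lookup):

    proc render(fmt: string; prog: seq[MacroCall];
                expand: proc (id, arg: string): string): string =
      for (id, arg, call) in prog:                  # "byte code" dispatch loop
        if id.a == 0: result.add fmt[arg]           # literal-chunk op
        else: result.add expand(fmt[id], fmt[arg])  # macro-call op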

It's actually very nice to have someone else looking at the impl & finding my bugs / oversights. I just have an electrical short to deal with.

SirNickolas commented 12 months ago

> A modified bu/tmpls in particular might be an ok start for a benchmark on the rendering side.

Oh, when I wrote renderer above, I actually meant the caller of the iterator. In a real-world case, it will be responsible for preprocessing the template, not for rendering it. So this PR does not affect rendering: as soon as you get a seq[MacroCall], there is no difference. For improving the rendering speed of bu/tmpls, I’d suggest to…

  1. Concatenate the templates, put term after each (including the last), and parse the result as a single template (see the sketch after this list). Upd.: if term == meta, you need to double it;
  2. Instead of MacroCall, use a specialized bytecode representation:

    type
      OpCode = enum opLit, opString, opNeedQuoted, opQuoted, opEscaped
      Instr = (OpCode, Slice[int]) # Slice is used only for lit.

    It’ll avoid the access to tmpl[id.a] for checking the op code, and it uses half as much memory (though the latter only matters if there are thousands of interpolations). And if we go crazy and allow variable-length instructions, we can pack non-lits into a single byte; that’s what std/times does, by the way.
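
A sketch of point 1 (names are mine, not from bu/tmpls):

    proc joinTemplates(tmpls: openArray[string]; term = '\n'; meta = '%'): string =
      for t in tmpls:
        result.add t
        if term == meta: result.add meta  # double the meta char so that...
        result.add term                   # ...term parses as literal text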

> It's pretty easy to find some large web pages and drop a few dozen randomly placed macro calls for that kind of measurement...

I’ll see what I can do.


> this template "language" is kind of "almost a Lisp"

Yes, it’s a mini-VM for a mini-language. I’d say a closer parallel is Tcl, because Lisps do lexing & parsing once and, from that moment on, manipulate ASTs, not strings. At least, they are able to.

SirNickolas commented 12 months ago

In the large HTML fixture, we are measuring memchr, not our code, so the impact is negligible.

I generated a file mostly (about 90%) consisting of interpolations and $$ escapes. In such a pathological case, the new implementation gives a ~5% speedup. The code being benchmarked:

import cligen/strUt
when not declared File:
  import std/syncio

proc main =
  let fmt = stdin.readAll
  var stats = 0

  template handle(lit: bool; ident, arg, call: Slice[int]) =
    stats += (if lit: arg.len else: ident.len + call.len shl 2)

  when declared tmplParse2:
    for lit, ident, arg, call in tmplParse2 fmt:
      handle lit, ident, arg, call
  else:
    for ident, arg, call in tmplParse fmt:
      handle ident.a == 0, ident, arg, call

  echo stats

when isMainModule:
  main()

c-blake commented 12 months ago

This is maybe an obvious point, but besides testing id == 0..0, it should be equally valid to test id.b == 0. { .b==0 is impossible for a real macro call since the warning character takes up at least 1 byte, at least for a non-empty format; we could probably not loop at all for an empty format; just change while true: to while eos > 0:. } Did you try id.b == 0? It might be "simpler enough" to generate code much closer to your lit: bool. { EDIT: I see you did ident.a == 0 instead, which is the same idea - sorry! }

Also, gcc PGO is pretty good at measuring a variety of dynamic code metrics and tends to obsolete things like unlikely for builds which are performance-sensitive. It's probably worth benchmarking a PGO build before changing the API (I have a nim-pgo script in my nimsearch repo). In a perfect world, trying that would be on ARM, Intel, & AMD, but whatever you have is fine. I have seen speed-up ratios like 2X with PGO, dwarfing this 5% thing. OTOH, it's partly art, not all science, and I have also seen some weird "inversions" where a change like yours that makes things faster unmeasured/non-PGO might make them slower with PGO.
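
(For reference, a generic GCC PGO build looks roughly like this; an assumed recipe, not necessarily what nim-pgo does, with training_input standing in for whatever you benchmark with:)

    nim c -d:danger --passC:-fprofile-generate --passL:-fprofile-generate tmpls.nim
    ./tmpls "%s.c %s.o %n.c %e.o" < training_input > /dev/null   # training run
    nim c -f -d:danger --passC:-fprofile-use --passL:-fprofile-use tmpls.nim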

I am working on a version of this change that does everything as you did except the API change (and explains the syntax less verbosely, without :literal: all over). If we post some code/data/data generators somewhere, I could probably measure all 3 versions - old/buggy, new 0..0, new bool - on at least 2..3 CPUs. If we really only see a 5% boost in the best case, I'm probably going to say it's not worth an API change. In my experience, deltas that small just do not survive across variant CPUs & compilers.

But I really don't want to diminish your work - huge kudos for the side-by-side generated-code/editing work! Very good form & all too rare. I do think you will probably find, with all the inlining/expansions/CSE/blah/blah, that it is much harder to do that with PGO in the mix. Maybe not impossible, but maybe hard enough to not feel as worth it. :-) PGO does often pay off big on Nim code, though. Yardanico was just speeding up the Nim compiler itself quite a bit with some PGO builds.

c-blake commented 12 months ago

I should also have said it is worthwhile having the non-PGO code be better - with the 2 yields & conditionals and all that. I just feel like a whole benchmark PGO ratio is likely to overwhelm the 5% type thing or whatever from the bool bit.

So, if perf matters to someone they can already get more out of that than the API change, and it's harder to forecast all that, especially if it's PGO --march=native with the, oh, 20 or so x86_64 variations floating around in the world. So, the API complexity/disruption may just not be worth it.

c-blake commented 12 months ago

Ok. So, here's what I did. (NOTE: /t is a symlink -> /tmp, which is a bind-mount to /dev/shm on all my systems.) To get some input I did: cd /usr/src/linux; find . -name '*.c' -print | sed 's/.c$//' > /t/input and copied that same file to 2 other systems.

Then I tried 4 versions of bu/tmpls.nim: the original id==0..0, called tmpls00 below; tmpls0, with that test replaced by id.b==0 (NOTE: not the same as id.len==0); tmpls-newCB, with all your wonderful bug-fixes/yield reductions but the same API; and tmpls2 - your new API augmented with a tmplsParsed2 and the test changed to just lit extracted from the tuple. I built with nim-pgo, which uses march=native, with the same input & command as training data for the compiler. I used my bu/tim to assess:

i7-6700k$ tim './tmpls00 "%s.c %s.o %n.c %e.o"</t/input>/n' './tmpls0 "%s.c %s.o %n.c %e.o"</t/input>/n' './tmpls2 "%s.c %s.o %n.c %e.o"</t/input>/n' './tmpls-newCB "%s.c %s.o %n.c %e.o"</t/input>/n'
(8.941 +- 0.030)e-03    ./tmpls00 "%s.c %s.o %n.c %e.o"</t/input>/n
(8.1581 +- 0.0054)e-03  ./tmpls0 "%s.c %s.o %n.c %e.o"</t/input>/n
(8.360 +- 0.030)e-03    ./tmpls2 "%s.c %s.o %n.c %e.o"</t/input>/n
(8.2608 +- 0.0086)e-03  ./tmpls-newCB "%s.c %s.o %n.c %e.o"</t/input>/n

2950X$ tim './tmpls00 "%s.c %s.o %n.c %e.o"</t/input>/n' './tmpls0 "%s.c %s.o %n.c %e.o"</t/input>/n' './tmpls2 "%s.c %s.o %n.c %e.o"</t/input>/n' './tmpls-newCB "%s.c %s.o %n.c %e.o"</t/input>/n'
0.010568 +- 0.000070    ./tmpls00 "%s.c %s.o %n.c %e.o"</t/input>/n
(9.544 +- 0.016)e-03    ./tmpls0 "%s.c %s.o %n.c %e.o"</t/input>/n
(9.966 +- 0.052)e-03    ./tmpls2 "%s.c %s.o %n.c %e.o"</t/input>/n
0.01028951 +- 0.00000049        ./tmpls-newCB "%s.c %s.o %n.c %e.o"</t/input>/n

i5-M540$ tim './tmpls00 "%s.c %s.o %n.c %e.o"</t/input>/n' './tmpls0 "%s.c %s.o %n.c %e.o"</t/input>/n' './tmpls2 "%s.c %s.o %n.c %e.o"</t/input>/n' './tmpls-newCB "%s.c %s.o %n.c %e.o"</t/input>/n'
0.020269 +- 0.000093    ./tmpls00 "%s.c %s.o %n.c %e.o"</t/input>/n
0.01799 +- 0.00011      ./tmpls0 "%s.c %s.o %n.c %e.o"</t/input>/n
0.017855 +- 0.000040    ./tmpls2 "%s.c %s.o %n.c %e.o"</t/input>/n
0.0179961 +- 0.0000036  ./tmpls-newCB "%s.c %s.o %n.c %e.o"</t/input>/n

I don't know if you ever learned or remember your uncertainty calculus, but basically uncertainties of differences add in quadrature (Pythagorean-triangle rule), and less than 5..10 sigma is not really statistically significant (that is a rule of thumb given hostile system timing noise - the distribution is neither really known nor stationary, but tim does a reproducibility sanity check).
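
E.g., using the i7-6700k numbers above (my arithmetic):

    sigma_delta = sqrt(sigma_a^2 + sigma_b^2)           # add in quadrature
    tmpls00 vs tmpls0: delta = (8.941 - 8.158)e-3 = 0.783e-3
                       sigma_delta = sqrt(0.030^2 + 0.0054^2)e-3 =~ 0.030e-3
                       delta/sigma_delta =~ 26 sigma    # clearly significant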

So, INTERPRETATION: just 1 benchmark (same versions of Nim-devel, gcc & glibc, btw) and just 1 data set, but what I see is that the only >5 sigma perf difference among all 4 versions is that id == 0..0 does indeed suck (a little). It is even over 13.5+-0.6% off (>22 sigma, for tmpls00 vs tmpls2) on that very old, very wimpy i5-M540. Essentially all other differences are within the noise of warm-up of CPU caches/branch predictors/etc. Reboot, kill your web browser, go to single-user mode, unplug your network cable, and things tend to re-arrange.

In terms of other benchmarks, doing the least amount of work per macro call, such as just counting invocations to print only a histogram of call counts, would put dispatch cost in greatest focus. The measurement errors I got were small enough to make that seem unnecessary. There is also value in measuring a "more real/end-to-end" case. I use tmpls|mk1 in real things already, for example, although the true costs there are dominated by stat system calls/work (even when optimized by my sys_batch).

Also, notice that my id.b==0 way is slightly (but not significantly) better than your lit way on new Intel, but not on old Intel or new-ish AMD.

It seems like a rare case to have a ginormous format string used only once, and as you say, memchr saves us there with non-pathological formats. Re-use over and over again, applied to novel data, seems the most likely performance-sensitive context. Now, I cannot be sure that every performance-sensitive user will pre-parse a saved answer into a seq instead of using the iterator directly, or that they will use PGO builds, but both are simple and generally good optimization advice, not just for this problem or Nim code, but for most coding.
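
I.e., the simple pattern (a sketch; preParse is a hypothetical name, assuming tmplParse yields MacroCall triples):

    import cligen/strUt

    proc preParse(fmt: string): seq[MacroCall] =
      for mc in tmplParse(fmt): result.add mc  # parse once, keep the "byte code"
    # let prog = preParse(fmt)   # ...then render many inputs against prog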

Based on this (you may get other results), changing the API seems unwarranted. Changing the example code/docs to use id.b==0, and adding a perfectly predictable while true: -> while fmt.len > 0: to ensure correctness of that change, does seem warranted (I should probably time that variant, too). What do you think? I want to give you as much git-blame credit as possible for all your great bug-fixing/simplification. So, I might just merge your changes and then merge mine reverting the API change/tweaking doc comments on top of them.

c-blake commented 12 months ago

Yeah - while eos > 0: makes no noticeable difference.. well within the run-to-run fluctuations of tim runs. tim output was (8.2964 +- 0.0080)e-03 ./tmpls-newCB2 "%s.c %s.o %n.c %e.o"</t/input>/n just for posterity, but it bounced around from 8.266 to 8.371. This could, of course, all be re-measured with a pure iterator dispatcher, which I think was your motivating case { EDIT: and as you explained already! :-) }.

c-blake commented 12 months ago

Also, I have zero objection to some template isLit(id): bool to make id.b==0 be "spelled" id.isLit (or even isLiteral). It probably should always have been so. E.g., had it been so before, all consumers of this would get the little speed boost from this id==0..0 -> id.b==0 transition automatically. Ah well.
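
I.e., roughly (a sketch; doMacro is a hypothetical handler):

    template isLit(id: Slice[int]): bool = id.b == 0

    for id, arg, call in tmplParse(fmt):
      if id.isLit: stdout.write fmt[arg]  # literal text chunk
      else: doMacro fmt[id], fmt[arg]     # macro call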

c-blake commented 12 months ago

As another example measurement, if I do: tmpls "%%s.c %%s.o %%n.c %%e.o" < input > in2 then I can generate a lot of input for the program you posted. If I do PGO builds of both the unpatched cligen & your 224-patched cligen then I get 1.377 +- 0.010 vs. 1.3729 +- 0.0056 (milliseconds). That's only a (0.3 +- 0.8)% delta on that i7-6700k. Meanwhile, even without PGO I see 1.3862 +- 0.0073 vs. 1.374 +- 0.012, or (0.9 +- 1.0)%. So, at least on that CPU / this adapted second benchmark, there is not really a (wall time) advantage to the new API either way (and PGO is not buying us much either). Obviously in bigger code contexts, CPU I-cache pressure & everything makes stuff nearly impossible to forecast.

On that i5-M540, I got old cligen 1.922 +- 0.012, sirnick-224 1.863 +- 0.021, and sirnick-224-PGO 2.741 +- 0.011 which is an example of PGO failing us - it's not a panacea. Anyway, the first delta is still only (3.2 +- 1.3)% and does not have run-to-run consistency upon re-runs of the whole outer tim command.

So, while I do get the assembly might be cleaner, I think modern CPUs almost entirely "wallpaper over" the bad-lookingness of the old API in this case.

SirNickolas commented 12 months ago

That was impressive work, investigating and explaining all of that in detail! I feel this humble patch hardly deserved it. Thank you very much. A separate thanks for the tim link.

You showed clearly how small the gap is on various hardware. I have nothing to add. Oh, and I really should try PGO out to learn when a program can benefit from it.

> and adding a perfectly predictable while true: -> while fmt.len > 0: to ensure correctness of that change does seem warranted

Checking fmt.len is not required for correctness; it just adds a fast path for an empty string, which already incurs almost no work. I think it is worth adding such a test to examples/tmpl.nim.

> (and explains the syntax less verbosely without :literal: all over)

I noticed nimdoc tried to highlight Nim syntax inside regular backquotes in its HTML output. That resulted in $<> being colorized and ()[]{} not. <Needed> with colorful brackets but [Optional] without looked inconsistent, so I put :literal: on all of them. I agree it made the rST source harder to read; feel free to drop them.

> Also, I have zero objection to some template isLit(id): bool to make id.b==0 be "spelled" id.isLit (or even isLiteral).

Checking id.isLit[eral] and then using arg might look counterintuitive, but, well, id.b == 0 requires knowledge of the API as well. I’m fine with either decision.

c-blake commented 12 months ago

Ok. Well, I just did what I said my plan was to do { though in a few more commits than it probably should have taken. ;-) }

I named it idIsLiteral in light of your counter-intuitiveness comment. I mean, it could be idImpliesArgLiteral or idMeansArgIsLiteral, but all that is kind of a mouthful. It could be a proc defined on a MacroCall, but then you aren't tuple-unpacking as much (or you force a more awkward two-stage "set to tuple, then unpack" mode on the caller). I think the new thing is at least better than the old thing. I'm not sure a perfect thing here really exists without assuming the user at least looks at the examples/documentation (which is not a rare situation in software). I guess we could put the literal into the last call slot (or even both slots) of MacroCall.

I also added a test for the empty format string "". I updated all my use cases to use the new idIsLiteral. I have some changes set to go for my bu/ package, but I probably want to release a new cligen version before I push those, so the build dependency can be clean.

c-blake commented 12 months ago

BTW, while I have tested tim on a few machines, it is possible that my background noise (from background kernel/daemon/network activity) is either less or greater than your own. So, as you start to use it on your own systems, you should probably "calibrate" it in terms of your N-sigma expectations and run-to-run consistency. Mostly the aim of that tiny program is just to sanity-check run-to-run consistency, which seems.. a rarely taken step in the modern era, at least in internet chatter (and, too often, scientific articles on software).

BTW, you may | may not have noticed, but because we use find (which only becomes memchr on C backends), we can pre-process/parse the whole format string right at compile-time. I use this fact in fmtUncertainRender. So, if a caller just goes with defaults, they only pay dispatch time, not parsing time. :-)
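
I.e., with something like the hypothetical preParse sketched earlier, nothing blocks the compile-time VM:

    const prog = preParse("$val +- $err")  # parsed entirely at compile time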