attempt at optimizing allocs in replace.c

lefessan commented 3 months ago

There are two steps to remove allocations:

use a circular buffer for the list of tokens
remove re-allocation of tokens passing from one stream to another

lefessan commented 3 months ago

This version passes the testsuite on my computer.

@GitMensch You may want to test it on your example to measure the number of allocations. It's supposed to allocate less than the original implementation in gnucobol 3.1, as there is no allocation of intermediate lists.

GitMensch commented 3 months ago

related to #75

GitMensch commented 3 months ago

This looks quite promising. perf stat, first for 3.2, then for 3.2 + this change:

          2,369.99 msec task-clock:u                     #    0.937 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
           356,148      page-faults:u                    #  150.274 K/sec
     6,269,470,708      cycles:u                         #    2.645 GHz
    14,738,503,488      instructions:u                   #    2.35  insn per cycle
     3,446,098,461      branches:u                       #    1.454 G/sec
         8,069,400      branch-misses:u                  #    0.23% of all branches

       2.528378960 seconds time elapsed

       1.615366000 seconds user
       0.741772000 seconds sys

          1,255.32 msec task-clock:u                     #    0.959 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
            91,721      page-faults:u                    #   73.066 K/sec
     3,975,614,937      cycles:u                         #    3.167 GHz
     9,728,980,463      instructions:u                   #    2.45  insn per cycle
     2,260,897,196      branches:u                       #    1.801 G/sec
         4,786,343      branch-misses:u                  #    0.21% of all branches

       1.308611289 seconds time elapsed

       1.039849000 seconds user
       0.208398000 seconds sys

especially look at "instructions" and "page faults".

relation to 3.1.2:

               193      page-faults:u                    #   37.357 /sec
     4,975,843,684      cycles:u                         #    0.963 GHz
     6,891,334,795      instructions:u                   #    1.38  insn per cycle

Numbers from valgrind: total heap usage	allocs	frees	bytes allocated
GC 3.1.2	5,526	5,526	20,443,861
GC 3.2	21,677,571	21,677,571	893,390,820
GC 3.2 (after this PR)	7,803,907	7,803,903	244,140,683

So still much more than 3.1.2, but definitely a good direction.

Note that you likely have some cobc_malloc somewhere, as only after this patch valgrind outputs "heap in use at exit: 320 bytes in 4 blocks". (with 3.1.2 and 3.2 valgrind outputs "0 bytes").

For comparison to the linked PR, here's the flamegraph and the cost-overview: flamegraph from this PR

If you don't have an idea how to improve more (if it helps I could drop the inlining, providing a cleaner flamegraph and cost analysis (but of course less similar to production builds of cobc), then I'd be fine with upstreaming this now (with a changelog entry, of course and maybe a minor note to NEWS that cobc's memory usage during parsing in comparison to 3.2 was decreased).

But possibly you see something from the list above leading to even more improvements?

GitMensch commented 3 months ago

Oops - leads to 4 compile-errors in NIST SM and additional

In file included from /usr/include/string.h:535,
                 from /tmp/cob88280_0.c:8:
In function 'memmove',
    inlined from 'OBIC3A_' at /tmp/cob88280_0.c:1677:5:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:36:10: warning: '__builtin_memmove' reading 20 bytes from a region of size 0 [-Wstringop-overread]
   36 |   return __builtin___memmove_chk (__dest, __src, __len,
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   37 |                                   __glibc_objsize0 (__dest));
      |                                   ~~~~~~~~~~~~~~~~~~~~~~~~~~

GitMensch commented 3 months ago

Oh - if possible (for any cases like this - where NIST fails but our internal does not): please add "the issue" for the current failure to the testsuite, as this ensures higher coverage there ("someday" we don't want to run NIST that often, but our coverage is definitely not there yet).

lefessan commented 3 months ago

I think the bug is fixed now, and I added some more optimizations that I would be interested to check their performance.

About the freeing to perform, is there a function that is calling all the freeing functions on exit ?

GitMensch commented 3 months ago

About the freeing to perform, is there a function that is calling all the freeing functions on exit ?

Yes, on different places, as long as you use the right malloc - which is possibly cobc_plex_malloc in replace.c

I think the bug is fixed now

Seems so - could you get a grasp what that triggered and add it to the internal testsuite, please?

and I added some more optimizations that I would be interested to check their performance

Will check in 7 to 17 hours and report back ;-) Maybe it is useful to check in replace.c if the outline comments in the header are correct and if there are any doxygen comments on functions are missing, as well as crafting the Changelog entry and possibly a minor NEWS one.

Note: the amount of one- or two-liner static function... looks special. Please check if these are useful and mark them as always inlined.

GitMensch commented 3 months ago

Again: much better (apart from the memory free part better than both 3.1.2 and 3.2


            654.86 msec task-clock:u                     #    0.742 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               180      page-faults:u                    #  274.869 /sec
     2,190,772,514      cycles:u                         #    3.345 GHz
     5,304,004,935      instructions:u                   #    2.42  insn per cycle
     1,309,395,192      branches:u                       #    2.000 G/sec
         4,460,189      branch-misses:u                  #    0.34% of all branches

       0.882801772 seconds time elapsed

       0.583984000 seconds user
       0.071812000 seconds sys

Numbers from valgrind: total heap usage	allocs	frees	bytes allocated
GC 3.1.2	5,526	5,526	20,443,861
GC 3.2	21,677,571	21,677,571	893,390,820
GC 3.2 (after this PR now)	4,465	4,463	20,390,950

with the flamegraph not showing the replace functions any more.

Therefore I'd say - issue fixed, only needs the "additional" parts (check use of memory-functions, reformat, in-file documentation check; handling TODOs, Changelog, ideally testsuite, possibly minor NEWS entry) then looks good to go.

For the format:

-       if (copy_repls->replace_list == NULL &&
-            copy_repls->current_list == NULL &&
-            replace_repls->replace_list == NULL &&
-            replace_repls->current_list == NULL) {
-       return cb_ppecho_direct (text, token);
-   }
+   if (copy_repls->replace_list == NULL
+    && copy_repls->current_list == NULL
+    && replace_repls->replace_list == NULL
+    && replace_repls->current_list == NULL) {
+       return cb_ppecho_direct (text, token);
+   }

It likely is useful to also compare the actual REPLACE code. I'll add a REPLACE statement + COPY REPLACING then reinspect (later, possibly in an hour).

lefessan commented 3 months ago

Great. I cleaned the PR with ChangeLog and NEWS sections. Allocations should also be correctly freed at the end.

I did not add a new entry in the testsuite, as the bug was in the data-structure token_queue, and, unless we modify this data structure, which is unlikely, it cannot be triggered anymore. The difference between the testsuite and the NIST tests came from a test in NIST that had a replacement longer than 8 tokens, which seems not to exist in our testsuite (and my token_queue implementation had a bug on that constant...).

GitMensch commented 3 months ago

make checkall also passed on the dev environment at work updated stats:

            661.16 msec task-clock:u                     #    0.735 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               178      page-faults:u                    #  269.225 /sec
     2,242,123,238      cycles:u                         #    3.391 GHz
     5,304,096,492      instructions:u                   #    2.37  insn per cycle
     1,309,409,904      branches:u                       #    1.980 G/sec
         4,453,625      branch-misses:u                  #    0.34% of all branches

       0.899029345 seconds time elapsed

       0.584002000 seconds user
       0.075081000 seconds sys

now with

HEAP SUMMARY:
   in use at exit: 0 bytes in 0 blocks
 total heap usage: 4,465 allocs, 4,465 frees, 20,390,950 bytes allocated

All heap blocks were freed -- no leaks are possible

The "simple" adjustment did work fine (one of each,. the second nearly globally), resulting in

          1,019.13 msec task-clock:u                     #    0.957 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
            50,342      page-faults:u                    #   49.397 K/sec
     3,341,557,957      cycles:u                         #    3.279 GHz
     7,982,333,596      instructions:u                   #    2.39  insn per cycle
     1,889,809,554      branches:u                       #    1.854 G/sec
         4,612,794      branch-misses:u                  #    0.24% of all branches

       1.064843192 seconds time elapsed

       0.899148000 seconds user
       0.115552000 seconds sys

 HEAP SUMMARY:
    in use at exit: 0 bytes in 0 blocks
  total heap usage: 4,285,236 allocs, 4,285,236 frees, 144,827,144 bytes allocated

All heap blocks were freed -- no leaks are possible

... but I'm not sure if my more complex test program for COPY REPLACING / REPLACE does work correctly. I should inspect that tomorrow (but still would be fine if you want to check that in upstream today).

lefessan commented 3 months ago

I added a test case for a long replacement (and a combinaison of COPY and REPLACE on the same tokens), that would trigger the same error as the NIST test.

I am not sure what you mean by your complex program: except for the two fast paths (that were already present in 3.1.2), this PR does not change the behavior of the COPY/REPLACE, so anything that worked with 3.1.2 and 3.2 should also work with this one.

GitMensch commented 3 months ago

so anything that worked with 3.1.2 and 3.2 should also work with this one

I think so - as noted this is good to be checked in upstream.

As also noted I do need to recheck the example I've created in any case.

lefessan commented 3 months ago

I am thinking about writing a blog post about all this (how we fixed the initial bug and how we optimized it to match the perf of 3.1.2), do you happen to have a flamegraph of the final version also ? It could make a nice illustration of the evolution of performances during the work...

GitMensch commented 3 months ago

Here you go.

flamegraph costs

GitMensch commented 3 months ago

The recordings for the flamegraph were taken by perf record call-graph dwarf,8096 --aio -z -- cobc B BIG.cob -o BIG.i, the analysis was done using Hotspot from KDAB.

lefessan commented 3 months ago

Thanks ! Is big.cob public ?

GitMensch commented 3 months ago

Thanks ! Is big.cob public ?

No chance. But I can share that the file is generated by Oracle Pro*COB from an (also generated) file handling SQL.
After all processing of copybooks and removing any comment or empty lines this source still has 747,203 LOC - and the whole part of preparsing the Pro*COB generated file (.cob -> .i - the part that we inspected here) now takes 661.31 msec task-clock; the complete processing (.cob with -fsyntax-only) takes 4,738.12 msec task-clock and processing + codegen (.cob-> .c) needs 7,693.71 msec task-clock.

lefessan commented 3 months ago

Merged in SVN !

GitMensch commented 3 months ago

Well done. If you write that blog entry please drop a note in the discussion boards under GnuCOBOL.

Note: while we've inspected the preprocessing here I've worked this week on the parsing speed and memory usage, especially for source files with lots of FILLERs as seen especially in generated programs and EXEC SQL preprocessed ones (speed inspected with perf/Hotspot, memory usage with Heaptrack) and hope to commit this soon as well.

... actually I hope to commit anything of the 95% stuff laying around, those tend to pile up.

OCamlPro / gnucobol

attempt at optimizing allocs in replace.c #142