Closed lefessan closed 3 months ago
This version passes the testsuite on my computer.
@GitMensch You may want to test it on your example to measure the number of allocations. It's supposed to allocate less than the original implementation in gnucobol 3.1, as there is no allocation of intermediate lists.
related to #75
This looks quite promising. perf stat
, first for 3.2, then for 3.2 + this change:
2,369.99 msec task-clock:u # 0.937 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
356,148 page-faults:u # 150.274 K/sec
6,269,470,708 cycles:u # 2.645 GHz
14,738,503,488 instructions:u # 2.35 insn per cycle
3,446,098,461 branches:u # 1.454 G/sec
8,069,400 branch-misses:u # 0.23% of all branches
2.528378960 seconds time elapsed
1.615366000 seconds user
0.741772000 seconds sys
1,255.32 msec task-clock:u # 0.959 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
91,721 page-faults:u # 73.066 K/sec
3,975,614,937 cycles:u # 3.167 GHz
9,728,980,463 instructions:u # 2.45 insn per cycle
2,260,897,196 branches:u # 1.801 G/sec
4,786,343 branch-misses:u # 0.21% of all branches
1.308611289 seconds time elapsed
1.039849000 seconds user
0.208398000 seconds sys
especially look at "instructions" and "page faults".
relation to 3.1.2:
193 page-faults:u # 37.357 /sec
4,975,843,684 cycles:u # 0.963 GHz
6,891,334,795 instructions:u # 1.38 insn per cycle
Numbers from valgrind: total heap usage | allocs | frees | bytes allocated |
---|---|---|---|
GC 3.1.2 | 5,526 | 5,526 | 20,443,861 |
GC 3.2 | 21,677,571 | 21,677,571 | 893,390,820 |
GC 3.2 (after this PR) | 7,803,907 | 7,803,903 | 244,140,683 |
So still much more than 3.1.2, but definitely a good direction.
Note that you likely have some cobc_malloc
somewhere, as only after this patch valgrind outputs "heap in use at exit: 320 bytes in 4 blocks". (with 3.1.2 and 3.2 valgrind outputs "0 bytes").
For comparison to the linked PR, here's the flamegraph and the cost-overview:
If you don't have an idea how to improve more (if it helps I could drop the inlining, providing a cleaner flamegraph and cost analysis (but of course less similar to production builds of cobc), then I'd be fine with upstreaming this now (with a changelog entry, of course and maybe a minor note to NEWS that cobc's memory usage during parsing in comparison to 3.2 was decreased).
But possibly you see something from the list above leading to even more improvements?
Oops - leads to 4 compile-errors in NIST SM and additional
In file included from /usr/include/string.h:535,
from /tmp/cob88280_0.c:8:
In function 'memmove',
inlined from 'OBIC3A_' at /tmp/cob88280_0.c:1677:5:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:36:10: warning: '__builtin_memmove' reading 20 bytes from a region of size 0 [-Wstringop-overread]
36 | return __builtin___memmove_chk (__dest, __src, __len,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
37 | __glibc_objsize0 (__dest));
| ~~~~~~~~~~~~~~~~~~~~~~~~~~
Oh - if possible (for any cases like this - where NIST fails but our internal does not): please add "the issue" for the current failure to the testsuite, as this ensures higher coverage there ("someday" we don't want to run NIST that often, but our coverage is definitely not there yet).
I think the bug is fixed now, and I added some more optimizations that I would be interested to check their performance.
About the freeing to perform, is there a function that is calling all the freeing functions on exit ?
About the freeing to perform, is there a function that is calling all the freeing functions on exit ?
Yes, on different places, as long as you use the right malloc - which is possibly cobc_plex_malloc
in replace.c
I think the bug is fixed now
Seems so - could you get a grasp what that triggered and add it to the internal testsuite, please?
and I added some more optimizations that I would be interested to check their performance
Will check in 7 to 17 hours and report back ;-) Maybe it is useful to check in replace.c if the outline comments in the header are correct and if there are any doxygen comments on functions are missing, as well as crafting the Changelog entry and possibly a minor NEWS one.
Note: the amount of one- or two-liner static function... looks special. Please check if these are useful and mark them as always inlined.
Again: much better (apart from the memory free part better than both 3.1.2 and 3.2
654.86 msec task-clock:u # 0.742 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
180 page-faults:u # 274.869 /sec
2,190,772,514 cycles:u # 3.345 GHz
5,304,004,935 instructions:u # 2.42 insn per cycle
1,309,395,192 branches:u # 2.000 G/sec
4,460,189 branch-misses:u # 0.34% of all branches
0.882801772 seconds time elapsed
0.583984000 seconds user
0.071812000 seconds sys
Numbers from valgrind: total heap usage | allocs | frees | bytes allocated |
---|---|---|---|
GC 3.1.2 | 5,526 | 5,526 | 20,443,861 |
GC 3.2 | 21,677,571 | 21,677,571 | 893,390,820 |
GC 3.2 (after this PR now) | 4,465 | 4,463 | 20,390,950 |
with the flamegraph not showing the replace functions any more.
Therefore I'd say - issue fixed, only needs the "additional" parts (check use of memory-functions, reformat, in-file documentation check; handling TODOs, Changelog, ideally testsuite, possibly minor NEWS entry) then looks good to go.
For the format:
- if (copy_repls->replace_list == NULL &&
- copy_repls->current_list == NULL &&
- replace_repls->replace_list == NULL &&
- replace_repls->current_list == NULL) {
- return cb_ppecho_direct (text, token);
- }
+ if (copy_repls->replace_list == NULL
+ && copy_repls->current_list == NULL
+ && replace_repls->replace_list == NULL
+ && replace_repls->current_list == NULL) {
+ return cb_ppecho_direct (text, token);
+ }
It likely is useful to also compare the actual REPLACE
code. I'll add a REPLACE
statement + COPY REPLACING
then reinspect (later, possibly in an hour).
Great. I cleaned the PR with ChangeLog and NEWS sections. Allocations should also be correctly freed at the end.
I did not add a new entry in the testsuite, as the bug was in the data-structure token_queue
, and, unless we modify this data structure, which is unlikely, it cannot be triggered anymore. The difference between the testsuite and the NIST tests came from a test in NIST that had a replacement longer than 8 tokens, which seems not to exist in our testsuite (and my token_queue
implementation had a bug on that constant...).
make checkall
also passed on the dev environment at work updated stats:
661.16 msec task-clock:u # 0.735 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
178 page-faults:u # 269.225 /sec
2,242,123,238 cycles:u # 3.391 GHz
5,304,096,492 instructions:u # 2.37 insn per cycle
1,309,409,904 branches:u # 1.980 G/sec
4,453,625 branch-misses:u # 0.34% of all branches
0.899029345 seconds time elapsed
0.584002000 seconds user
0.075081000 seconds sys
now with
HEAP SUMMARY:
in use at exit: 0 bytes in 0 blocks
total heap usage: 4,465 allocs, 4,465 frees, 20,390,950 bytes allocated
All heap blocks were freed -- no leaks are possible
The "simple" adjustment did work fine (one of each,. the second nearly globally), resulting in
1,019.13 msec task-clock:u # 0.957 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
50,342 page-faults:u # 49.397 K/sec
3,341,557,957 cycles:u # 3.279 GHz
7,982,333,596 instructions:u # 2.39 insn per cycle
1,889,809,554 branches:u # 1.854 G/sec
4,612,794 branch-misses:u # 0.24% of all branches
1.064843192 seconds time elapsed
0.899148000 seconds user
0.115552000 seconds sys
HEAP SUMMARY:
in use at exit: 0 bytes in 0 blocks
total heap usage: 4,285,236 allocs, 4,285,236 frees, 144,827,144 bytes allocated
All heap blocks were freed -- no leaks are possible
... but I'm not sure if my more complex test program for COPY REPLACING
/ REPLACE
does work correctly.
I should inspect that tomorrow (but still would be fine if you want to check that in upstream today).
I added a test case for a long replacement (and a combinaison of COPY and REPLACE on the same tokens), that would trigger the same error as the NIST test.
I am not sure what you mean by your complex program: except for the two fast paths (that were already present in 3.1.2), this PR does not change the behavior of the COPY/REPLACE, so anything that worked with 3.1.2 and 3.2 should also work with this one.
so anything that worked with 3.1.2 and 3.2 should also work with this one
I think so - as noted this is good to be checked in upstream.
As also noted I do need to recheck the example I've created in any case.
I am thinking about writing a blog post about all this (how we fixed the initial bug and how we optimized it to match the perf of 3.1.2), do you happen to have a flamegraph of the final version also ? It could make a nice illustration of the evolution of performances during the work...
Here you go.
The recordings for the flamegraph were taken by perf record call-graph dwarf,8096 --aio -z -- cobc B BIG.cob -o BIG.i
, the analysis was done using Hotspot from KDAB.
Thanks ! Is big.cob
public ?
Thanks ! Is
big.cob
public ?
No chance. But I can share that the file is generated by Oracle Pro*COB from an (also generated) file handling SQL.
After all processing of copybooks and removing any comment or empty lines this source still has 747,203 LOC - and the whole part of preparsing the Pro*COB generated file (.cob -> .i - the part that we inspected here) now takes 661.31 msec task-clock; the complete processing (.cob with -fsyntax-only
) takes 4,738.12 msec task-clock and processing + codegen (.cob-> .c) needs 7,693.71 msec task-clock.
Merged in SVN !
Well done. If you write that blog entry please drop a note in the discussion boards under GnuCOBOL.
Note: while we've inspected the preprocessing here I've worked this week on the parsing speed and memory usage, especially for source files with lots of FILLERs as seen especially in generated programs and EXEC SQL preprocessed ones (speed inspected with perf/Hotspot, memory usage with Heaptrack) and hope to commit this soon as well.
... actually I hope to commit anything of the 95% stuff laying around, those tend to pile up.
There are two steps to remove allocations: