refactor: use memcpy for by-val aggregate type input parameters

mhasel commented 1 month ago

Aggregate VAR_INPUT arguments to function calls are now passed as pointers and memcpy'd into a local variable inside the callee, instead of being passed by value and written with a store. To achieve this, a fair amount of logic moves from the expression_generator to the pou_generator: the caller now only bitcasts an aggregate argument to a pointer (if necessary), and the function itself takes care of the required memset/memcpy. This significantly reduces allocations and generated IR in some cases, especially when passing member variables of FUNCTION_BLOCK/PROGRAM structs or when forwarding a by-ref argument to a by-val parameter. Previously the caller had to allocate a local temporary and copy the value into it before passing it to the callee; now it can pass the pointer directly.
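For readers less familiar with the codegen, the change can be pictured with a rough C analogy. This is only a sketch and not the IR that is actually emitted: `BigString`, `bar_by_value`, `bar_by_pointer` and `forward_by_ref` are hypothetical names, and the 65537-byte size assumes one extra byte for a terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for STRING[65536]; the extra terminator byte is an assumption. */
#define BIG_STRING_SIZE 65537

typedef struct {
    char data[BIG_STRING_SIZE];
} BigString;

/* Old scheme (analogy): the aggregate itself is the parameter, so every
 * call site has to materialise a full copy before the call. */
int bar_by_value(BigString val) {
    val.data[0] = 'X'; /* only the callee's copy changes */
    return (int)strlen(val.data);
}

/* New scheme (analogy): the caller passes just a pointer, and the callee
 * copies the data into its own local, which keeps by-value semantics. */
int bar_by_pointer(const BigString *val_ptr) {
    assert(val_ptr != NULL);
    BigString val;
    memcpy(&val, val_ptr, sizeof val);
    val.data[0] = 'X'; /* the caller's data stays untouched */
    return (int)strlen(val.data);
}

/* Forwarding a by-ref argument to a by-val parameter: no temporary is
 * needed in the caller, the incoming pointer is passed straight through. */
int forward_by_ref(const BigString *in) {
    return bar_by_pointer(in);
}
```

In this picture the single copy lives in the callee, so a caller that already holds a pointer (a struct member or a by-ref argument) does not need a temporary of its own.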

Using the same example as given in issue #1074

FUNCTION bar : DINT
    VAR_INPUT
        val : STRING[65536];
    END_VAR
END_FUNCTION

the llc-14 --time-passes benchmark improves significantly:

master/store:

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 69.0989 seconds (69.0998 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  64.0869 ( 93.0%)   0.0700 ( 43.7%)  64.1569 ( 92.8%)  64.1579 ( 92.8%)  X86 DAG->DAG Instruction Selection
   4.6626 (  6.8%)   0.0000 (  0.0%)   4.6626 (  6.7%)   4.6626 (  6.7%)  Machine Instruction Scheduler
   0.0767 (  0.1%)   0.0900 ( 56.2%)   0.1667 (  0.2%)   0.1667 (  0.2%)  X86 Assembly Printer

...

===-------------------------------------------------------------------------===
                      Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
  Total Execution Time: 61.4012 seconds (61.4021 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  60.5939 ( 98.7%)   0.0300 ( 75.0%)  60.6238 ( 98.7%)  60.6248 ( 98.7%)  DAG Combining 1
   0.3744 (  0.6%)   0.0000 (  0.0%)   0.3744 (  0.6%)   0.3744 (  0.6%)  Instruction Selection
   0.1517 (  0.2%)   0.0000 (  0.0%)   0.1517 (  0.2%)   0.1517 (  0.2%)  DAG Combining 2
   0.1485 (  0.2%)   0.0000 (  0.0%)   0.1485 (  0.2%)   0.1485 (  0.2%)  Instruction Scheduling
   0.0481 (  0.1%)   0.0000 (  0.0%)   0.0481 (  0.1%)   0.0481 (  0.1%)  DAG Legalization

...

memcpy:

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0016 seconds (0.0017 wall clock)

   ---User Time---   --User+System--   ---Wall Time---  --- Name ---
   0.0004 ( 23.3%)   0.0004 ( 23.3%)   0.0004 ( 23.3%)  X86 DAG->DAG Instruction Selection
   0.0004 ( 23.2%)   0.0004 ( 23.2%)   0.0004 ( 23.1%)  Expand Atomic instructions
   0.0002 ( 10.6%)   0.0002 ( 10.6%)   0.0002 ( 10.6%)  X86 Assembly Printer

...

===-------------------------------------------------------------------------===
                      Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0002 seconds (0.0002 wall clock)

   ---User Time---   --User+System--   ---Wall Time---  --- Name ---
   0.0001 ( 46.7%)   0.0001 ( 46.7%)   0.0001 ( 46.9%)  Instruction Selection
   0.0000 ( 19.5%)   0.0000 ( 19.5%)   0.0000 ( 19.8%)  DAG Combining 1
   0.0000 ( 13.8%)   0.0000 ( 13.8%)   0.0000 ( 13.5%)  Instruction Scheduling
   0.0000 ( 10.5%)   0.0000 ( 10.5%)   0.0000 ( 10.4%)  Instruction Creation
   0.0000 (  3.8%)   0.0000 (  3.8%)   0.0000 (  3.5%)  DAG Combining 2
   0.0000 (  3.3%)   0.0000 (  3.3%)   0.0000 (  3.2%)  DAG Legalization

...

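Total execution time in the pass execution timing report drops from 69.0989 s to 0.0016 s, a factor of ~40000. In the instruction selection and scheduling report it drops from 61.4012 s to 0.0002 s, a factor of ~300000.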

Resolves https://github.com/PLC-lang/rusty/issues/1074

volsa commented 3 weeks ago

As a side note, is this a good candidate for expanding our performance tests to detect potential regressions? That is, create a test case with many big aggregate types, all passed by value, and track their runtime behaviour in our dashboard?

mhasel commented 3 weeks ago

> As a side note, is this a good candidate for expanding our performance tests to detect potential regressions? That is, create a test case with many big aggregate types, all passed by value, and track their runtime behaviour in our dashboard?

Sounds good. This would also let us better test future front-end optimizations (e.g. more accurate byte alignment for memset/memcpy calls).

refactor: use memcpy for by-val aggregate type input parameters #1196