AntelopeIO/cdt#102 is a long-time thorn. While it would obviously be ideal to fix cdt so it does not generate these host function calls in the first place, it's not clear when that will happen; regardless, there will be many contracts that may never be recompiled, and years of historical blocks will carry these calls forever.
This implements a workaround in EOS VM OC that identifies small, constant-size memcpy host function calls during code generation and replaces them with simple memory loads and stores plus a short call to a native function that verifies the memcpy ranges do not overlap, as required by protocol rules.
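As a simple example (the original contract snippet did not survive in this extract, so the following is a hedged sketch; `copy8` is a name I made up, assuming a plain fixed 8-byte copy matching the listing below), code along these lines is the kind of thing cdt lowers to a memcpy host function call:

```cpp
#include <cstring>

// Hypothetical sketch: a small copy whose size is a compile-time constant.
// When built with cdt, std::memcpy here becomes a memcpy host function call
// in the generated WASM; the workaround inlines it as loads/stores plus a
// range overlap check.
void copy8(char* dst, const char* src) {
    std::memcpy(dst, src, 8);  // size is constant, known at codegen time
}
```

A contract containing such a call will generate machine code (annotated by me) along these lines: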
pushq %rax
decl %gs:-18888 ;;decrement depth count and jump if zero
je 51
movq %gs:200, %rax ;;load 8 bytes from address 200 into register
movq %rax, %gs:4 ;;store 8 bytes to address 4 from register
movl $4, %edi ;;prepare the 3 parameters to check_memcpy_params
movl $200, %esi
movl $8, %edx
callq *%gs:-21056 ;;call check_memcpy_params
incl %gs:-18888 ;;increment depth count
popq %rax
retq
callq *%gs:-18992 ;;call depth_assert
(in practice the source and destination addresses are typically not constants)
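A hedged sketch of the range check behind check_memcpy_params (the function name and bool return convention here are my assumptions; the real native function would raise an error on overlap rather than return a value):

```cpp
#include <cstdint>

// Hypothetical sketch of the overlap check. The arguments mirror the
// registers prepared in the listing above: destination address, source
// address, and copy size. Protocol rules require the two ranges to be
// disjoint.
bool memcpy_ranges_disjoint(uint32_t dst, uint32_t src, uint32_t size) {
    // the half-open ranges [dst, dst+size) and [src, src+size) overlap
    // iff each start lies before the other range's end
    bool overlap = dst < src + size && src < dst + size;
    return !overlap;
}
```

For the 8-byte copy in the listing, the call would be checking dst=4, src=200, size=8, where [4, 12) and [200, 208) are disjoint.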
For some tested block ranges starting at 346025446, 348083350, 371435719, and 396859576, replay performance improves by 3.5% to 6%. This is a bit lower than I expected (I was expecting consistently north of 5%); the shortfall is because many memcpy host function calls are still not optimized out, and the overlap check can still consume a significant amount of CPU.
As a simple example, a contract performing one of these small constant-size copies generates machine code like the annotated listing above.
A run of the LLVM exhaustive workflow is at https://github.com/AntelopeIO/spring/actions/runs/11711214896