`memory-safe` Assembly Worsens Optimizations

nlordell commented 7 months ago

I did my best to ensure that this issue has not already been reported. I apologise if I missed an existing issue.

Description

I cam across some specific cases where tagging assembly as memory-safe is detrimental to gas optimizations. In particular, the attached case led to ~16.6% increased gas consumption where tagging assembly as memory-safe.

Code listing

```solidity // SPDX-License-Identifier: LGPL-3.0-only pragma solidity ^0.8.20; contract Base64Url { function encode(bytes32 input) private pure returns (string memory output) { assembly ("memory-safe") { output := mload(0x40) mstore(0x40, add(output, 96)) mstore(output, 43) // Write the base-64 lookup table to the scratch space, simplifying the copy code. Note // that we write the prefix with `MSTORE` as Solidity does not have support for // `CODECOPY` from bytes or string constants in inline assembly. mstore(0, "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdef") mstore(32, "ghijklmnopqrstuvwxyz0123456789-_") let ptr := add(output, 32) for { let shift := 250 } sgt(shift, 0) { shift := sub(shift, 6) } { mstore8(ptr, byte(0, mload(and(0x3f, shr(shift, input))))) ptr := add(ptr, 1) } mstore8(ptr, byte(0, mload(and(0x3f, shl(2, input))))) } } function bench() external view returns (uint256 gas, bytes32 hash) { bytes32 input = keccak256("test"); gas = gasleft(); string memory output = encode(input); assembly ("memory-safe") { pop(staticcall(gas(), 0x2, output, mload(output), 0, 32)) hash := mload(0) } bytes memory data = abi.encode(hash); assembly ("memory-safe") { pop(staticcall(gas(), 0x2, data, mload(data), 0, 32)) hash := mload(0) } gas = gas - gasleft(); } } ```

In this particular case, removing the memory-safe tag from any of the assembly blocks with result in much more optimal code. From what I understand, having a memory unsafe block will cause the compiler to disable a class of optimization, of them one particular one which is causing it to generate non-optimal code.

I analyzed the compiler output, and it appears to be generating a lot of additional DUP, SWAP and POP stack shuffling inside the for loop when all the optimizations are enabled with the memory-safe tag, and, because the for loop iterates 42 times, causes the large gas discrepancy. Additionally, it is worth noting that the following two staticcalls to the SHA-256 precompiles are crucial to confusing the optimizer.

Some other note worthy findings that support that the optimizer is messing up the stack shuffling in this case are:

The stack shuffling issue is specific to the two staticcalls to the SHA-256 precompile that follow. In particular, if you remove any of them, the problem is no longer observed and you get the expected behaviour of code with memory-safe assembly tags being slightly more performant than potentially memory unsafe assembly.
If you force the function to not be inlined (by having it be called by more than one external function for example), then the bug goes away, and the compiler generates optimal code again. So really the issue is the assembly block followed by the two staticcalls.

(Also on an unrelated note, I noticed that it is compiling sub(shift, 6) as add(shift, 0xffff...ffa which is 31 more code bytes and also seems like a minor bug...)

Environment

Compiler version: 0.8.24 (also checked in earlier versions down to 0.8.20 inclusive)
Target EVM version (as per compiler settings): default
Framework/IDE (e.g. Truffle or Remix): Remix/Hardhat
EVM execution environment / backend / blockchain client: Remix/Hardhat
Operating system: Linux

Steps to Reproduce

See code listing :point_up:. Compiler settings:

{
  "optimizer": {
    "enabled": true,
    "runs": 100000
  },
  "viaIR": true
}

cameel commented 7 months ago

I can confirm this. I see that in the optimized Yul code the loop is identical in both cases and the difference is only visible in assembly. This means that it must be the effect of the optimized Yul->EVM transform. We fall back to the legacy transform if the assembly blocks are not memory-safe.

Honestly, I'm not sure if this is something we'll be able to fix. I think it's likely to not be a result of a bug that can be simply "fixed", but rather a trade-off that might affect other cases. We're generally working on improving the transform, so this may also improve as a result of the upcoming changes and might not make sense to target specifically. Need @ekpyron's opinion here.

Details

The stack shuffling issue is specific to the two staticcalls to the SHA-256 precompile that follow. In particular, if you remove any of them, the problem is no longer observed and you get the expected behaviour of code with memory-safe assembly tags being slightly more performant than potentially memory unsafe assembly.

Interesting. I expected that this would be caused by stuff being optimized out without the calls, but no. When I comment out the staticcalls the loop is still there, and still the same in both cases, just processed differently by the transform.

(Also on an unrelated note, I noticed that it is compiling sub(shift, 6) as add(shift, 0xffff...ffa which is 31 more code bytes and also seems like a minor bug...)

Maybe we should reconsider #6765 after all. I think it happens because Yul optimizer always converts sub() to add() and in this form this leaves the constant optimizer only the choice between add(shift, not(5)) (which is more space-efficient and chosen for lower runs values) and add(shift, 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffa) (which is cheaper to execute). Converting it back to sub(shift, 6) would be a much better alternative, since it's both cheaper and more space-efficient.

Workaround

I would not necessarily recommend it as a general solution, but as of 0.8.25 we still support the stackAllocation option in Standard JSON, which lets you select the old Yul->EVM transform. Keep in mind though that we're going to deprecate it eventually.

Repro

CLI repro using Foundry. Assumes that the contract from the issue description is in a file called base64-safe.sol.

export seed="grace crime cat remove spice bean concert lawsuit render horse collect vocal"
anvil --mnemonic "$seed"

export key=0x60b139825a56a987d58b20f0145e05dc45bed12df72cb92812b5ea988383c987

function benchmark_contract {
    solc "$1".sol --via-ir --optimize --optimize-runs 100000 --bin |
        sed -e '/^=======.*=======$/d' -e '/^Binary:$/d' -e '/^\s*$/d' > "$1".bin
    local address
    address=$(
        cast send --json --private-key "$key" --create "$(cat "$1".bin)" |
            jq --raw-output .contractAddress
    )
    gas=$(
        cast send --json --private-key "$key" "$address" "bench()" |
            jq --raw-output .gasUsed
    )
    echo "$((gas - 21000))"
}

benchmark_contract base64-safe

sed -e 's/("memory-safe")//g' base64-safe.sol > base64-unsafe.sol
benchmark_contract base64-unsafe

5104
4486

nlordell commented 7 months ago

Thanks for the detailed response.

Another possible work around would be to split the code into separate functions, one with the loop and another with the static calls, and force the loop to not be optimized somehow (we can achieve this by having an additional external function that calls the loop function). I verified it with the above code and it had the desired effect.

Is there a better way to tell the Yul optimizer "please don't inline this function" without changing the ABI of the contract?

ethereum / solidity