ethereum / solidity

Solidity, the Smart Contract Programming Language
https://soliditylang.org
GNU General Public License v3.0
22.64k stars 5.61k forks source link

Deduplicate bytecode dependencies used by both creation and deployed object #15178

Open cameel opened 3 weeks ago

cameel commented 3 weeks ago

Abstract

When a contract deploys another contract (via new) or accesses its bytecode (via .runtimeObject or .creationCode), the compiler embeds that bytecode in the accessing contract. Depending on whether the contract is accessed at creation time or at runtime, its bytecode ends up as a subassembly of the creation or runtime assembly, respectively. However, when it is accessed at both times it ends up being included in both places.

This happens in both the legacy and the IR pipeline.

Details

This behavior can be clearly seen in the IR codegen. It's clear that there's no attempt at deduplication: https://github.com/ethereum/solidity/blob/8a97fa7a1db1ec509221ead6fea6802c684ee887/libsolidity/codegen/ir/IRGenerator.cpp#L184 https://github.com/ethereum/solidity/blob/8a97fa7a1db1ec509221ead6fea6802c684ee887/libsolidity/codegen/ir/IRGenerator.cpp#L206

Motivation

This duplication not only increases bytecode size but also requires the compiler to optimize the same code twice, which is especially problematic for the IR pipeline.

How to reproduce

// SPDX-License-Identifier: GPL-3.0
pragma solidity *;

contract C {
    function get() public pure returns (string memory){
        return
            "DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD"
            "DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD"
            "DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD"
            "DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD";
    }
}

contract D {
    constructor() {
        new C();
    }

    function deploy() public {
        new C();
    }
}
solc test.sol --bin --optimize | fold --width 115
======= test.sol:C =======
Binary:
6080604052348015600e575f80fd5b506102158061001c5f395ff3fe608060405234801561000f575f80fd5b5060043610610029575f3560e01
c80636d4ce63c1461002d575b5f80fd5b61003561004b565b604051610042919061006e565b60405180910390f35b6060604051806101600160
40528061013c81526020016100a461013c9139905090565b602081525f82518060208401528060208501604085015e5f6040828501015260406
01f19601f8301168401019150509291505056fe4444444444444444444444444444444444444444444444444444444444444444444444444444
4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444a264697066735822122
08fe7efa4f2091676f1aa8adfa67663f3c526fd2980562068555a823a144492ae64736f6c634300081a0033

======= test.sol:D =======
Binary:
6080604052348015600e575f80fd5b506102cc8061001c5f395ff3fe6080604052348015600e575f80fd5b50600436106026575f3560e01c806
3775c300c14602a575b5f80fd5b60306032565b005b604051603c906058565b604051809103905ff0801580156054573d5f803e3d5ffd5b5050
565b610231806100668339019056fe6080604052348015600e575f80fd5b506102158061001c5f395ff3fe608060405234801561000f575f80f
d5b5060043610610029575f3560e01c80636d4ce63c1461002d575b5f80fd5b61003561004b565b604051610042919061006e565b6040518091
0390f35b606060405180610160016040528061013c81526020016100a461013c9139905090565b602081525f825180602084015280602085016
04085015e5f604082850101526040601f19601f8301168401019150509291505056fe4444444444444444444444444444444444444444444444
4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444
44444444444a26469706673582212208fe7efa4f2091676f1aa8adfa67663f3c526fd2980562068555a823a144492ae64736f6c634300081a00
33a2646970667358221220550677e0803a543f7b74c8ba1d63bb1aee078a72d6f8de64f7251927eefa15c664736f6c634300081a0033

Bytecode of C clearly is included inside bytecode of D twice. The long string included in the source can be seen twice as two long sequences of 0x44.

The effect is similar with --via-ir, though the string is not as easily visible. Still, it can be easily confirmed by removing D's constructor - it cuts the side of D' bytecode almost in half.

Possible solutions

  1. Embed the dependency only once and have both creation and deployment refer to the same subassembly.
  2. Alternatively, we could also consider deduplicating the subassemblies in the EVM asm optimizer.

Backwards Compatibility

I can't see any backwards compatibility issues here.

cameel commented 3 weeks ago

As pointed out by @ekpyron, we do have some kind of deduplication at assembly level already: https://github.com/ethereum/solidity/blob/5da0f47439340097fe5b8a409caadc4fd3f00752/libevmasm/Assembly.cpp#L1053-L1071

Before fixing this we should check why it doesn't kick in here. Does the generated bytecode end up being different?

ekpyron commented 3 weeks ago

It may be that subRef in the existing duplication only extends to one nesting level, so it may be that we don't deduplicate a subassembly with an indentical subassembly of a subassembly that way. In cases like that, we also need to be careful about this: the outer assembly can reuse the inner one (the subassembly of a subassembly), but we can't change the nested assembly (i.e. the subassembly who expects it as its own subassembly cannot instead refer to the outside assembly) - both because it won't have access to it and because of determinism.