Immutable dynamic arrays

nventuro commented 2 years ago

Abstract

Add native support for dynamic immutable arrays, with a length defined at construction time (up to a maximum) and automatic bounds checks.

Motivation

immutable is a fantastic feature, allowing for highly efficient configuration access without forcing developers to use hardcoded constants, making testing and deployment much easier and enjoyable. However, there is zero support for any kind of arrays.

What I describe can be manually implemented today in solc >0.7, but it leads to a very large amount of error-prone boilerplate. No alternative exists for gas-efficient access to these values. You can look at production code using this pattern here. Using the proposed features in such a contract would reduce sour code size to less than a third of the original length, while also increasing auditability and maintainability.

Specification

'Dynamic' may be a confusing term. What I mean is that the contents and length would be dynamically defined at construction time. Read access with bounds check using the [] operator and .length would both be supported. This feature set is essentially the same as that of memory arrays, except memory arrays can have their contents mutated.

Truly dynamic immutable arrays would be quite an undertaking as the length of the runtime code would be dependent on construction arguments, but I'd argue that going that far is unnecessary. In all applications I've seen, there is a compile-time known maximum array length.

The proposed feature could therefore work as follows:

the user declares an immutable array with a maximum length N (e.g. uint256[] private immutable(50) foo)
the compiler allocates runtime code equivalent to N+1 immutable values, for each element plus the total length
during construction, both foo.length and foo[i] can be used to write (not read) to the array. For simplicity, we could allow for each value to be written multiple times and perform no bounds check (other than .length <= N and i < N).
during runtime, foo.length returns the array's length and foo[i] performs bounds check with foo.length.

Backwards Compatibility

This is a new feature using brand new (as of today invalid) syntax, so there should be no backwards compatibility implications.

Extra

Code samples

As mentioned above, this can be implemented today using standard Solidity. See for example how one might declare and use an up to 10 tokens array:

uint256 private immutable _totalTokens;

IERC20 internal immutable _token0;
IERC20 internal immutable _token1;
IERC20 internal immutable _token2;
IERC20 internal immutable _token3;
IERC20 internal immutable _token4;
IERC20 internal immutable _token5;
IERC20 internal immutable _token6;
IERC20 internal immutable _token7;
IERC20 internal immutable _token8;
IERC20 internal immutable _token9;

constructor(IERC20[] memory tokens) {
    uint256 numTokens = tokens.length;
    require(numTokens < 10);
    _totalTokens = numTokens;

    _token0 = numTokens > 0 ? tokens[0] : IERC20(0);
    _token1 = numTokens > 1 ? tokens[1] : IERC20(0);
    _token2 = numTokens > 2 ? tokens[2] : IERC20(0);
    _token3 = numTokens > 3 ? tokens[3] : IERC20(0);
    _token4 = numTokens > 4 ? tokens[4] : IERC20(0);
    _token5 = numTokens > 5 ? tokens[5] : IERC20(0);
    _token6 = numTokens > 6 ? tokens[6] : IERC20(0);
    _token7 = numTokens > 7 ? tokens[7] : IERC20(0);
    _token8 = numTokens > 8 ? tokens[8] : IERC20(0);
    _token9 = numTokens > 9 ? tokens[9] : IERC20(0);
}

function getTokenIndex(IERC20 token) public view returns (uint256) {
    if (token == _token0) { return 0; }
    else if (token == _token1) { return 1; }
    else if (token == _token2) { return 2; }
    else if (token == _token3) { return 3; }
    else if (token == _token4) { return 4; }
    else if (token == _token5) { return 5; }
    else if (token == _token6) { return 6; }
    else if (token == _token7) { return 7; }
    else if (token == _token8) { return 8; }
    else if (token == _token9) { return 9; }
    else {
        revert("Invalid token");
    }
}

This is of course very error-prone and hard to maintain, but it is the best way to get the desired behavior. With the suggested feature, we get the exact same bytecode size and performance using the following code:

IERC20[] internal immutable(10) _tokens;

constructor(IERC20[] memory tokens) {
    _tokens.length = tokens.length;

    for (uint256 i = 0; i < tokens.length; ++i) {
      _tokens[i] = tokens[i];
    }
}

function getTokenIndex(IERC20 token) public view returns (uint256) {
  for (uint256 i = 0; i < _tokens.length; ++i) {
    if (_tokens[i] == token) {
      return i;
    }
  }

  revert("invalid token");
}

ylv-io commented 2 years ago

love the code sample

jaglinux commented 2 years ago

Since it's really not dynamic array, better syntax would be IERC20[10] internal immutable _tokens;

ekpyron commented 2 years ago

Thanks for the issue - this has been on our longer term agenda ever since we introduced immutables :-) (I'm actually surprised that it wasn't requested much earlier ;-)). At least I hadn't considered a maximal length for such dynamic immutable arrays - that's an interesting idea that could make this easier (but I could also imagine that fully dynamic arrays would not be that much harder actually).

One thing to be aware of is that this will not be as cheap as value-type immutables, because we won't be able to inline the values directly - in most cases I'd expect the values to end up being codecopyd to memory before use, which has some overhead (although it will be significantly cheaper than storage of course).

In retrospect, I'd much rather have introduced what we now have as immutables as a new data location called code (besides memory, calldata and storage), which would convey more clearly what actually happens and would make it less of a concern to allow multiple assignments during construction (there's no technical reason for disallowing that - during construction the values just live in memory - it's just contrary to the meaning of "immutable"...)

nventuro commented 2 years ago

(but I could also imagine that fully dynamic arrays would not be that much harder actually).

I haven't given this much thought, so I can't quite comment on complexity of such an implementation. In all scenarios where I'd use this however, a maximum length (which also means there's some wasted bytecode on the unused slots) is good enough.

One thing to be aware of is that this will not be as cheap as value-type immutables, because we won't be able to inline the values directly - in most cases I'd expect the values to end up being codecopyd to memory before use, which has some overhead (although it will be significantly cheaper than storage of course).

Yes, while the bytecode size for immutable variables in my example would be the same, the actual code that accesses them would not (both size and gas would differ). I imagine instead of a push we'd just have codecopy + mload (with no need to expand allocated memory), which would use very little gas.

(except for .length, which would be a push)

simontianx commented 2 years ago

Just would like to bring up a particular use case of mapping to an immutable array like in mapping(uint256=>uint256[10] immutable) data. This may find applications in ordered NFT batch standards.

nventuro commented 2 years ago

Hm, I'm not quite sure how you'd do mappings as they are currently extremely storage based (since they key is just a 256 bit hash, there's no notion of the 'capacity' of the map). But you'd likely be able to implement such a thing yourself using multiple immutable arrays (a few with the actual data, and one that tells you which mapping to use), though it'd require some boilerplate similar to the one in my code sample, even if this feature was implemented.

I'd limit discussion here to the feature discussed in the opening post.

simontianx commented 2 years ago

Hm, I'm not quite sure how you'd do mappings as they are currently extremely storage based (since they key is just a 256 bit hash, there's no notion of the 'capacity' of the map).

I think it's similar to mapping(uint256=>uint256[10]) data;, except the values in a mapped array cannot be modified.

But you'd likely be able to implement such a thing yourself using multiple immutable arrays (a few with the actual data, and one that tells you which mapping to use),

Is this an immutable struct?

nventuro commented 2 years ago

Hi @ekpyron! I keep running into scenarios where such a feature would be extremely useful :) Is there an estimated timeline for this feature? Is there any work the community can do to push it forward (e.g. come up with a more detailed spec, begin implementation work, etc.)?

ekpyron commented 2 years ago

@nventuro Unfortunately, there hasn't been progress on this and we don't have an estimated timeline for it. For now I moved it up in our design backlog, which will hopefully lead to us discussing it in one of our calls next week.

The main problem is that the implementation of this won't be trivial and I guess probably either @chriseth or I would have to do it and I'm not sure when we'll find the time.

Conceptually, the main question I think is whether we want to enforce an "assign once and do not read" logic for dynamic immutable types like we do for immutables of value type. It would probably in fact be easier to implement this, if we were to treat the reference-type immutable just like a memory variable during the constructor and in that sense allow reassignments and reads from it. However, it'd be counterintuitive for them to be called immutable while they can be reassigned, even if only during creation (@chriseth objected that to such plans in the past IIRC).

ekpyron commented 2 years ago

We just discussed this in our design call and decided to make it a goal for Q2 2022.

Current rough sketch of a plan is:

start off with an implementation for bytes only and then extend
the length is dynamic without syntactic maximum and stored as if it was an immutable of value type
we don't call them immutables, but introduce a data location code for them that can be passed around by reference
the above should make cheap slicing feasible
code variables during construction will behave just like memory variables, i.e. they will be normally writable and readable in memory during contract creation.

However, we will probably have some iterations on this plan as we start implementing this.

nventuro commented 2 years ago

The code part sounds very interesting! Looking forward to how this turns out, it'll likely be the tipping point that makes me migrate to 0.8 :grin:

frangio commented 2 years ago

A code location would be an awesome addition. I was going to say before that "assign once and do not read" is not the crucial part that people care about. It's all about the efficiency of access.

frangio commented 2 years ago

Here's an extreme example: 8 sets of 30 immutable values each that could be implemented as code/immutable arrays using this feature.

d-xo commented 2 years ago

How would the immutable values be stored in the code? Having dynamically sized data in the middle of the bytecode would make symbolic analysis of such bytecode even more annoying (perhaps even impossible?) than it already is.

Here are some things that are already difficult to deal with when symbolically executing (given the existance of dynamic constructor args and static immutables):

Determining when we have executed past the end of the bytecode
Building a mapping between bytecode indices and opcode locations (for source maps)

These can currently be worked around as long as dynamic data exists at the end of the bytecode, in this case we can produce an index beyond which all bytecode is known to be a symbolic value, and error out if we attempt to execute past this index (symbolic execution of symbolic opcodes is not a tractable problem). I'm not sure how I would handle this if code could contain arbitrary pieces of symbolic data with an unknown length at any location in the code.

side note: EIP-3540 cannot come soon enough...

ekpyron commented 2 years ago

I'd say that the immutable would definitely be stored after the opcodes.

ekpyron commented 2 years ago

Hm... thinking about actually implementing this, it may be a problem to parse a code data location without making code a proper keyword, which would be breaking... that's a bit unfortunate...

frangio commented 2 years ago

In that case could the keyword be immutable?

ekpyron commented 2 years ago

In that case could the keyword be immutable?

That's a bit weird for a full proper data location...

contract C {
  uint[] immutable x;
  uint[] immutable y;
  function f() public {
    uint[] immutable z = x;
    z = y;
  }
}

immutable looks just wrong for references you can reassign and pass around :-).

But I may be able to work around the parsing problem, if we restrict to arrays only... so that would mean that the first version would only allow the new code data location for array types, not for structs and we'd only add structs for 0.9 with code being a proper keyword... but I still need to check if that works...

axic commented 2 years ago

immutable looks just wrong for references you can reassign and pass around :-).

If we had a proper reference syntax we wouldn't need this new code keyword, methinks.

ekpyron commented 2 years ago

immutable looks just wrong for references you can reassign and pass around :-).

If we had a proper reference syntax we wouldn't need this new code keyword, methinks.

Can you elaborate on that? Reference syntax and data locations are pretty orthogonal.

axic commented 2 years ago

The only reason we need code data location in that statement, because otherwise it is unclear where it should be. "Is it a copy, is it a reference?"

ekpyron commented 2 years ago

The only reason we need code data location in that statement, because otherwise it is unclear where it should be. "Is it a copy, is it a reference?"

I honestly don't understand what you're saying :-). The code data location would distinguish state variables in storage from variables in the data section of the contract code - and distinguish references of things in code from references to things in storage. I see no relation to the question of references or copies whatsoever here.

k06a commented 2 years ago

Would love to see code location modifier implemented!

k06a commented 1 year ago

Any chance to have it implemented?

cameel commented 1 year ago

Yes. It's on our roadmap, planned for Q3: #13723.

ekpyron commented 1 year ago

By the way: we may want to reconsider the name after all... it's not unlikely that one of the upcoming EVM upgrades will move from codecopy to datacopy, which robs us of the justification of the name :-).

Not that it's the most important holdup here. But open for other suggestions. Still not a fan of keeping to call it immutable due to local reference variables looking strange that way.

k06a commented 1 year ago

@ekpyron I would propose to use naming codedata because of it’s similarity to calldata and same length of the keyword.

k06a commented 1 year ago

Will we be able to use this keyword for constructor arguments?

cameel commented 1 year ago

I don't think it makes sense for arguments of constructors or external functions. What would that even mean?

k06a commented 1 year ago

Currently constructor arguments can be memory-only. I thought it could be code

cameel commented 1 year ago

Technically it could, but I don't see any benefit in that. I.e. this would mean that we reserve space for them in the memory area where the constructor is building the runtime code and copy them there from calldata. This seems pointless because the runtime code would not be able to access them (the memory is there but you cannot access constructor's parameter outside of the constructor). You could achieve the same thing explicitly by declaring an immutable and copying the constructor parameter there yourself.

Currently constructor arguments can be memory-only.

Interesting. I actually did not know about that that limitation and I don't see why it has to be there. In #9514 it looks like the thing being fixed was parameters without location specified at all and there's no explanation why calldata was excluded. It was done in 0.7.0, which was the time when calldata parameters were already allowed anywhere.

Could be that it was just considered a corner case not worth handling - in the constructor we have the whole contract bytecode in calldata so the code finding the parameters has to account for that. @chriseth do you remember the context for that change?

k06a commented 1 year ago

@cameel as far as I can see in constructor CALLDATASIZE() is zero, I believe all the byte code is located in code, not in calldata.

ekpyron commented 1 year ago

During contract creation you have a copy of the init code (including constructor arguments) in calldata, so code and calldata will contain the same data. And since calldata is more efficient than code due to having calldataload, it'd make more sense to make them calldata really (unless you want to reuse functions expecting code variables)... the main reason why we don't allow constructor arguments to actually be calldata arguments is that that'd require some custom logic, since the arguments are at "weird" offsets (past the initcode, not at offset zero), and that having them in memory makes it easier to have a derived contract pass down on-the-fly-generated data to a base constructor (however, there's no reason for just only allowing to pass unchanged calldata further to base constructors in these cases, similar to passing calldata arguments to other public functions). So there's no real blocker against this.

So yes, generally both allowing code as well as calldata as locations for constructor arguments makes sense - the latter could be done even before the code data location is implemented. It mainly hasn't been done due to time-constraints and lack-of-demand so far. If you're interested in calldata constructor arguments, you can create a separate issue for it and make a case for their utility!

k06a commented 1 year ago

@ekpyron, strange, I see msg.data.length is zero in constructor (and same for assembly calldatasize):

contract A {
    event Log(uint256);
    constructor(uint256 a, bytes memory b) {
        emit Log(msg.data.length);
    }
}

=>

{
    "from": "0xf8e81D47203A594245E36C48e151709F0C19fBe8",
    "topic": "0x909c57d5c6ac08245cf2a6de3900e2b868513fa59099b92b27d8db823d92df9c",
    "event": "Log",
    "args": {
        "0": "0"
    }
}

ekpyron commented 1 year ago

@k06a Haha, that's fascinating, you're right! I didn't actually check the spec on this, but evmone, the execution engine we use for testing, does apparently implement this with incorrect behaviour, I wouldn't have thought that possible. I'll file an issue with them - they should pass the consensus tests, so I'm a bit puzzled about how this can be.

In any case, at least code as data location for them will definitely be possible in the future.

k06a commented 1 year ago

@ekpyron what do you think about idea to introduce new EIP for CODELOAD(OFFSET) operation to extract 32 bytes of code data to stack? This could be really helpful for data stored in code/immutable.

ekpyron commented 1 year ago

@ekpyron what do you think about idea to introduce new EIP for CODELOAD(OFFSET) operation to extract 32 bytes of code data to stack? This could be really helpful for data stored in code/immutable.

Yes, that would definitely be helpful - but it's hard these days to get an EIP accepted and live. There is still a chance that EOF may happen eventually - the latest EOF spec does have both a stack-based DATALOAD(offset) and an even cheaper immediate-argument DATALOADN (which would be used to implement the current immutable mechanism). For reference, this is the latest version of the EOF spec I'm aware of: https://notes.ethereum.org/@ipsilon/B1nnZ1fl3

It'd be quite nice if all of that actually went live at some point - it would make all contracts noticeably cheaper and have quite nice analysis properties - but the EIP process is complicated, straining and unpredictable, I wouldn't wager to bet on the chances nor the ETA at this point.

k06a commented 1 year ago

I see CODELOAD(offset)/DATALOAD(offset) is equivalent to the following code:

let tmp := mload(0)     // Store the value of memory[0]
codecopy(0, offset, 32) // Copy bytecode 32 bytes to memory[0]
let result := mload(0)  // Reading 32 bytes to stack
mstore(0, tmp)          // Restore the value of memory[0]

ekpyron commented 1 year ago

I see CODELOAD(offset)/DATALOAD(offset) is equivalent to the following code:

let tmp := mload(0)     // Store the value of memory[0]
codecopy(0, offset, 32) // Copy bytecode 32 bytes to memory[0]
let result := mload(0)  // Reading 32 bytes to stack
mstore(0, tmp)          // Restore the value of memory[0]

Yes - since we reserve scratch space, we could in practice even implement it as a plain

codecopy(0, offset, 32)
let result := mload(0)

Still quite a bit away from the presumed 3 gas a codeload(offset)/dataload(offset) would cost. But it's a good enough workaround as long as we don't have any additional opcodes.

k06a commented 1 year ago

@ekpyron if you would try to simulate this instruction in Yul/Assembly, you would definitely need to backup and restore memory[0:32].

ekpyron commented 1 year ago

We reserve the memory between 0 and 0x3f for temporary calculations, like for hashing (the keccak opcode is also only defined for memory, and storage slot calculation often involve hashing of single words). So we generally don't rely on the contents of the lowest two words of memory retaining their values long term. User-defined assembly also generally may not assume that that memory range remains unchanged between two blocks, if you have high-level solidity code between them. So for practical purposes, we don't need to save and restore by our conventions of memory use. If we'd want to simulate the opcode for general use, we'd have to, but given that that's more expensive due to saving and restoring in almost all cases, I'd not actually define a simulated instruction, but rather explicitly use scratch space for this.

k06a commented 1 year ago

I understand, but no one wants the following code to break unexpectedly:

mstore(0, typehash)
mstore(0x20, codeload(offset))
let hash := keccak256(0, 0x40)

ekpyron commented 1 year ago

Sure - which is why I wouldn't define a simulated codeload opcode. It'd be nice if we had it on the EVM level, but it's not an issue to live without it and manually copy to scratch space when necessary. But yeah, maybe not too productive to digress into too many details before the base feature of the data location is done.

k06a commented 1 year ago

BTW we could define it as a function when Solidity will support global assembly blocks:

function codeload(offset) -> value {
    let tmp := mload(0)
    codecopy(0, offset, 0x20)
    value := mload(0)
    mstore(0, tmp)
}

ethereum / solidity