ethereum / solidity

Solidity, the Smart Contract Programming Language
https://soliditylang.org
GNU General Public License v3.0
23.07k stars 5.72k forks source link

Array literals should be convertible to both dynamically-sized and statically-sized arrays #11879

Open chriseth opened 3 years ago

chriseth commented 3 years ago

We could implement this using a new type for array literals and the codegen could store the elements in memory while reserving a memory slot for the length (and just not use it for the static case).

chriseth commented 3 years ago

This could also solve the weird type deduction for array literals: Currently, [1,2,3] is a uint8[3] and uint[3] t = [1,2,3]; is invalid. If we do not assign a "proper" array type to the array literal expression right away, but rather keep it more similar to how tuple types currently work, this should be much easier to use.

ekpyron commented 3 years ago

We don't need to (nor want to?) allow naming these types, do we? Then it would only be usable in contexts in which we could deduce whether it will end up dynamically or statically sized, so code generation would already know and be able to handle the cases separately without reserving slots and the like.

chriseth commented 3 years ago

That is true, but we would need to pass that through the type checker, which is not always easy to do.

ekpyron commented 3 years ago

At one point or another, the type checker needs to check if the array literal expression is convertible to the type it'll end up in and can just annotate that then, can't it? It'd even consistently work for plain [1,2,3]; or even for [f(), g(), h()];, which wouldn't be annotated by any concrete array type, meaning that nothing needs to be stored anywhere and the subexpressions can just be evaluated and the results discarded in codegen...

chriseth commented 3 years ago

((([1,2,3]))) - it can be arbitrarily nested.

chriseth commented 3 years ago

Oh and there also is [[1,2], [4,5]]

ekpyron commented 3 years ago

Ok, maybe it's nontrivial :-). But should still be possible, shouldn't it :-)...?

And those nested single-component tuples are quite annoying everywhere, maybe we should just get rid of that once and for all :-D. But that's a different topic.

chriseth commented 3 years ago

Implementation note: Should introduce a new type similar to TupleType (maybe that one can be re-used with an additional boolean) which has one separate type per component. In TypeChecker::visit(TupleExpression, most of the checks can be removed. The type of the expression should be the new inline array type, where each component is just the type of the component.

Convertibility of this new type should be to statically-sized and dynamically-sized array type where each component type is convertible and the size matches.

Code generation should store each component in memory (and probably already convert to the target type).

Since we already need to convert to the target type anyway, we can maybe use @ekpyron 's idea of forwarding the target conversion type.

So it might be that this gets a bit more complicated.

chriseth commented 3 years ago

Forwarding the target conversion type will be very tricky, since there are too many situations in which this can happen, in my opinion.

In summary, I think we can only achieve the following without considerable effort:

We add an additional flag to ArrayType that is valid for statically-sized arrays in memory and tells that it is fine to convert this array into a dynamically-sized array. The flag is only set for array types deduced from TupleExpressions that are inline arrays for now.

Such array types are convertible to other such array types if the size matches (this is needed for two-dimensional inline arrays). They are convertible to statically-sized array types of the same size or dynamically-sized array types.

In code generation, they are handled as dynamically-sized memory arrays.

When converted to statically-sized memory arrays, we just add 0x20 to the pointer, while conversion to dynamically-sized memory arrays is a no-op.

The .length member and index access has to be done accordingly.

Another option to implement this is to make mobileType() to be the dynamically-sized memory array. I think this could require less implementation effort, but I'm not sure.

The difference between the first (A) and the second (B) way to implement it is that:

In (A), the expression [[1, 2], [4, 5]] can be converted to uint8[2][2], while it can only be converted to uint[][2] in (B). On the other hand, [[1, 2], [3]] is valid in (B), while it is invalid in (A).

ekpyron commented 3 years ago

I still don't see the problem in determining up front whether the array is static or dynamic.

We do type checking - if it's possible to type-check these things, we have to know which type it has to end up in in order to validate that it's convertible - so we do know the type statically up front anyways in that sense. The worst thing that can happen is having to store it in an additional annotation and/or pass it down through a few nesting levels just as we do for the one-element-tuple-things at a lot of places anyways...

But I may be missing something and if I'm wrong and that really is problematic to do, the reserved length slot is ok-ish I guess.

chriseth commented 3 years ago

We can do it, yes, but I fear it might cause more problems than it solves, not only in using f for uint[]; / [1,2,3].f().

ekpyron commented 3 years ago

I don't see how the implementation detail for code generation affects stuff likeusing.

chriseth commented 3 years ago

Ah then I misunderstood you, I thought you wanted to use that for type checking as well, like determining the type of [[1,2], [4,5,6]]

ekpyron commented 3 years ago

My point is that in any situation in which we actually need to write anything to memory (or anywhere), we will have had to typecheck the array literal already and thus know whether it's dynamic or static. So we can store that in an annotation and have code-generation generate properly typed code for it.

ekpyron commented 3 years ago

Whenever we don't have that, e.g. in [expr1,expr2,expr3];, we can just generate code as if expr1; expr2; expr3;, i.e. it doesn't matter if things are static or dynamic, because there really isn't any array anyways :-).

cameel commented 3 years ago

Looks like there's already an older issue about this: #9645.

nventuro commented 2 years ago

Stopping by to enthusiastically voice support for this feature: creating simple arrays is extremely un-ergonomic (see e.g.: https://github.com/balancer-labs/balancer-v2-monorepo/blob/9e7a59e2281c1c182f1c604458a1391c0dbe44df/pkg/vault/contracts/authorizer/TimelockAuthorizerMigrator.sol#L145).

Is there an estimated timeline for this feature to be complete?

cameel commented 2 years ago

Yeah, we're aware of this particular pain point and we get a lot of complaints about it. It's one of the tasks on our roadmap for Q2 2022.

chriseth commented 2 years ago

I'm on @ekpyron 's side now. Maybe I just don't see the problems anymore, though...

NeverFearTomorrow commented 2 years ago

Is there an ETA on this?

drortirosh commented 9 months ago

currently, there is a "int_constant" type, but when creating an array, it is reduced to "uint8". instead, should have an "array_constant" (with members which are "*_constant"). The array is reduced to an actual type when used, either to a dynamic or static array.

robercano commented 7 months ago

I'm also very interested in this issue. Are there any updates on it? Some way we can help moving it forward? Thanks!

cameel commented 7 months ago

This is unfortunately blocked, pending improvements in the optimizer so we have to postpone it until we have those. That will probably only happen on top of the new type system, which right now takes priority.

Here's the summary from the PR where we last attempted to implement it: https://github.com/ethereum/solidity/pull/12883#issuecomment-1174763029

The problem is the memory tracking inside DataFlowAnalyzer. For every memory write, it has to check if the memory write would overwrite any other memory state it has stored inside m_state.memory. If the memory mapping stays large, this can be quite expensive. Two ways to overcome this come to my mind right now:

  • only make use of the memory mapping if the optimizer step deriving from DataFlowAnalyzer actually uses it
  • limit the size of the memory mapping
drortirosh commented 7 months ago

I don't understand why it is an optimizer issue: the conversion from an untyped array to a typed (dynamic or static) array should be done by the compiler, even if the optimizer is completely disabled. e.g. uint[3]a = [1,2,3]; vs uint[] a = [1,2,3];

(the same is true for the conversion from <const int> to uint256)

cameel commented 7 months ago

It's because of how the IR codegen is designed. In the legcacy codegen we'd just allocate memory and write each element already at the target location. Which sounds simple, but in practice results in pretty complicated code with tons of special cases due to conversions, cleanup, multiple possible locations with different data encoding (memory, storage, transient storage), implicit conversions between nested array types, etc.

The approach in the IR codegen is to emit the most straightforward code and let the optimizer deal with it. So in this case it was basically to process each element independently, putting it in its own Yul variable and use the existing cleanup/conversion helpers that already have the logic to deal with all the complexity. Such code should be much easier to be proven correct and then the optimizer can deal with the redundancy using known and well tested transformations.

Unfortunately it turned out the unoptimized code is very likely to run into Stack Too Deep or even overflow the whole stack (which can hold at most 1024 elements) for large arrays. While we do have the stack-to-memory mover now, we don't want rely on it here because moving items to memory and them moving them again would be very inefficient way to do this. The optimizer (or Yul->EVM transform) has to be smarter about it.

Another problem, already mentioned in https://github.com/ethereum/solidity/issues/11879#issuecomment-1934167004, is that tracking so many memory writes would significantly slow down our dataflow analyzer, which is used by many optimizer steps. Even if we deemed the emitted code acceptable, it would make the optimizer even slower than it is now. We have to prevent that to make the feature viable.

drortirosh commented 7 months ago

I still think it is the wrong place. a fixed-size array is not "an optimization" over a dynamic array (of known size) - it is a completely different type, that only happens to have the same syntax at the source level. I'm not familiar with the compiler's architecture, but to my understanding, after initial parsing of the source tree (and syntax checking), it builds an AST, and out of that, it generates a YUL representation. in such a model, the array assignment (in the AST tree) is converted from a generic type to an explicit type, as required by the context. only after this AST transformation, we start a pass to convert it into yul.

ekpyron commented 7 months ago

This is indeed not an optimizer issue. The initial attempt for doing this was to still always treat array literals as statically sized and to merely allow an implicit conversion to dynamic arrays - and for that conversion we'd then have had to explicitly generate the actual conversion code - which then did become an optimizer issue, since we'd need the optimizer to generate code as if we had properly determined the type as dynamically sized from the start (nobody will want this to be as expensive as creating a static array first and then copying things over to a dynamic array).

I had argued from the beginning that we should rather consider array literals as "midly polymorphic", i.e. to allow treating them themselves as either dynamically or statically sized depending on the context as determined by type checking, and then to generate the appropriate code in the first place. That way this could indeed be solved without any involvement of or improvement in the optimizer.

So to be clear: the previous approach for implementing this failed to work out due to it relying on better optimizations. The approach I advocated from the start, though, could still be done. For that it's merely a matter of us finding resources for actually doing it (it's still non-trivial, since as of now, all expressions have a fixed type independent of their context and that invariant would break here).