9.4.3. Jump Destination Validity - "code" means data accessible by Program Counter or even non-executable data?

loredanacirstea commented 2 years ago

9.4.3. Jump Destination Validity. We previously used D as the function to determine the set of valid jump destinations given the code that is being run. We define this as any position in the code occupied by a JUMPDEST instruction. All such positions must be on valid instruction boundaries, rather than sitting in the data portion of PUSH operations and must appear within the explicitly defined portion of the code (rather than in the implicitly defined STOP operations that trail it).

Consider the following assembly code:

dataSize(sub_0)
dataOffset(sub_0)
0x00
codecopy
dataSize(sub_0)
0x00
return
stop

sub_0: assembly {
  tag1
  jump
 stop
 stop
  tag1:
  0xeeeeeeee
  0x00
  mstore
  0x20
  0x00
  return
}

This translates to 0x601461000e60003960146000f3006100065600005b63eeeeeeee60005260206000f3. If this contract is deployed and it receives a transaction/call, it returns 0x00000000000000000000000000000000000000000000000000000000eeeeeeee. An example of this execution was ran at https://ropsten.etherscan.io/tx/0x6c300851fc63bb91829041acddd49c440720ca095061f4d164e53e9abce4d781.

Consider that we change the above bytecode by replacing the 2nd stop instruction with PUSH4 (0x63). We obtain 0x601461000e60003960146000f3006100065600635b63eeeeeeee60005260206000f3. If this contract is deployed and it receives a transaction/call, it reverts with Invalid JUMP destination: https://ropsten.etherscan.io/tx/0x81c674bb22b7171f1f0bd3ee98b09c35ef5e9874c7fc232821eb1b9542a363c0. This happens even though the 0x63 (PUSH4) instruction is never reachable, therefore not executable.

This behavior is not obvious, not even from the description of the Yellow Paper and I doubt it is the intended behavior. Nonetheless, clients are implementing it https://github.com/ethereumjs/ethereumjs-monorepo/blob/93c4c4ad182d4b4be54b6346aab70f615fb21bb8/packages/vm/src/evm/interpreter.ts#L263-L283.

holiman commented 2 years ago

I don't see what is unclear here. Consider the code as code or data, where data are the bytes following PUSHX. If we mark the data sections with x, then we have: 0x601461000e60003960146000f3006100065600005b63eeeeeeee60005260206000f3 -> 0x60xx61xxxx60xx3960xx60xxf30061xxxx5600005b63xxxxxxxx60xx5260xx60xxf3

The example code from the ropsten tx you linked: 0x6100065600635b63eeeeeeee60005260206000f3:0x61xxxx560063xxxxxxxxeeee60xx5260xx60xxf3

Even though the second stop is not visited, it was still just a 1-byte operation. If you replace it with a 0x63, then it is still an unvisited op, but it also turns the next four bytes into data, among them tag1 which is now no longer a jumpdest, but a "jumpdest-looking byte inside pushdata".

Clients have implemented the distinction between "code" versus "data after PUSH" from genesis, it's nothing new. It is most definitely intended. The intention is to provide code which "mutates" by offsetting the execution by jumping into pushdata, and suddenly executing sequences that behave differently because of it.

loredanacirstea commented 2 years ago

I understand your narrative. However, allow me to present a sum of arguments that may persuade you to see with increased clarity my objection:

In short: The EVM is intended as a simple, unopinionated (therefore flexible) base layer with the rules of a stack machine. The stack machine rules are applied opcode by opcode. But in this above case, you add additional rules outside the concerns of the stack machine, even before the bytecode is run. Something that is intended as unexecutable data now can become code, because it contains something that is parsed like a PUSHX, but it is not.

In the genetic code, for example: there are many cases when a gene has a common code with another gene in such a way that the same code is understood differently in each context. In such case, the control mechanisms modulate the execution. But their rules are not part of the underlying rules of the genetic machinery (while fully conforming to them). The genetic machinery has simple and unopinionated rules.
From the semantic point of view: Data can be code or non-executable data. When data is not executed, it should not be understood as opcodes (I did that in the initial question only for legibility). So STOP and PUSHX cannot be correctly inferred there, from 0x00 and 0x6X. I am aware that such 2-stage parsing may optimize execution, but at the cost of semantic purity. Plus: checking should probably not be part of the ground-level rules (see points 1 and 6).
From the economics point of view: when we need unexecutable data in the bytecode we are forced to check at compile-time for eventual PUSHX and leave X empty spaces from the executable code.
Maybe this is a good opportunity to question further:JUMPDEST seems already a waste of 1 opcode that creates additional semantic and checking problems. It does not guarantee that assets will be handled to conform to the intent of the code caller. It can be added in the assembly for readability, but stored? Probably another waste. Plus: it forces the interpreter to do a 2-pass. Presently we are more sophisticated in the area of verifiers. We may not need the crutch of this un-elegant mechanism. Let me know if I need to detail this argument. (Or maybe another (related) issue should be filled about the JUMPDEST opcode)
From the point of view of intent: I programmed recently a lot of bit mappers and therefore needed data stored in bytecode and jumped over at execution. Since I program in assembly, I can see why the Solidity programmers would not care about the details (and correctness) of the EVM itself. I did not expect that non-executable data would encroach upon the code. It is about time to circle back and have a discussion about "traditional implementations" vs integrity of intent.
From the point of view of separation of concerns: the yellow paper seems to mix computation engine and verification in this particular fragment. It is very much like including too many laws into the constitution.

loredanacirstea commented 2 years ago

@holiman, since you have not answered, it compelled me to do a little bit of digging. And I found https://github.com/ethereum/EIPs/blob/3ca4f87938f0295a06fa07eb530b4f3cd7774fcd/EIPS/eip-3540.md, which says exactly what I say above:

On-chain deployed EVM bytecode contains no pre-defined structure today. Code is typically validated in clients to the extent of JUMPDEST analysis at runtime, every single time prior to execution. This poses not only an overhead, but also a challenge for introducing new or deprecating existing features.

The first tangible feature it provides is separation of code and data. This separation is especially beneficial for on-chain code validators (like those utilised by layer-2 scaling tools, such as Optimism), because they can distinguish code and data (this includes deployment code and constructor arguments too). Currently they a) require changes prior to contract deployment; b) implement a fragile method; or c) implement an expensive and restrictive jump analysis. Code and data separation can result in ease of use and significant gas savings for such use cases. Additionally, various (static) analysis tools can also benefit, though off-chain tools can already deal with existing code, so the impact is smaller.

A non-exhaustive list of proposed changes which could benefit from this format:

Including a JUMPDEST-table (to avoid analysis at execution time) and/or removing JUMPDESTs entirely. Introducing static jumps (with relative addresses) and jump tables, and disallowing dynamic jumps at the same time. Requiring code section(s) to be terminated by STOP. (Assumptions like this can provide significant speed improvements in interpreters, such as a speed up of ~7% seen in evmone.)

Hence, our colleagues @axic, @gumb0, @chfast know very well that while code is data, not all data is code. This is very good.

holiman commented 2 years ago

Well, I didn't answer, because I didn't know what the argument was. Are you arguing for 1. Something to be clarified in YP?2. Something to be changed in Ethereum?Your first post seemed like you saw implementations doing something wrong (ethereumjs), and I just chimed in why there's nothing wrong with what they're doing, and nothing misunderstood. I agree with the EOF eip and the separate code and data. ( nevertheless, immediate data (following push) will still be in the code-section )What is it you want to achieve?

loredanacirstea commented 2 years ago

@holiman What I want to achieve is semantically correct behavior. Either: 1) Be explicit with the definition of code vs. data: https://github.com/ethereum/yellowpaper/pull/842 2) Do not be opinionated on how the upper layers must treat JUMPDEST: https://github.com/ethereum/yellowpaper/pull/843

Of course, I would prefer 2). The main reason is: it is harder for the EVM to verify itself if you impose a pre-execution analysis. And it may even be a non-breaking change to fix this.

holiman commented 2 years ago

That would require a hard-fork though. So the YP cannot be changed here.

ethereum / yellowpaper

9.4.3. Jump Destination Validity - "code" means data accessible by Program Counter or even non-executable data? #840