Preventing corrupted program input

lrettig commented 1 week ago

It's trivially easy to accidentally corrupt program input. All it requires is passing the wrong data types, or extra data, or data that's out of order, into the program stdin. Here's a very simple example: https://github.com/athenavm/athena/compare/main...input-corruption-example

By simply adding this:

https://github.com/athenavm/athena/blob/4201b627bf53a57430e593bab15820c33e97f991/examples/wallet/script/src/bin/execute.rs#L77-L78

We get something like this:

RUST_LOG=info cargo run                     
...
     Running `athena/athena/examples/target/debug/execute`
...
spawned a wallet program at 273711267afa1062ae0e01c2dbfe0093d6dc9696f0d9f1ed for 000dfb23b0979b4b000000000000000000000000000000000000000000000000
sending 10 coins 010101010101010101010101010101010101010101010101 -> 030303030303030303030303030303030303030303030303
...
stdout: Sending 217020518514230019 coins to [0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
2024-10-18T20:51:54.058749Z  INFO call{msg=AthenaMessage { kind: Call, depth: 1, gas: 24931856, recipient: [0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], sender: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], input_data: Some([0, 0]), value: 217020518514230019, code: [] } id=140723045993960 depth=1}: athena_interface: msg=AthenaMessage { kind: Call, depth: 1, gas: 24931856, recipient: [0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], sender: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], input_data: Some([0, 0]), value: 217020518514230019, code: [] }
thread 'main' panicked at wallet/script/src/bin/execute.rs:112:6:
sending coins: SyscallFailed(InsufficientBalance)

This is a really big footgun: this sort of mistake is easy to make and hard to detect. Can we put some guardrails in place here?

Is there any way to detect the corruption when deserializing the input data and report the issue there?

Can we go even further and store type information along with the raw bytes in stdin so we can do some type matching when reading?

Even just checking the length of the input data, as a simple sanity check, would be an improvement.

poszu commented 1 week ago

Generally, the program might choose whatever serialization method it wants. We use scale but it's not enforced anywhere, so it might not be possible to have a generic way.

Even just checking the length of the input data, as a simple sanity check, would be an improvement.

We could check if all bytes were consumed from the input.

lrettig commented 1 week ago

We could check if all bytes were consumed from the input.

This is a great starting point and seems like an easy win - although wouldn't this happen after the sort of error I'm describing above? In other words, wouldn't we have to "peek" at the length of the input before execution?

the program might choose whatever serialization method it wants

This is true and indeed this is the point of our template system and of account unification, but we can still offer a reasonable set of defaults, so I don't think this is a reason not to try to do better. You're right that maybe we can't do it in the core, but we can do it in the template itself.

The AI recommends using a serialization scheme like protobuf that allows encoding of type information. An alternative is to wrap structs in a typed message and/or use enums, e.g.:

#[derive(Serialize, Deserialize)]
#[serde(tag = "type")]
enum Message {
    SpendArguments {
        recipient: Address,
        amount: u64,
    },
    // Add other message types here
}
let message = Message::SpendArguments {
    recipient: /* recipient address */,
    amount: 1000,
};

let serialized = serde_json::to_string(&message)?;

I think this isn't a bad idea.

poszu commented 1 week ago

The AI recommends using a serialization scheme like protobuf that allows encoding of type information. An alternative is to wrap structs in a typed message and/or use enums, e.g.:

I don't think using something like JSON and protobuf is a good idea if we care about:

the size of encoded data
encoding consistency (given data should always serialize to exactly the same bytes no matter of software or system used to serialize it), which is important for example for hashing.

lrettig commented 1 week ago

We could check if all bytes were consumed from the input.

This is a great starting point and seems like an easy win - although wouldn't this happen after the sort of error I'm describing above? In other words, wouldn't we have to "peek" at the length of the input before execution?

Gave this some more thought and realized it's not possible. We can know the length of the input stream before executing the program, but we cannot know how many bytes the program will read without actually running it.

The only alternative I can think of is to require the program to self-declare up front which data it will read and in which order. That would sort of address the problem, but it puts the burden on the developer, which I don't like.

I don't think using something like JSON and protobuf is a good idea if we care about

The ideal here is a serialization codec that has cross-platform support, uses a compact encoding, and is at least partially self-describing. Or we can possibly manually add the description along the lines I described above. Will give this some more thought.

lrettig commented 5 days ago

the program might choose whatever serialization method it wants

Note that, for now at least, we've effectively enshrined SCALE here:

https://github.com/athenavm/athena/blob/244d1aaf0b3fc9627f071af7f12985b278160a72/vm/lib/src/program.rs#L5

Yes, a program could still choose to use a different encoding, but that would mean using a different SDK to write and compile the program, which is pretty unrealistic.

athenavm / athena

Preventing corrupted program input #162