Sync Call Abort Conditions

b1bart commented 5 years ago

Concern

I'm concerned about the fragility of contracts using this feature. It seems like there are plenty of abort cases that have no justification or supporting reasoning:

It looks like it aborts if:

the called contract doesn't have a proper exported function
the this is a re-entrant call
a call is made while there is an unread result on the transaction
the transaction is deferred
the callee does not know the function
the callee does not return a value
the callee does something illegal

Requiring out-of-band trust

1,2,3,6,7 and 8 are all controlled by a potentially third-party contract which presumably can shift underneath a given caller contracts feet. In the current spec the caller has no ability to protect themselves from a callee who decides to disrupt there operations meaning you have to establish out-of-band trust to make use of such a feature "safely".

I would propose that we alter this to allow the caller to handle errors caused by the callee perhaps by returning a known value on failure to the caller so that they can handle it (and aborting if they call get_sync_result after such a failure. This means the naive implementation will still abort as designed but a sophisticated implementation would have an option to recover.

read-only re-entrancy concerns?

I'm making an assumption that (2) was put in place to handle the famous re-entrancy hacks that other platforms have suffered. However, I am not convinced that a read-only re-entrancy attack can have the same issue. Granted, it could exhaust stack and abort the transaction but, there are several ways in the current proposal for a malicious callee to do that and if we add a way to handle errors they could handle these as well.

Do we have a concrete concern of re-entrancy issues within a read-only context?

Deferred Transaction concerns?

Do we have actual concerns with the use of this feature inside a deferred transaction or was this meant to add weight to deprecating deferred transactions?

References

https://github.com/EOSIO/eep-proposal-staging/blob/de46cb9396217b3d013ec81fe19cc68f91d9747a/eep-draft_synchronous_calls.md#L73-L76

https://github.com/EOSIO/eep-proposal-staging/blob/de46cb9396217b3d013ec81fe19cc68f91d9747a/eep-draft_synchronous_calls.md#L113-L117

[EDIT] I missed an abort condition but its slides into the broader category fine

tbfleming commented 5 years ago

We didn't include this because it dives into implementation detail: some of the restrictions enable an optimization where after the callee returns, it's kept in a zombie-like state, allowing get_sync_result to do a memcpy directly from the callee's linear memory to the caller's linear memory. If the optimization doesn't end up needing the entire restriction set, we can relax them.

I'm OK with replacing aborts with an error status, if it doesn't create implementation headaches (@arhag?). Note that inline actions will also cause aborts when the receiver doesn't support an action (upcoming CDT dispatcher feature).

Deferred transaction aborts: there's a ton of analysis work into verifying new features are safe when used inside deferred transactions. We're trying to reduce that load. Actually, I think we were ordered to reduce that load.

Recursion: we went round and round coming up with vulnerabilities and potential restrictions to solve them (unrelated to the above optimization). The restrictions snowballed into a very complex diagram about when contracts could do what when. It's why it states:

we may address those in the future

tbfleming commented 5 years ago

Hint on a major vulnerability: multi_index's insistence on caching everything.

b1bart commented 5 years ago

Deferred transaction aborts: there's a ton of analysis work into verifying new features are safe when used inside deferred transactions. We're trying to reduce that load. Actually, I think we were ordered to reduce that load.

@bytemaster is this a mandate going forward? all work will be gated by the disabling of deferred transactions?

Recursion: we went round and round coming up with vulnerabilities and potential restrictions to solve them (unrelated to the above optimization).

Did we document these anywhere? That is valuable information if we ever attempt to address them in the future.

it's kept in a zombie-like state

This seems more expensive to me than the memcpy to a holding buffer. a naive context can be 33MiB of memory and in terms of virtual memory for some of the other architectures we are considering it is substantially more. This seems like a premature optimization that is affecting the usability of the feature. Do we have any hard evidence to suggest that this is a worthwhile optimization to make? If not, considering how many other "problems" we may find during implementation that may restrict a spec, I'm not sure why we would opt to pre-commit to one such problem that may be imagined.

arhag commented 5 years ago

I would propose that we alter this to allow the caller to handle errors caused by the callee perhaps by returning a known value on failure to the caller so that they can handle it (and aborting if they call get_sync_result after such a failure. This means the naive implementation will still abort as designed but a sophisticated implementation would have an option to recover.

I like that. But the cautious way of handling this (and all the reentrancy and inconsistent state issues it could bring along) would be if we still disallow re-entering into the original caller contract within a read-only context and to clean up the WASM linear memory of all contracts in the call stack (other than the original caller) if such an abort occurs.

Do we have a concrete concern of re-entrancy issues within a read-only context?

The WASM linear memory of any instantiated accounts persists over the lifetime of an action (and this choice is to enable optimizations in the implementation). So even though the execution is in a read-only context, the contracts are still free to modify the linear memory and have that persist until the next time execution re-enters that instance because of a read-only sync call. The read-only context basically only restricts the contract from using some subset of the intrinsics.

So at least in theory a developer who doesn't know about the potential pitfalls of reentrancy can still write flawed code with the read-only sync call mechanism. For example, if contract A calls B which calls C which calls B, one can imagine the following scenario: in the first call to B it grabs the value of the first of two globals for which some invariant holds and then makes the call to C before then grabbing the value of the second global; the call to C results in calling B for the second time which may modify the values of the two globals (the invariant would hold for the two new values); but back in the first call to B after it called C and grabbed the second global variable value, it may use these two values with the assumption that the invariant holds even though it may not.

That was a contrived example, and maybe we don't care about protecting the developers from mistakes like this. And I'm struggling to remember the concrete example @tbfleming and I discussed that convinced me to give up allowing partial re-entrancy support at least for now just to be cautious (with the understanding that we could relax it later).

tbfleming commented 5 years ago

That was a contrived example, and maybe we don't care about protecting the developers from mistakes like this. And I'm struggling to remember the concrete example @tbfleming and I discussed that convinced me to give up allowing partial re-entrancy support at least for now just to be cautious (with the understanding that we could relax it later).

Contracts which hold a multi_index in a singleton.

arhag commented 5 years ago

Contracts which hold a multi_index in a singleton.

Yes, but I'm still not remembering how that became a problem with reentrancy allowed in a read-only context.

tbfleming commented 5 years ago

We had to either:

Ban the first contract from being called, or
Spawn a 2nd instance, which could get confused by incomplete database changes

tbfleming commented 5 years ago

Plus, if we allow recursion now, it could limit our future attempts to come up with read-write sync calls.

arhag commented 5 years ago

Ban the first contract from being called,

Is this because we were concerned about the sync call originating in the middle of a database mutating operation within multi_index?

Spawn a 2nd instance, which could get confused by incomplete database changes

Is this regarding breaking invariants among different table rows if the sync call is between the two mutations?

tbfleming commented 5 years ago

Is this because we were concerned about the sync call originating in the middle of a database mutating operation within multi_index?

Yes, or the contract otherwise getting confused because its state is in the middle of changing.

Is this regarding breaking invariants among different table rows if the sync call is between the two mutations?

Yes

arhag commented 5 years ago

In that case it wouldn't be enough to just ban the first contract from being called, as my earlier contrived example shows.

And this goes back to how strong of a protection do we want to provide to the developer from shooting themselves in the foot. Currently, we have chosen a very strong level of protection (though perhaps not the strongest because we allow subsequent calls to the same contract instance which may have been designed with the assumption that their memory would be reset after each call) at the cost of the usability of the feature. Though again, we retain the flexibility to weaken the level of protection in the future (I presume in an opt-in way?) to make the feature more powerful.

tbfleming commented 5 years ago

We went with a full memory reset for each call for this proposal. That, plus the recursion limit, prevents all bad cases we could think of, except this one:

A read-only function in a contract reading the rows of the first contract

b1bart commented 5 years ago

The WASM linear memory of any instantiated accounts persists over the lifetime of an action (and this choice is to enable optimizations in the implementation).

This sounds like a really bad idea. It changes the mental model for the developer from an isolated "read view" of the data to something wholly different. Are we sure we need to push this level of complexity to the developer? This optimization is dictating so many caveats in the "proper usage" of this feature that I question when we have pre-imposed it on the spec rather than implementing to a broader design and restricting once we know the optimizations that this enables bear fruit.

Prior to Dawn 4.0 we had a threaded optimization that hid the start-up and sanitization costs of wasm execution environment. We dropped it for reasons I cannot remember but it was working and could reduce the CPU overhead of isolated contexts for synchronous calls quite a bit. In the future, the presence of concurrent transaction retirement would also allow us to keep cores busy while some of these book keeping tasks are performed.

I believe the simplest spec to understand and implement is the naive one where we fire up isolated contexts. With actual data we can make an informed decision about whether specs need to be materially changed as a result of intractable cpu costs.

arhag commented 5 years ago

We went with a full memory reset for each call for this proposal.

That's news to me. So if we are willing to pay the performance cost of that, why not make read-only sync calls way more usable by removing the reentrancy restrictions. The recursion limit puts a bound on the number of WASM instances that may exist at one time.

tbfleming commented 5 years ago

That's news to me. So if we are willing to pay the performance cost of that, why not make read-only sync calls way more usable by removing the reentrancy restrictions. The recursion limit puts a bound on the number of WASM instances that may exist at one time.

That brings us to:

Prevent the first contract from being called, or
Warn contract authors about consistency issues

arhag commented 5 years ago

The first contract could be called. It would be a separate WASM linear memory than that of the original caller.

I believe the only issue that remains is the one you pointed out which is

A read-only function in a contract reading the rows of the first contract

when it is doing so in the middle of mutations of multiple rows that are supposed to hold some invariant.

And I think that is fine to just warn contract developers about.

tbfleming commented 5 years ago

Also: a read-only function in the first contract reading its own rows

arhag commented 5 years ago

Well "a contract" includes "first contract" 😉. But sure we could get explicit about it for reinforcement.

tbfleming commented 5 years ago

How about: "A read-only function should not attempt to read the rows of the contract which is processing the action, even if the read-only function belongs to the same contract. If it does, then all the authors of all contracts involved should carefully analyze consistency."

tbfleming commented 5 years ago

"Determining which contract is processing the action is left as an exercise to the reader."

I guess we need another intrinsic: get_action_handler?

arhag commented 5 years ago

"A read-only function should not attempt to read the rows of the contract which is processing the action, even if the read-only function belongs to the same contract. If it does, then all the authors of all contracts involved should carefully analyze consistency."

Maybe... but that does put a burden on the implementor of the read-only function to know how to handle the case where the data it needs is held by the account processing the current action.

By the way, would the receiver (the contract running the action which originally made the read-only sync call) even be available in the context of a read-only function execution? Do we still want to call it a receiver and distinguish it from a self (the account hosting the contract actually running the code that the caller selected)? Are these allowed in notification contexts where we then have the further distinction between first_receiver, receiver, and self?

We could track the data that was read and abort the transaction if the data that was read was modified. But that might be too constraining. And also doesn't handle other invariants that, for example, involve particular data not existing.

tbfleming commented 5 years ago

first_receiver, receiver, and self?

There's already a ton of confusion out there about these. :(

arhag commented 5 years ago

This requires special support in the state database, but what if the read-only calls were always in the context of the state at the start of the action?

tbfleming commented 5 years ago

I like that idea

tbfleming commented 5 years ago

That, together with write conflict checking, could also be a solution to read-write sync calls.

tbfleming commented 5 years ago

Scratch read-write; I see it causing a different kind of confusion. It's too far removed from what people tend to think of with synchronous models; e.g. write X then read X would fail.

tbfleming commented 5 years ago

Hmmm. Contract A write X then read X through a read-only function would also violate typical synchronous models.

arhag commented 5 years ago

Hmmm. Contract A write X then read X through a read-only function would also violate typical synchronous models.

Yes, but maybe we put the burden on contracts executing in regular (not read-only) contexts to avoid making read-only sync calls that will ultimately call into themselves. Or if they do, they need to be aware that their typical assumptions about synchronous models may be violated.

arhag commented 5 years ago

Also, people already get confused with inline actions.

tbfleming commented 5 years ago

If we place that burden on them, then we could avoid changing database semantics during read-only calls.

tbfleming commented 5 years ago

Also, people already get confused with inline actions.

I think it's the async nature that throws them off.

arhag commented 5 years ago

If we place that burden on them, then we could avoid changing database semantics during read-only calls.

We could, but by changing the database semantics during read-only calls, there is less of a burden. As long as you maintain invariants after your action executes, you would know that read-only functions would never access tables that are in a state that do not satisfy these invariants.

tbfleming commented 5 years ago

true

tbfleming commented 5 years ago

I guess the question becomes: do we make this proposal dependent on a database change which supports these semantics efficiently?

b1bart commented 5 years ago

Can someone succinctly explain the consistency issue? Are you in the callback for our contract side multi index, you could have mutated the data and then called a sync read only where those changes or something implicit like secondary indices are inconsistent? Or is there something more insidious?

On Fri, Jun 14, 2019, 5:42 PM Todd Fleming notifications@github.com wrote:

A guess the question becomes: do we make this proposal dependent on a database change which supports these semantics efficiently?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/EOSIO/eep-proposal-staging/issues/7?email_source=notifications&email_token=AMD3YDT6RAU7PZUQLKG6VITP2QGCZA5CNFSM4HYMOCKKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXYFH3I#issuecomment-502289389, or mute the thread https://github.com/notifications/unsubscribe-auth/AMD3YDXQ7I7UBXWW3XZUOWLP2QGCZANCNFSM4HYMOCKA .

tbfleming commented 5 years ago

Contract A:

write a row
write a row
read-only call that (eventually) recurses to A
write a row

Assume it takes all 3 writes to move from 1 consistent state to another. The read-only function could be confused by the inconsistent state.

b1bart commented 5 years ago

That seems like a concern at the application layer to me and, one that we haven't really fixed with the given abort conditions anyway. It is still entirely possible to create that inconsistent state and read it from any depending contract (including the initial caller). The only actor in this situation that can guarantee this type of consistency is the application developer and they are also the actor responsible for issuing the synchronous call.

To me this is good alignment and we do not need to assume that since some application developers, who are necessarily creating complex higher level consistency invariants, are unable to correctly process it that it should be removed from the toolbox for all developers.

Most of the use cases will be simple, have simple consistency rules and not need to be concerned with read-only re-entrancy if that re-entrancy is isolated properly.

Thats also a really easy model to wrap your head around. Sure, it may have removed some coy optimizations we may be able to make but i prefer to make it work and make it right before we worry about making it fast.

tbfleming commented 5 years ago

I removed the recursion restriction. I also removed several failure conditions; these can come back during implementation if needed. All cases which would cause aborts (except running out of time) now return a status to the caller instead.

I kept the clearing-memory requirement. multi_index will misbehave without that requirement.

Assume contract B holds a multi_index in a singleton.

Contract A:

Calls B. B reads some of A's rows.
Modifies rows
Calls B. B reads some of A's rows. multi_index will cache the old version of the rows it already read, but retrieve the new version of the rows it hasn't read.

Contract authors have no reason to suspect multi_index behaves this way, so could hit this case even after they took care to not call any read-only functions in the middle of a change.

Clearing memory prevents this case from happening.

b1bart commented 5 years ago

I don't actually see the clearing-memory requirement called out?

Regardless, if we treat each calling frame as a separate isolated context from the point of view of semantics doesn't this handle that concern and leave "how" we create that isolated context to the implementation?

A calls B creating context-1, B reads rows
A modifies rows
A calls B creating context-2, B reads rows correctly

multi-index caching is just one way in which a shared context can create nonsensical interactions. I don't see how we present an understandable set of semantics without the concept of isolated and fully initialized linear memory. Without that, I feel like we have to rename the concept away from "read-only" as it may deceive developers into thinking they don't have to worry about re-entrant interactions.

This also handles recursion:

A calls B creating context-1
- B calls A creating context-2
  - A calls B creating context-3

Changes to any of the linear memory during context-3's execution shouldn't affect execution of context-1 after returning from the sync call.

tbfleming commented 5 years ago

In "Intrinsics and Entry (callee)": "The system initializes the WASM linear memory before calling this entry point. It does this for each call."

tbfleming commented 5 years ago

Regardless, if we treat each calling frame as a separate isolated context from the point of view of semantics doesn't this handle that concern and leave "how" we create that isolated context to the implementation?

Yes

EOSIO / spec-repo