Ag Solo wallet recovery plan

warner commented 3 years ago

What is the Problem Being Solved?

What is the minimum amount of information necessary to back up the state of an Agoric wallet / user agent? How frequently do you have to update your copy of that data? What will be the process to convert that backup data into a new fully-functioning user agent?

In the Bitcoin world, knowing the BIP32 "hierarchical deterministic wallet" passphrase is sufficient to generate all the private signing keys you've ever used to sign transactions. From the corresponding public keys, you can search the UXTO set for any BTC that you control, and thus both calculate your balance and regain the ability to spend it. The 12-ish word passphrase is the only piece of data you need to retain, and it doesn't ever change. (before BIP32, when bitcoin clients randomly generated new keypairs on a regular basis, people lost a lot of BTC despite having backups, because their stale backup did not include newer keys, and did not have enough information to derive them).

In the Ethereum world, the same passphrase lets you get to one (or more, but usually just one) account. This account has some amount of ETH in it. That signing key may also authorize you to control some number of secondary (ERC20) tokens: there is no direct way to enumerate all the tokens your key might control, but it's trivial to query any particular (token, account) pair for a balance, so with your passphase and merely a vague memory of owning some DAI, you can quickly find out exactly what you have. There are a lot of ERC20 tokens, but not so many that it would be difficult to scan all of them for a potential balance. There are more complex contracts which don't fit within the ERC20 framework, but in general they don't expect a lot of client-side state, so with the private key and the ability to query the chain, you can probably reconstruct everything from a backup pretty easily.

For an arbitrary ag-solo machine, various vats within the ag-solo will have access to various objects in vats on one or more chains. The ag-solo has a private key, which it uses to sign messages sent to those other on-chain swingsets, and those messages are translated through a c-list that is specific to that key, and that c-list grants access to objects which represent Purses (and promises and invitations and all sorts of other complex logic).

If we lost the state of the ag-solo, but retained the private key, it would still be pretty messy and difficult to try and rebuild the ag-solo. We could theoretically scan the chain-side kernel state vector for our c-list, which would give us a list of indices that we have control over. We might even be able to parse the kernel tables and the other vat's c-lists to figure out which of these correspond to Purses from some small number of known Issuers. But this is exactly as much fun as recovering your word processor document by grepping the raw disk sectors after overwriting the index tables.

cc @michaelfig @dtribble @rowgraus

Description of the Design

I'm not at all sure what this will look like, but here are a few ideas we (@dtribble @rowgraus) kicked around today:

define an "agoric account" as the thing you create during initial provisioning, and which contains all the things you expect to keep around long-term, like a list of Purses that make up your wallet, and maybe a public deposit facet, etc
- currently we use the local-side home object to hold a lot of things, and home.wallet holds your purses
use an on-chain vat to hold all of the account contents
- this might be as simple as moving home on-chain, or maybe just home.wallet
your provisioned client gets access to that on-chain object. any time you add a durable object to your stash of account stuff, it needs to be shared with the on-chain backup as well
the on-chain stash can be configured with one or more public keys that are authorized to recover the contents
- you create these ahead of time, maybe automatically when creating an account
- the passphrase for this key is the long-term backup secret
the provisioning process either creates a new stash, or recovers access to an existing stash
some low-level mechanism is used to have the recovery key sign a message that cites the new client being provisioned in recovery mode
- this might be a special cosmos-sdk message, a variant of our current provisioning flow, which is signed by the recovery key but names the new client key

If you use the recovery flow, some ag-solo wallet-creation mode asks for the recovery passphrase, and uses it to regain access to the stash. You wind up with an entirely new user agent, nobody uses the old agent's privkey again, but the new agent gets the same home.wallet or whatever as the old one had.

@dtribble suggests that using the recovery flow should maybe invalidate the old agent. To implement that, we'd want to delete the old c-list and maybe transfer any low-level Bank module tokens to somewhere else. We'd need to pay attention to the provisioning flow to make sure the old key was fully decommissioned and powerless (the mere existence of a pubkey in a Bank module table might be enough to give it some power).

On the other hand, we might treat the old agent and the new agent as peers, and use this mechanism as the way to establish multiple coordinated user agents (#2628). In this approach, the backup privkey is the biggest authority, and each user agent (including the first) is a peer: distinguishable but equal authority. This isn't the perfect approach (asking humans to manage passwords and use them for non-recovery cases is not very kind, and leads to weak passwords), but most online services work this way: you install the mobile app and then type your main password into it.

A better approach is to use an authority grant/transfer mechanism from one live agent to another, and only use the recorded password/passphrase for recovery when all agents have failed simultaneously. I like Keybase's approach to this: any agent can be used to empower a new agent, you get a nice graph of which agent approved which, and paper keys are just another kind of agent. The agent-to-agent grant process involves QR codes instead of asking a human to type a long string. Other apps use short human-transcribed codes which can be secure if they use PAKE or SAS (I think I've seen this in Chromecast approvals, and some macOS/iOS/AppleTV account management tools).

Security Considerations

This is all about security considerations. In my experience, there are two phases to building a secure+usable platform that respects ocap principles. In the first phase, you boil down all the access control to a single private key. In Tahoe-LAFS we called this the "rootcap": a directory write-cap string which had pointers to all the rest of your stuff. No matter what you added/removed/changed, it was all reachable from this rootcap, so as long as you retained (or could recover) access to the rootcap, you could reach the rest of your files. And the normal ocap pattern ensured that there was no other (automatic) pathway, so we could teach users "knowledge of the cap is both necessary and sufficient to get your data". This improves the confidentiality/integrity, but threatens the availability, because everything depends on remembering that one root string.

Then, in the second phase, you build additional places to store the rootcap, that meet your desired confidentiality-vs-availability tradeoffs. You could print out a copy and keep it in a safe, if you're into safes. You could give a copy to your friend, if you trust your friend with that authority. You could secret-share it among three friends that don't know each other, you could record it with some centralized service whose account-management tooling you're comfortable with, etc. The point is that the second phase is all about policy, and you have lots of options to choose from, and there's nothing the system has already done that will reduce the security.

If you don't go through the first phase up front, then the story of what it takes to get to your data is hard to explain, and probably involves a bunch of loopholes. If you stop after the first phase and don't build usable tools, you're left with the terrifying state of cryptocurrencies from a few years back, where the only option was to hang on to a BIP32 passphrase for dear life: it's really unlikely that anyone else will steal your coins, but it's awfully easy for you to lose the key and lose your own access forever. These days we're slowly figuring out usable tools to do better than that, but I think it's still early and we have a lot more to go.

Swingset Backup/Recovery

For our users, I think the most valuable approach will be to make the account recoverable, not a specific SwingSet/ag-solo instance. However, we should consider what a "swingset backup" would look like, and what it would meant to recover from one. This approach would not require on-chain support code, and would recover the exact vats that were backed up, instead of replacing the vats with new ones (that happen to share some authority with the old ones).

If we went this way, we might aim to back up the entire kernel state: at the end of a "block" (a commit point, after which outbound messages are released from hangover-embargo), we compress and encrypt the entire kernel DB, and upload it to some online datastore, replacing the previous version. The users writes down the location of this store and the decryption key (the equivalent of a Tahoe-LAFS "writecap", more or less). If the swingset instance is lost, the user runs a recovery process (on a new machine or disk), which fetches the old kernel state and reloads from it, instead of a blank kernel. The new swingset would use the same private key to communicate with other nodes. From their point of view, the recovered kernel is identical to its predecessor.

It would be important to prevent two instances from running with the same state (and especially the same private key). The two instances would quickly diverge, which looks like equivocation from the outside, and would collapse horribly.

It would also be important to make sure the state is checkpointed before allowing outbound message to be emitted, which is a performance cost. We perform this checkpoint all the time, but only to the local disk. Adding "is committed to the remote store" to the definition of "safely checkpointed" will increase latency significantly. We can reduce the total data stored somewhat by uploading deltas: a periodic full kernel DB snapshot, plus a list of the messages added to the run queue since the last upload (and probably something to handle device state that isn't captured in the kernel DB). Then we can replay the kernel from an intermediate state, not unlike how we replay vats from a recent XS heap snapshot plus a truncated transcript. This won't remove the roundtrip latency needed for each checkpoint, but it might reduce the data transfer overhead to some minimum.

dckc commented 2 years ago

@erights the idea of support for upgrade limited to transferrable rights... the one that had a strong analogy between reloading a web page and restoring from sturdy refs... I wonder if that suggests a strategy for wallet recovery. Thoughts?

michaelfig commented 2 years ago

Some salient points from a discussion that @warner and I had:

We should probably target full SwingSet state backup and restore, as more fine-grained approaches seem to be more subtle engineering work that may not be as effective as a complete backup/restore.
We can start with integrated ag-solo backups, but the milestones along that road happen to be in the same direction as taking validator state snapshots and restoring another validator from Cosmos state-sync.
We can divide the phases into:
- Replace the swing-store with a version that records single-block deltas, i.e. writes a (compressed?) delta file on every swing-store.commit(), encrypts and signs the delta+reference to prior delta, and uploads that delta to a public service (IPFS?) and returns an "address" at which to find the contents of this new head of the backup chain of delta files starting with the current one and ending with the latest db-snapshot, and begins uploading that snapshot in the background
- Save the backup head "address" to a reliable, possibly replicated small-value storage service, maybe a non-SwingSet Agoric chain service
- When it is time to write a full SwingSet state snapshot after a commit, pause the input to the kernel, and call a helper program that quick-compresses all the SwingSet state files into a new db-snapshot directory, including a mention of the delta that this snapshot will replace
- Resume the input to the kernel, and begin uploading the db-snapshot in the background
- When the db-snapshot has been uploaded, commit its address to the small-value service and remove the local obsolete deltas and old snapshot
The cost of committing a delta to a remote store can be absorbed in the 5 (or more) seconds of leeway we have before we miss the next chain block.
The hot-wallet ag-solo has its own seed phrase (which will be needed to recover it) and we will use a different HD path for the ag-solo's backup signing and encryption keys. If integration with an on-chain service is necessary, a cold-wallet Ledger could be supported through the feegrant module so that the ag-solo can submit chain transactions.
If there is lag between when a recovery is made and when the latest backup head is published, we will need to fail gracefully. We could publish a timestamp on the small-value store to give an indication to the user how recent their published backups have caught up.
Agoric doesn't want to be in the business of storing large data, but running an on-chain service that just points off-chain may be acceptable.

dckc commented 2 years ago

Moving wallet client state on chain is an option that emerged this week. @michaelfig took the ball to propose a way to slice the wallet differently. I expect you'll make a separate issue for that, but until then, I'll assign this to you.

excerpt from meeting notes:

Wallet - moving more of the client on-chain

Why?

Avoid users having to manage client state (“I bricked my machine” #2629)
Making an electron app (cf. exo) is risky
Asking users to install an electron app is burdensome (though not beyond market norms)
reduce round-trips (#3802 )

ledger integration

@warner to research JSON blob details etc. what can be signed? Hash? JSON blob?

dckc commented 2 years ago

I'm pretty sure this is obsolete in favor of on-chain wallet plans (#3995).

Agoric / agoric-sdk