ZcashFoundation / zebra

Zcash - Financial Privacy in Rust 🦓
https://zfnd.org/zebra/
Apache License 2.0

Create and parse checkpoint height/hash lists #564

Closed teor2345 closed 4 years ago

teor2345 commented 4 years ago

For the checkpoint verifier, we need to create checkpoint height and hash lists for:

Since the Sapling checkpoint is mandatory, we want to compile the lists into Zebra.

Eventually, we also want to:

Open questions:

The block spacing range for 256 MB checkpoints is:

Heartwood activated at:

Our initial draft list of checkpoints was created before Heartwood on mainnet, but after Heartwood on testnet. The testnet checkpoints include a post-Heartwood block.

hdevalence commented 4 years ago

Hashes can be obtained from zcashd using zcash-cli getblock $i and using jq or something to parse the JSON. Note that the output will have to be run through zebrad revhex since zcashd displays hashes in reversed byte order and we don't.
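
For illustration, here is a minimal Rust sketch of that byte-order flip (this is the idea behind `zebrad revhex`, not its actual code): since the displayed hash is hex, reversing the byte order is just reversing the sequence of two-hex-digit pairs. The example input is the mainnet genesis hash as zcashd displays it.

```rust
// Minimal sketch of the byte-order flip that `zebrad revhex` performs
// (illustrative only, not Zebra's actual implementation).
fn rev_hex(displayed: &str) -> String {
    displayed
        .as_bytes()
        .chunks(2) // one chunk per byte: two hex digits
        .rev() // reverse the byte order
        .map(|pair| std::str::from_utf8(pair).expect("valid hex input"))
        .collect()
}

fn main() {
    // A zcashd-displayed hash goes in, the internal byte order comes out.
    let displayed = "00040fe8ec8471911baa1db1266ea15dd06b4a8a5c453883c000b031973dce08";
    println!("{}", rev_hex(displayed));
}
```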

teor2345 commented 4 years ago

I'll do a small script, so we can rebuild the list consistently.

@hdevalence do we have a preferred scripting language?

hdevalence commented 4 years ago

No, I don't think so, we've never had to write scripts before.

yaahc commented 4 years ago

My personal rule of thumb is bash for things that are close enough to one-liners that I don't worry about maintainability, and Rust for all other cases.

A cool thing cargo does is automatically create a binary target for each file in src/bin. So you could, for example, do a cargo new --lib zebra-utils (or whatever you want to call the crate that holds all the "scripts"), then create a bin dir in it and add a file for each new binary. This is what I spent most of my last two years with Rust doing, so if you need recommendations for libraries to use when writing small Rust CLI scripts, let me know.
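
As a hypothetical illustration of that layout, any file placed under zebra-utils/src/bin/ becomes its own binary, for example:

```rust
// Hypothetical zebra-utils/src/bin/zebra-checkpoints.rs: cargo builds every
// file under src/bin/ as a separate binary target, so this would be run with
// `cargo run --bin zebra-checkpoints`.
fn main() {
    // Placeholder for the checkpoint list generator discussed in this issue.
    println!("zebra-checkpoints: not implemented yet");
}
```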

teor2345 commented 4 years ago

I think the script will look something like this:

```sh
i=0
max_height=$(( $(zcash-cli getblockcount) - 300 ))
while [ "$i" -lt "$max_height" ]; do
    zcash-cli getblock "$i" | jq ... | zebrad revhex >> checkpoints.txt
    i=$(( i + 500 ))
done
```

Whatever we choose, it will become a release-time dependency. I'll make an initial script using generic unix shell. Then we can convert it to Rust later, if we decide that's important.

Also, maybe it would be faster to use getblockhashes (https://zcash-rpc.github.io/getblockhashes.html) and discard the hashes we don't want. I'll do some testing.

teor2345 commented 4 years ago

I've been thinking about this, and calling the RPC multiple times in the script seems slow and error-prone. We're going to have to parse JSON, call jq and zebrad revhex, and generate the checkpoint heights.

I'd rather do all that work in Rust, using serde-json: zcash-cli getblockhashes INT_MAX 0 | zebrad checkpoints

We'll need a zebrad checkpoints --testnet option, to generate the testnet list.

Once that's done, we could even add modes for generating our own checkpoints, or listing our hard-coded checkpoints:

(But I'll just do the zcashd mode for now.)
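
To make the piped mode concrete, here is a rough sketch of what `zebrad checkpoints` could look like. It assumes getblockhashes returns a JSON array of hash strings in height order (which still needs checking), and it leaves out the byte-order reversal and the testnet handling discussed elsewhere in this thread:

```rust
// Hypothetical sketch of `zebrad checkpoints` reading the output of
// `zcash-cli getblockhashes INT_MAX 0` from stdin. Not the real subcommand.
use std::io::Read;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut input = String::new();
    std::io::stdin().read_to_string(&mut input)?;

    // Assumption: the RPC output is a JSON array of hex hash strings,
    // ordered by height. serde_json does the parsing.
    let hashes: Vec<String> = serde_json::from_str(&input)?;

    // Emit a candidate checkpoint every 500 blocks; the hashes would still
    // need to be byte-reversed into Zebra's internal order.
    for (height, hash) in hashes.iter().enumerate().step_by(500) {
        println!("{} {}", height, hash);
    }

    Ok(())
}
```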

yaahc commented 4 years ago

could we have zebrad checkpoints try to call zcash-cli internally rather than having to pipe the output into zebrad?

teor2345 commented 4 years ago

Maybe?

For the testnet, the command-line would have to be zcash-cli -testnet getblockhashes INT_MAX 0 | zebrad checkpoints. And there's no guarantee that we'd get the RPC authentication options right, so we'd have to allow the pipe anyway.

yaahc commented 4 years ago

For the testnet, the command-line would have to be zcash-cli -testnet getblockhashes INT_MAX 0 | zebrad checkpoints

we could expose a --testnet arg that adds the -testnet arg to the internal zcash-cli invocation we call

And there's no guarantee that we'd get the RPC authentication options right

I don't know what this means

teor2345 commented 4 years ago

For the testnet, the command-line would have to be zcash-cli -testnet getblockhashes INT_MAX 0 | zebrad checkpoints

we could expose a --testnet arg that adds the -testnet arg to the internal zcash-cli invocation we call

You're right, we'll definitely need a --testnet option, to supply the right hard-coded genesis block hash.

And there's no guarantee that we'd get the RPC authentication options right

I don't know what this means

For example, the data directory, and RPC IP, port, username and password. (None of these options are required, but we can't guarantee the user is using the defaults.)

https://helpmanual.io/man1/zcash-cli/

yaahc commented 4 years ago

For example, the data directory, and RPC IP, port, username and password. (None of these options are required, but we can't guarantee the user is using the defaults.)

can we just re-export all of these knobs as cli flags like we do with -testnet?

teor2345 commented 4 years ago

We could pass through all unrecognised arguments.

teor2345 commented 4 years ago

There are only 10 options, so it should be easy to list them all. (Hopefully they don't add any extra options.)

yaahc commented 4 years ago

I tried deleting the comment before you got to it, but the more I thought about it, the more I realized it should be pretty easy to just forward the args like you said, and it has the advantage of dealing with the extra options. I'll have to check out gumdrop in more detail, but I know exactly how to do this with structopt, so worst case we could just manually re-parse the args ourselves via structopt in that specific subcommand.
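
For illustration, a minimal structopt sketch of the forwarding idea (hypothetical names, not a worked-out zebrad subcommand) could parse the options we care about and pass everything after `--` straight through to zcash-cli:

```rust
use std::process::Command;
use structopt::StructOpt;

/// Hypothetical `zebrad checkpoints` options.
#[derive(StructOpt)]
struct Checkpoints {
    /// Use the testnet genesis hash and pass -testnet to zcash-cli.
    #[structopt(long)]
    testnet: bool,

    /// Everything after `--` is forwarded verbatim to zcash-cli, e.g.
    /// `zebrad checkpoints -- -rpcuser=user -rpcpassword=pass`.
    zcash_cli_args: Vec<String>,
}

fn main() {
    let opts = Checkpoints::from_args();

    let mut cmd = Command::new("zcash-cli");
    if opts.testnet {
        cmd.arg("-testnet");
    }
    // Forward the user's extra flags (data dir, RPC host/port, credentials, ...).
    cmd.args(&opts.zcash_cli_args);
    cmd.arg("getblockhashes").arg("2147483647").arg("0");

    let output = cmd.output().expect("failed to run zcash-cli");
    // ... parse `output.stdout` as JSON and print the checkpoint list ...
    println!("{}", String::from_utf8_lossy(&output.stdout));
}
```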

teor2345 commented 4 years ago

We'll want to parse and forward -testnet and -regtest, so we can select the right genesis block. And just forward everything else.

hdevalence commented 4 years ago

Hmm, I'm not sure that adding a new command to zebrad is preferable to having a standalone script, because it means that the checkpoint generation is shipped as part of the main functionality of the node software, rather than being a step used to create an artifact that's part of the build. I don't see us needing to generate checkpoint lists very often -- I think that it's enough to keep a file in git and periodically run a script to regenerate it and check in the changes.

hdevalence commented 4 years ago

Another thought (branching off of here) is that since checkpoints are generated retrospectively, not prospectively, we can tune the checkpoint heights according to the block sizes.

Since we know the sizes of all of the blocks we're checkpointing, we can estimate how much memory will be used for queued blocks in the honest case (assuming that the in-memory parsed representation has the same size as the serialized representation, which isn't exactly true but is true within a small constant factor). The worst-case analysis of memory usage assumes that we queue a whole bunch of dishonestly generated blocks that are too big. I don't think this is important to defend against, because (a) we already have mechanisms that make it unlikely to chase a false chain and (b) the failure is a transient OOM error, from which we can just restart and talk to other peers. (If we are unable to talk to honest peers, we have bigger problems).

We might want to target a lower number rather than a higher one, say, not accumulating more than 256MB of block data at a time. Since most Zcash blocks are smaller than the maximum block size, I don't think this will be super constraining in practice.

teor2345 commented 4 years ago

Hmm, I'm not sure that adding a new command to zebrad is preferable to having a standalone script, because it means that the checkpoint generation is shipped as part of the main functionality of the node software, rather than being a step used to create an artifact that's part of the build.

I think you're right, let's use a script for this release. If the script gets too complex, we can rewrite it in another language. (And if that language is Rust, we can do that in a separate repository that depends on zebra-chain and maybe zebra-consensus.)

I don't see us needing to generate checkpoint lists very often -- I think that it's enough to keep a file in git and periodically run a script to regenerate it and check in the changes.

In Tor we refreshed a similar list (long-lived node addresses) every 6-12 months.

It was a python script :-)

teor2345 commented 4 years ago

Another thought (branching off of here) is that since checkpoints are generated retrospectively, not prospectively, we can tune the checkpoint heights according to the block sizes.

Since we know the sizes of all of the blocks we're checkpointing, we can estimate how much memory will be used for queued blocks in the honest case (assuming that the in-memory parsed representation has the same size as the serialized representation, which isn't exactly true but is true within a small constant factor). …

We might want to target a lower number rather than a higher one, say, not accumulating more than 256MB of block data at a time. Since most Zcash blocks are smaller than the maximum block size, I don't think this will be super constraining in practice.

I'll try to implement this design in the checkpoint script.

The worst-case analysis of memory usage assumes that we queue a whole bunch of dishonestly generated blocks that are too big. I don't think this is important to defend against, because (a) we already have mechanisms that make it unlikely to chase a false chain and (b) the failure is a transient OOM error, from which we can just restart and talk to other peers. (If we are unable to talk to honest peers, we have bigger problems).

My goal was to make quick changes that defended against some worst case attacks, and honest-case overload behaviour. But it looks like that's more complicated than I expected, so I won't be able to do it quickly. And I should move on to more important tasks.

In general, I agree with your conclusions, so I'm going to avoid spending time on the worst case, and update the design doc to explain why the worst case is out of scope.

But I think we're talking about slightly different worst-case attacks, so I'd like to check your analysis of the following attack:

A single bad peer answers any block hash request with a large block that has a height near the end of the checkpoint list. We store those blocks, but can't verify them yet, so we run out of memory.

To defend against this attack, we need to avoid downloading a large number of blocks (or a large amount of data) from a single peer. For example, if our memory limit is 2 GB and blocks can be up to 2 MB, we need to avoid downloading more than 1000 blocks from a bad peer.

There are 900,000 blocks in the chain, and only a few hundred peers. So if we choose peers uniformly at random, we're almost certain to download a few thousand blocks from each peer.

As an alternative, we need to reject blocks that are a long way ahead of the current checkpoint. Or reject blocks that are larger than expected.

There are also operating systems that hang or terminate other processes when out of memory (macOS default config, some Linux configs, maybe Windows). I don't think that's a big issue, but it should definitely be documented somewhere.

teor2345 commented 4 years ago

The average block size is around 30 GB / 900,000 ≈ 35 kB.

So if we limit our checkpoints to 256 MB in the honest case, most checkpoints will be 5,000-10,000 blocks apart: 256 MB / 35 kB ≈ 7,300 blocks.

So we might want to pick a maximum interval size. And then use smaller checkpoints as needed. I'll check when I write the script.

We could also find the maximum block size in the historic chain, and refuse larger blocks while checkpointing. We could efficiently reject blocks by recording the block size at parse time. (And if there are a few large blocks in the chain, we could record the largest block in each checkpoint range.) I'll add this to the possible designs.
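
A rough sketch of that spacing rule (hypothetical names and limits, assuming we already have the height, hash, and serialized size of every block): accumulate block sizes in height order, and cut a new checkpoint whenever the queued data would pass the byte budget or the maximum gap.

```rust
/// Illustrative limits, not Zebra's actual values.
const MAX_CHECKPOINT_BYTES: u64 = 256 * 1024 * 1024;
const MAX_CHECKPOINT_GAP: u32 = 10_000;

/// `blocks` holds (height, hash, serialized size in bytes), sorted by height.
fn choose_checkpoints(blocks: &[(u32, String, u64)]) -> Vec<(u32, String)> {
    let mut checkpoints = Vec::new();
    let mut queued_bytes: u64 = 0;
    let mut last_height: u32 = 0;

    for (height, hash, size) in blocks {
        queued_bytes += *size;
        // Cut a checkpoint when the honest-case queue would pass 256 MB,
        // or when the gap since the last checkpoint gets too large.
        if queued_bytes > MAX_CHECKPOINT_BYTES || height - last_height >= MAX_CHECKPOINT_GAP {
            checkpoints.push((*height, hash.clone()));
            queued_bytes = 0;
            last_height = *height;
        }
    }

    checkpoints
}

fn main() {
    // Tiny worked example: three fake 100 MB blocks force a checkpoint
    // once the 256 MB budget is exceeded.
    let blocks = vec![
        (1, "hash1".to_string(), 100 * 1024 * 1024),
        (2, "hash2".to_string(), 100 * 1024 * 1024),
        (3, "hash3".to_string(), 100 * 1024 * 1024),
    ];
    println!("{:?}", choose_checkpoints(&blocks));
}
```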

yaahc commented 4 years ago

FWIW @teor2345, I already added a zebra-utils crate to the workspace, and a zebra-checkpoints.rs binary file to use to write this in Rust.

hdevalence commented 4 years ago

But I think we're talking about slightly different worst-case attacks, so I'd like to check your analysis of the following attack:

A single bad peer answers any block hash request with a large block, that has a height near the end of the checkpoint list. We store those blocks, but can't verify them yet, so we run out of memory.

If I understand the attack correctly, this shouldn't be possible, because the network layer ensures that a request for a block with a particular hash resolves either to a block with that hash or to an error.

This actually happens not for a security reason but for a practical one. Because Bitcoin doesn't actually have a request/response concept, the network layer has to statefully examine the messages that come in to determine whether they could be a response to the pending request, and to determine when enough messages have been received to "finish" the response. In the case of a block request, the Handler maintains a HashSet of the still-outstanding block hashes and a Vec<Arc<Block>> of the already-received blocks. Each time a block message is received, the Handler computes the block hash and either removes the hash or discards the block. This allows it to determine when it has received all of the blocks required to complete the request. Discarding blocks with the wrong hash is necessary because zcashd will sometimes respond with the wrong block. I don't actually know why this happens. Maybe zcashd gets confused sometimes.

But the net of all of this is that a bad peer that answers any block hash request with an unrelated large block will cause the block request to time out, which causes us to fail the connection and stop talking to that peer.
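
A simplified sketch of that bookkeeping, with placeholder types standing in for zebra-chain's block and hash types (this is not the actual zebra-network Handler code):

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Placeholder types standing in for zebra-chain's Block and its hash.
struct Block;
#[derive(PartialEq, Eq, Hash, Clone, Copy)]
struct BlockHeaderHash([u8; 32]);

impl Block {
    fn hash(&self) -> BlockHeaderHash {
        BlockHeaderHash([0; 32]) // placeholder; a real block hashes its header
    }
}

/// Simplified state for an in-flight blocks-by-hash request.
struct BlocksByHashHandler {
    pending_hashes: HashSet<BlockHeaderHash>,
    blocks: Vec<Arc<Block>>,
}

impl BlocksByHashHandler {
    /// Handle one incoming block message: keep the block only if its hash
    /// was actually requested, otherwise discard it.
    fn process_block(&mut self, block: Arc<Block>) {
        if self.pending_hashes.remove(&block.hash()) {
            self.blocks.push(block);
        }
        // An unrelated (possibly huge) block is never queued, so the worst a
        // bad peer can do is let the request time out and fail the connection.
    }

    /// The request is complete once every requested hash has been matched.
    fn is_complete(&self) -> bool {
        self.pending_hashes.is_empty()
    }
}

fn main() {
    let requested = BlockHeaderHash([1; 32]);
    let mut handler = BlocksByHashHandler {
        pending_hashes: [requested].into_iter().collect(),
        blocks: Vec::new(),
    };
    // A block whose hash wasn't requested is discarded...
    handler.process_block(Arc::new(Block));
    // ...so the request is still incomplete.
    assert!(!handler.is_complete());
}
```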

To defend against this attack, we need to avoid downloading a large number of blocks (or a large amount of data) from a single peer. For example, if our memory limit is 2 GB, we need to avoid downloading more than 1000 blocks from a bad peer.

There are 900,000 blocks in the chain, and only a few hundred peers. So if we choose peers uniformly at random, we're almost certain to download a few thousand blocks from each peer.

Although we're certain to download a few thousand blocks from each peer, we're certain to do so only over the course of the entire sync process -- we'd never ask a single peer for more than a small chunk of contiguous blocks (I think this is currently 10; the exact number doesn't matter so much as the fact that it's quite small). And because we know that any request for specific block hashes either resolves to those blocks or to an error, each chunk either succeeds, in which case we have the blocks we want, or fails, in which case we fail that connection and retry (a limited number of times) with a different peer.

So I don't think that there's a possibility for a single peer to significantly interrupt the checkpointing procedure.

teor2345 commented 4 years ago

Thanks, that's a much stronger security guarantee than I expected. I'll make sure I document that in #636.

(And close the PRs that are obsoleted by the new design.)