cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.05k stars 3.8k forks source link

storage: O(num-sstables) bottleneck in encryption at rest #55967

Closed petermattis closed 3 years ago

petermattis commented 3 years ago

The Engine.GetEnvStats method retrieves stats about an engine's environment. Concretely, this means retrieving encryption-at-rest statistics.

In order to compute the number of bytes in files encrypted by the active key, Pebble.GetEnvStats has to loop over all of the sstables:

    sstSizes := make(map[pebble.FileNum]uint64)
    for _, ssts := range p.db.SSTables() {
        for _, sst := range ssts {
            sstSizes[sst.FileNum] = sst.Size
        }
    }

    for filePath, entry := range fr.Files {
        keyID, err := p.statsHandler.GetKeyIDFromSettings(entry.EncryptionSettings)
        if err != nil {
            return nil, err
        }
        if len(keyID) == 0 {
            keyID = "plain"
        }
        if keyID != activeKeyID {
            continue
        }
        stats.ActiveKeyFiles++

        filename := p.fs.PathBase(filePath)
        numStr := strings.TrimSuffix(filename, ".sst")
        if len(numStr) == len(filename) {
            continue // not a sstable
        }
        u, err := strconv.ParseUint(numStr, 10, 64)
        if err != nil {
            return nil, errors.Wrapf(err, "parsing filename %q", errors.Safe(filename))
        }
        stats.ActiveKeyBytes += sstSizes[pebble.FileNum(u)]
    }

We've pushed to get rid of other similar O(num-sstable) computations recently. This one isn't horrible because GetEnvStats is called by Store.ComputeMetrics (every 10s) and by statusServer.Stores (an RPC endpoint). Still, it would be desirable to get rid of this one. It seems feasible for the file registry to keep track of the number of files encrypted with the active key and their cumulative size.

I suspect there are some other O(num-sstable) bottlenecks in the encryption-at-rest code. My recollection is that we're rewriting the entire manifest whenever a new file is added.

Cc @jbowens

Epic: CRDB-2564

gz#8996

jbowens commented 3 years ago

Reiterating what @sumeerbhola said during standup, we are rewriting the entire file registry whenever adding, renaming or removing a file.

Right now the registry file is a single proto:

type FileRegistry struct {
    // version is currently always Base.
    Version RegistryVersion `protobuf:"varint,1,opt,name=version,proto3,enum=cockroach.storage.enginepb.RegistryVersion" json:"version,omitempty"`
    // Map of filename -> FileEntry.
    // Filename is relative to the rocksdb dir if the file is inside it.
    // Otherwise it is an absolute path.
    // TODO(mberhault): figure out if we need anything special for Windows.
    Files map[string]*FileEntry `protobuf:"bytes,2,rep,name=files,proto3" json:"files,omitempty" protobuf_key:"bytes,1,opt,name=key,proto3" protobuf_val:"bytes,2,opt,name=value,proto3"`
}

We could add a new proto FileRegistryEdit that describes a delta to the registry and add a new version number to the RegistryVersion. When loading the file registry from disk, we could read the FileRegistry proto and if the version number is the new one, read FileRegistryEdit proto messages until EOF.

petermattis commented 3 years ago

Protobufs are not size delimited, so the only way we know the bounds of the encoded FileRegistry is by looking at the file size. I'm thumbs-up on going down the path of some sort of edit-based FILE_REGISTRY, though we may want to use the Pebble's RecordWriter for doing this.

Alternately, I wonder if we could encode the file registry info inside the filename. The important bits are the EncryptionSettings, which are composed of key_id, nonce, and counter. My thought is to replace <filenum>.{sst,log} with <filename>.<encryption-info>.{sst.log}. We know the encryption settings when a file is created, and we extract them back out when we list files via FS.List. There is a challenge for operations like Open and Stat, but I think if encryptedFS maintained a map from Pebble file name to encrypted file name we could make that all work.

sumeerbhola commented 3 years ago

I'm thumbs-up on going down the path of some sort of edit-based FILE_REGISTRY, though we may want to use the Pebble's RecordWriter for doing this.

I agree, we should the record.{Reader,Writer} in Pebble instead of rolling something new. We also get per record checksumming. We would be exposing something that sits in the internal directory, but it is not something that is likely to change. Alternatively, we could move the encryptedFS into Pebble, or move most except for the StoreKeyManager implementation. The StoreKeyManager is the integration point with the environment, currently expecting key files, which I am assuming is not well-liked by users, and I don't think Pebble should get into the business of this integration. With the right interface tweaks we could support multiple caller-provided implementations of a StoreKeyManager, and not have to maintain an enum like EncryptionKeySource inside Pebble.

petermattis commented 3 years ago

We would be exposing something that sits in the internal directory, but it is not something that is likely to change.

This is ok by me. The intention behind keeping implementations details private in the /internal directory was to avoid needing to support them before we knew we needed to support them. I think the need to use the record reader and writer externally for encryption-at-rest is a sufficient motivation to export them.

Alternatively, we could move the encryptedFS into Pebble, or move most except for the StoreKeyManager implementation.

In order to do this, we'd either have to license part of Pebble under the CCL or move the encryptedFS code to be licensed under Apache.

jbowens commented 3 years ago

If we tackle this in the 21.2 release, maybe we can also pick up #59771.

jbowens commented 3 years ago

Related to fixing #63847

petermattis commented 3 years ago

This is an alternative idea for how to avoid storing encryption metadata in a separate manifest. I haven't fully thought this through, but I haven't seen any serious obstacles from what I've looked at so far.

I think we can store encryption metadata in the regular Pebble manifest. The challenge with doing this is passing the metadata in and out of the manifest. I think we can achieve that by adding either a new interface to pebble.Options, or by having an optionally implemented interface on vfs.FS. The interface would look something like:

type MetadataFS interface {
  GetMetadata(name string) (string, error)
  RestoreMetadata(name, metadata string)
}

When a file is added to Pebble's manifest we'd call through to MetadataFS.GetMetadata to retrieve the metadata for the file and store it inside the manifest entry. The manifest on-disk structure allows for extensions like this. The metadata would be an encoded enginepb.FileEntry (i.e. exactly the per-file metadata we have today).

When the manifest is replayed at startup for each file found we'd call MetadataFS.RestoreMetadata which would allow encryptedFS to update its internal state.

I can't recall offhand if the MANIFEST itself is encrypted. I think it is, but I suspect we don't need to encrypt it. Ditto for the OPTIONS file. encryptedFS could treat any file it doesn't have metadata for as unencrypted.

Migration from the existing setup would occur automagically as the MANIFEST is rewritten at startup, though we'd actually want finer-grained control of this so the migration doesn't occur until after the version upgrade is locked in.

jbowens commented 3 years ago

I like the idea of using the MANIFEST instead of a separate registry. They serve similar purposes. Combining them reduces the writes and fsyncs, which is good for performance and reasoning about failure scenarios.

I can't recall offhand if the MANIFEST itself is encrypted. I think it is, but I suspect we don't need to encrypt it.

Is it okay to leave the sstable boundary user keys unencrypted? Seems undesirable. It seems like it shouldn't be too burdensome to encrypt it though.

Sideloaded sstables might complicate things, since they're not encoded in the MANIFEST until they're ingested.

Are temporary engines another minor complication? I assume right now they use the same vfs.FS and file registry as the store. If we need to encrypt the MANIFEST, I guess we'll need to just account for multiple encrypted MANIFESTs and use separate encryptedFSs.

petermattis commented 3 years ago

That's a good point about the sstable boundary keys. The MANIFEST definitely needs to be encrypted. One way of handling that is to store the encryption metadata for the MANIFEST in a separate file. Perhaps MANIFEST-xyz.metadata.

In addition to the conceptual cleanliness of keeping the encryption metadata in Pebble's MANIFEST, there is also a performance advantage: doing so minimizes the number of fsync calls. If the encryption metadata is in a separate file, even a record IO file like the MANIFEST, then we have to fsync both the MANIFEST and the encryption MANIFEST.

Are temporary engines another minor complication? I assume right now they use the same vfs.FS and file registry as the store. If we need to encrypt the MANIFEST, I guess we'll need to just account for multiple encrypted MANIFESTs and use separate encryptedFSs.

Yes, temporary engines are another minor complication. I don't know what to do about them. Perhaps we can figure out a way to have the temporary engines create an empty Pebble DB and to have some sort of "external" file facility.

Another minor complication is ingested files. Those files need to be encrypted when written and then that encryption metadata needs to be transferred into Pebble. The above mentioned "external" file facility could possibly be used here as well.

sumeerbhola commented 3 years ago

there is also a performance advantage: doing so minimizes the number of fsync calls

Aren't WAL writes the primary source of fsyncs and not file creation?

btw, we don't currently re-encrypt existing files after key rotation. This is a weakness that we may need to fix, by doing background work to slowly decrypt/re-encrypt. There are some difficulties in re-integrating the newly encrypted file, both into in-memory state and the persistent state. Using file rename seems simple, and handles in-memory state, but we can't atomically change the persistent registry (whether it is separate or in the MANIFEST), which records the key id etc., and rename the file. It may be worth thinking through this to see if it suggests one approach is better than another. And if we place metadata in the MANIFEST, we would need to be able to change the metadata at a later time.

petermattis commented 3 years ago

Aren't WAL writes the primary source of fsyncs and not file creation?

Yes.

btw, we don't currently re-encrypt existing files after key rotation. This is a weakness that we may need to fix, by doing background work to slowly decrypt/re-encrypt. There are some difficulties in re-integrating the newly encrypted file, both into in-memory state and the persistent state. Using file rename seems simple, and handles in-memory state, but we can't atomically change the persistent registry (whether it is separate or in the MANIFEST), which records the key id etc., and rename the file. It may be worth thinking through this to see if it suggests one approach is better than another. And if we place metadata in the MANIFEST, we would need to be able to change the metadata at a later time.

Encrypting/decrypting an sstable could be performed by swapping the new table for the old table in the LSM tree. The old table would then be added to the obsolete tables list and be subsequently deleted. If we crash before deleting the obsolete table then the scan for obsolete files at startup would catch it. I suspect it doesn't matter whether the encryption metadata is stored within Pebble's MANIFEST or separately in order for this sstable swapping to be done.

sumeerbhola commented 3 years ago

Encrypting/decrypting an sstable could be performed by swapping the new table for the old table in the LSM tree.

I agree that doing it the same way we handle file changes during compactions/flushes is reasonable. I was initially trying to see if there were any shortcuts here, but I can't think of any. I do think that the interface between Pebble and the external entity doing these rewrites should be better fleshed out -- there is a potential for races (which we may choose to allow or disallow) between compactions and such rewrites, and we don't currently allow something external to Pebble to tell it to remove a file. An interface to request a "file-rewrite compaction" (which doesn't iterate over the contents of the sstable) from Pebble may be sufficient.

jbowens commented 3 years ago

One thing I didn't really think through was the implication of the synchronous syncs we need to perform during a flush and how that could affect our flush throughput. If we split a flush into n sstables, today we need to sync:

With encryption-at-rest metadata outside the manifest, we need to sync:

These file registry syncs only need to sync ~1 block (vs the previous O(# version sstables)), since we only need to write a small version edit to the registry. Flushing the memtable is a critical section to new writes, so waiting for n extra syncs is unfortunate.

One idea: don't sync the file registry on every operation. Instead, sync it only when a directory is synced.

petermattis commented 3 years ago

One idea: don't sync the file registry on every operation. Instead, sync it only when a directory is synced.

I like this idea.

jbowens commented 3 years ago

If I understand correctly, our current encryptedFS doesn't guarantee atomicity of some operations:

// Rename implements vfs.FS.Rename.
func (fs *encryptedFS) Rename(oldname, newname string) error {
    if err := fs.FS.Rename(oldname, newname); err != nil {
        return err
    }
    return fs.fileRegistry.MaybeRenameEntry(oldname, newname)
}

A crash during (*encryptedFS).Rename after the fs.FS.Rename but before MaybeRenameEntry would result in a file that fails to decrypt. If the Rename operation was for the CURRENT file, the entire store would fail to open.

We could adjust the file entry to record both new and old encryption settings for the path, and we could write the updated entry to the registry first. Reads would need to fallback to the old settings if using the new encryption settings fails to decrypt the file. But if we defer the registry fsync until the explicit directory fsync, what happens if there's memory pressure and the OS decides to do its own writeback of the directory's dirty blocks before the registry's? With an ill-timed hard reset, you could again end up with files you can't decrypt.

I'll keep thinking on this.

sumeerbhola commented 3 years ago

our current encryptedFS doesn't guarantee atomicity of some operations

Are there cases

jbowens commented 3 years ago

I'd prefer if we don't need to make any atomicity promises.

I am most worried about CockroachDB usage via storage/fs.FS. The surface area within Pebble is relatively small and well understood by us. Within CockroachDB, I'm wary of accidental usage of the storage/fs.FS interface with atomicity expectations matching typical filesystem guarantees. I think we need to be very explicit about what guarantees we do provide.

Are we getting atomicity guarantees for rename from the underlying filesystems (in the case of machine crashes)?

Yes, the filesystem provides an atomicity guarantee for rename. Pebble relies on this (eg, for atomic manifest rotation via a CURRENT rename). My understanding is that on Linux, if the destination already exists, the filesystem guarantees allow for a period of time when the both the source path and destination path both point to the file being renamed.

sumeerbhola commented 3 years ago

Pebble relies on this (eg, for atomic manifest rotation via a CURRENT rename). My understanding is that on Linux, if the destination already exists, the filesystem guarantees allow for a period of time when the both the source path and destination path both point to the file being renamed.

I agree about the Linux semantics, which is also consistent with the fact that a machine crash can result in a hard link, but not execute the subsequent removal. The CURRENT case is very narrow -- we don't really need that file to be encrypted.

Regarding CockroachDB, I think we need to audit and probably add some stricter test behavior anyway -- e.g. we seem to be using Rename for directories in newDiskSideloadStorage.

jbowens commented 3 years ago

The CURRENT case is very narrow -- we don't really need that file to be encrypted.

Agreed, it doesn't need to be encrypted. Are you suggesting we carve out an exception within encryptedFS for files prefixed with CURRENT?

sumeerbhola commented 3 years ago

Are you suggesting we carve out an exception within encryptedFS for files prefixed with CURRENT?

Probably not -- it is too much of a hack. I was thinking of passing in an optional second FS implementation when opening Pebble, for use with CURRENT. We would want a func RenameIsAtomic() bool in the FS interface, such that the open fails if there is only one FS implementation provided that returns false from this func.

We'd need to document that the default semantics for Rename are non-atomic and a failure cannot be retried. And for Link it is non-atomic, but a failure can be retried (or maybe not -- I don't know if go's os.Link is like ln -f). We could then outline what is safe for the callers to use these for:

I think the last case covers the use of Link for ingestion.

In general it would be nice if we can get away with weak semantics. I'm not keen on detecting that the file contents are not properly decrypting at read time and then falling back to a different (key, nonce, counter) tuple -- we are not storing a cryptographic hash for integrity purposes (and we don't want to, for the usual case of reading only parts of sstables), so anything will successfully decrypt, and the detection of whether it is garbage will need to rely on higher layers.

Some other options

jbowens commented 3 years ago

I might be missing something, but isn't storing CURRENT in plaintext a backwards incompatible change too? We could begin storing it unencrypted going forward, but not until the CockroachDB version upgrade is finalized. Then on start we can't know whether the CURRENT file was written pre or post version finalization without trying and falling back, at least for the CURRENT file.


I'm not keen on detecting that the file contents are not properly decrypting at read time and then falling back to a different (key, nonce, counter) tuple -- we are not storing a cryptographic hash for integrity purposes (and we don't want to, for the usual case of reading only parts of sstables), so anything will successfully decrypt, and the detection of whether it is garbage will need to rely on higher layers.

Another idea is that the new registry format could encode a mapping from inode to (key, nonce, counter) tuple. The Rename operation becomes atomic just by relying on the underyling filesystem's atomicity. Reading files becomes more complicated though, because you would need to open the file, then fstat the file descriptor to get the inode. I'm also not sure about portability. Does Windows expose a similar concept?

jbowens commented 3 years ago

Bumping the above discussion ^

sumeerbhola commented 3 years ago

but isn't storing CURRENT in plaintext a backwards incompatible change too? We could begin storing it unencrypted going forward, but not until the CockroachDB version upgrade is finalized. Then on start we can't know whether the CURRENT file was written pre or post version finalization without trying and falling back, at least for the CURRENT file.

Agreed that it is a backwards incompatible change. Instead of reusing the name CURRENT, we could use CURRENT-v2 and not have to try and then fallback.

Another idea is that the new registry format could encode a mapping from inode to (key, nonce, counter) tuple. ... Does Windows expose a similar concept?

I don't think windows exposes inodes -- I'm not sure we can go down this path.

jbowens commented 3 years ago

I don't think windows exposes inodes -- I'm not sure we can go down this path.

Looks like Windows exposes a concept called a 'File ID', that's a 64-bit integer uniquely identifying a file within a volume. The FileIndexHigh and FileIndexLow from windows.ByHandleFileInformation together form the 'File ID'.

jbowens commented 3 years ago

The incremental file registry is merged. I'm going to close this out, and the remaining work dealing with nonatomic renames is tracked in #69861.