apache / accumulo

Apache Accumulo
https://accumulo.apache.org
Apache License 2.0
1.07k stars 445 forks source link

Encrypt Sorted Recovery files #2076

Closed milleruntime closed 2 years ago

milleruntime commented 3 years ago

During recovery, the write ahead logs are read, sorted and written back to disk using org.apache.hadoop.io.MapFile.Writer(). If the On Disk Encryption feature is used, the sorted Map files will not be encrypted during this time. In order for the On Disk Encryption to be complete, these files need to be encrypted on disk. The best solution would probably be to use RFile, like is used with the rest of Accumulo.

Code dealing with sorting: https://github.com/apache/accumulo/blob/bbd87a6693fc5cdbbe947b0821a4f05a18cf905b/server/tserver/src/main/java/org/apache/accumulo/tserver/log/LogSorter.java#L177 https://github.com/apache/accumulo/blob/7e3c67d0c0b831d7d29c69e24d07dd700f24c5e0/server/manager/src/main/java/org/apache/accumulo/manager/recovery/RecoveryManager.java#L151 Then tserver will recover the sorted logs: https://github.com/apache/accumulo/blob/30ce59fd94f51b60a30ced328156f02d3223330b/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/Tablet.java#L357-L358 https://github.com/apache/accumulo/blob/6a74b4667e3bd33e34b5262c5dd8ea64167fb657/server/tserver/src/main/java/org/apache/accumulo/tserver/log/SortedLogRecovery.java#L285-L286

milleruntime commented 3 years ago

The first hurdle is changing SortedLogRecovery to use RFile instead of MapFile. That will probably be one pull request. Once that is done, then we will be able to use them for encryption.

@ctubbsii my initial work translating the WAL event data to RFiles implements a schema that stores LogEvents in the column family. I think this will allow using locality groups based on the different values of the enum. See https://github.com/apache/accumulo/blob/3fd5cad92f9b63ac19e4466f3f2d5237b905262c/server/tserver/src/main/java/org/apache/accumulo/tserver/logger/LogEvents.java

ctubbsii commented 3 years ago

I was thinking locality groups could be used to efficiently segregate information from different tablets. So, to recover a single tablet, you don't need to scan through any of the other tablets' data. Using locality groups for the log event types is a more traditional way of using locality groups where the locality group is configured in the Accumulo configuration properties. However, we don't really have a use case to only read a single event type at a time... we do have a use case to read all event types for a single tablet at a time. At least, I think that would be useful.

milleruntime commented 3 years ago

I can see where that would speed up recovery. I haven't done much with locality groups so not sure how that would work with the sorting of the log events. I was going with a schema that preserves the current way the WAL events sort, by Event + tabletId + sequence number:

        Row = EventTypeInteger_tabletID_seq
        Family = event
        Qualifier = tserverSession OR filename OR KeyExtent

I created the Event type integer based on what is returned by the compareTo() of LogFileKey. Then appended the tabletId integer (which is written from commitSession.getLogId()) and finally the sequence number. The event type determines what will be stored in the column qualifier. https://github.com/apache/accumulo/blob/583ca55ceb5402d6bbd0fb093feccc897234582f/server/tserver/src/main/java/org/apache/accumulo/tserver/logger/LogFileKey.java#L155-L164

I was using scanner.fetchColumnFamily(cf); where the cf could be 0 or many Text... colFamToFetch and passing in the different events to fetch. Maybe I can put up a draft PR of just the LogFileKey.toKey() method so you can provide some feedback. There are already a good number of changes but its mostly changing all the iterators to work with Key/Value.

milleruntime commented 3 years ago

If I gave each enum in LogEvents a unique number for sorting, then I wouldn't have to store the string in the column family. The new numbering could look like this:

    // OPEN = 0
    // DEFINE_TABLET = 1
    // COMPACTION_START = 2
    // COMPACTION_FINISH = 3
    // MUTATION = 4
    // MANY_MUTATIONS = 5

It would be extra convenient if the ordinal() of the enum corresponded to these numbers but currently that is not the case. The sorted WALs between previous versions and this one will already be incompatible so it may not be a big deal to change the order of the LogEvents enum. But I wonder if it could lead to some crazy bugs if a user were to recover an old WAL.

keith-turner commented 3 years ago

I was thinking locality groups could be used to efficiently segregate information from different tablets. So, to recover a single tablet, you don't need to scan through any of the other tablets' data.

I think the sorted WALs are structured such that needed information sorts together and can be efficiently found via seek. I have not looked at the code in a while, but I am not sure sorted WALs are ever scanned entirely. I think they may only be seeked to specific places. If this is true then there may not be a benefit to LGs. I think the sort segregates information as needed.

keith-turner commented 3 years ago

But I wonder if it could lead to some crazy bugs if a user were to recover an old WAL.

Would need to maintain code to recover old WALs, or refuse to upgrade if WALs are present forcing user to run old version and resolve WALs. Refusing to upgrade could be annoying for the user as its difficult to know if there are WALs. They could end up in a loop trying to run new version and having to fall back to old version.

ctubbsii commented 3 years ago

I wouldn't use an integer in place of the enum. RFile is already probably fine at compressing, and there's value in being able to actually read the file contents while troubleshooting. I don't think it would save much.

As Keith says, the current key scheme may be adequate... it's certainly the simplest migration. In that case, you don't need to use the CF, CQ, or CV at all. Just Row/Value.

Where I was going with the LG conversation was that I was just thinking that we could eliminate the tablet ID in the key structure, and make it efficient to recover an individual tablet if each tablet got its own locality group. We might even be able to eliminate the tabletID mapping and just use the extent directly, since it wouldn't be duplicated in the key. I don't see any advantage to using the event type for locality groups, though. We don't need efficient reads of all events of a specific type (I don't think so, anyway).

The wrench in my thinking is that I'm not that familiar right now with the recovery process, and I'm not actually sure why event type sorts before tablet ID.

milleruntime commented 3 years ago

As Keith says, the current key scheme may be adequate... it's certainly the simplest migration. In that case, you don't need to use the CF, CQ, or CV at all. Just Row/Value.

The current scheme in LogFileKey is just an implementation for WritableComparable which includes serialization/deserialization and the compareTo method. Part of this task is translating LogFileKey into an Accumulo Key so we can store it in an RFile. I am also translating LogFileValue into an Accumulo Value (but this is just the mutations serialized). We need somewhere to put all the information serialized in LogFileKey into Key, otherwise it will be lost when sorted.

We don't need efficient reads of all events of a specific type (I don't think so, anyway).

There are a few places during recovery where we iterate over the logs to get different types of information. The first time we only look for DEFINE_TABLET events in the log. This is where I got the idea to use event for CF. Check it out here: https://github.com/apache/accumulo/blob/6a74b4667e3bd33e34b5262c5dd8ea64167fb657/server/tserver/src/main/java/org/apache/accumulo/tserver/log/SortedLogRecovery.java#L105

The wrench in my thinking is that I'm not that familiar right now with the recovery process, and I'm not actually sure why event type sorts before tablet ID.

Join the club! I still don't understand tabletId and seq in LogFileKey. This comment here makes me think its tracking in each WAL: https://github.com/apache/accumulo/blob/6a74b4667e3bd33e34b5262c5dd8ea64167fb657/server/tserver/src/main/java/org/apache/accumulo/tserver/log/SortedLogRecovery.java#L290-L292

milleruntime commented 3 years ago

Need to finish #2197 before completing this task.

milleruntime commented 2 years ago

This is done so can be closed. Any work done for #2197 would be a separate task.