couchbaselabs / ForestDB-Benchmark

Benchmark program for ForestDB, Couchstore, LevelDB, RocksDB, and WiredTiger
Apache License 2.0

transactional guarantees and recoverability #11

Open keithbostic opened 8 years ago

keithbostic commented 8 years ago

What are the transactional guarantees enforced by each of these engines/configurations? I'm asking because:

greensky00 commented 8 years ago

Hi Keith, thanks for your comments. The recoverability that this benchmark program wants to guarantee is that every successful write batch should be recoverable after a system crash such as a sudden power failure.

  1. The reason why both checkpoint and logging are enabled concurrently is a space issue; as I mentioned in one of the comments at https://github.com/couchbaselabs/ForestDB-Benchmark/pull/2, the overall space occupied by log files grows continuously, and we need to limit that growth for a fair comparison with the other DB modules. If there is a better way to restrict the size of the log files than using checkpoints, please let me know; I will adapt the wrapper code.
  2. I’m not sure if I understand correctly; I think the “checkpoint frequency” you mentioned is the ‘[compaction]:period’ value in the configuration file, and that value affects only WiredTiger, Couchstore, and ForestDB, not LSM-based modules such as RocksDB and LevelDB. As I mentioned above, the checkpoint in WiredTiger is currently used to restrict the space occupied by log data, which is quite similar to the “compaction” process in Couchstore and ForestDB. So they need to have the same period for a fair comparison. Please note that the compaction frequency of the LSM-based modules is not set manually in this benchmark program.
  3. I agree that logging to a physically separate disk will greatly improve overall performance, but the other DB modules would also see a similar improvement if they could use a separate disk.
mdcallag commented 8 years ago

I prefer having multiple options for recoverability and then making it clear which one was used. In an old blog post I used the terms "durable", "not durable" and "really not durable" for the 3 options. I'm not sure "commit" is a command here, so "per operation" might be more accurate.

Explained here - http://mysqlha.blogspot.com/2010/03/durable-not-durable-and-really-not.html

With RocksDB, we use "durable" and "not durable" frequently in production. We rarely use "really not durable"; maybe someone doing batch processing can use that.
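For reference, a minimal sketch of how these three levels are commonly expressed through RocksDB's WriteOptions; the sync and disableWAL flags also come up later in this thread, and the mapping here is illustrative rather than the benchmark's actual wrapper code:

```c++
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Illustrative sketch: three durability levels expressed as WriteOptions.
rocksdb::WriteOptions durability(const std::string &level) {
    rocksdb::WriteOptions w;
    if (level == "durable") {
        w.sync = true;        // write and fsync the WAL on every commit
    } else if (level == "not_durable") {
        w.sync = false;       // write the WAL, but let the OS flush it later
    } else {                  // "really_not_durable"
        w.disableWAL = true;  // skip the WAL entirely; recent writes are lost on crash
    }
    return w;
}

// Usage: db->Put(durability("durable"), key, value);
```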


keithbostic commented 8 years ago

The reason why both checkpoint and logging are enabled concurrently is a space issue; as I mentioned in one of the comments at #2, the overall space occupied by log files grows continuously, and we need to limit that growth for a fair comparison with the other DB modules.

I would suggest that the useful knobs be surfaced in the configuration API and that the benchmark monitor the amount of disk space each engine uses over time; that answers much more interesting questions.

Specifically with respect to WiredTiger logging, WiredTiger log files can be compressed at run-time (which the benchmark isn't doing), and simply discarded periodically without affecting transactional guarantees. The interesting question with respect to WiredTiger checkpoints and log files is bounding recovery time on startup, but that's not what this benchmark is measuring.

As I mentioned above, the checkpoint in WiredTiger is currently used to restrict the space occupied by log data, which is quite similar to the “compaction” process in Couchstore and ForestDB.

A WiredTiger checkpoint creates a transactional snapshot and provides specific transactional guarantees; as part of providing those guarantees, it does a lot of work inside the engine at high priority that can seriously affect throughput. As far as I can tell, Couchbase compaction is the low-priority process of discarding stale data from the underlying data files. Yes, both operations affect overall disk space usage, but that's all they have in common.

So they need to have the same period for a fair comparison.

I disagree completely -- they are doing completely different tasks in support of completely different goals. Durability and disk space usage are being conflated, and that isn't correct.

I agree that logging to a physically separate disk will greatly improve overall performance, but the other DB modules would also see a similar improvement if they could use a separate disk.

To the extent engines can put files in different locations, that should be configurable. WiredTiger can put log files on a different disk (putting WAL files on a different disk is standard practice for increasing throughput; nobody who cares at all about performance writes WAL files to the same device as the data, and that one fact is a fatal flaw in this benchmark).

Implying it levels the playing field if all engines use the same device relies on the assumption that all engines are penalized to the same degree by the restriction. Since the engines write different files for different reasons, that assumption cannot be correct.
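For reference, a minimal sketch of how both points above (run-time log compression and a separate log device) can be expressed through the wiredtiger_open configuration string; the snappy extension path and the /wal mount point are assumptions for illustration:

```c++
#include <wiredtiger.h>

// Illustrative sketch: enable log compression and place log files in a
// directory on a different device from the data files. The log directory
// (here the hypothetical /wal/wt_log) must already exist.
int open_with_separate_compressed_log(const char *home, WT_CONNECTION **connp) {
    return wiredtiger_open(
        home, NULL,
        "create,"
        "extensions=[/usr/local/lib/libwiredtiger_snappy.so],"  // assumed extension path
        "log=(enabled=true,compressor=snappy,path=/wal/wt_log)",
        connp);
}
```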

greensky00 commented 8 years ago

Mark,

This benchmark program already includes a recoverability option in its configuration file: ‘[operation]:write_type’. If the option is set to ‘sync’, then ‘WriteOptions::sync’ in the RocksDB wrapper is set to 1; otherwise it is set to 0. As I remember, the default value of ‘WriteOptions::disableWAL’ is 0, so it seems reasonable to say that those options correspond to the “durable” and “not durable” options that you mentioned.

It looks like the “really not durable” option would be used when we need a volatile (in-memory) key-value store. This benchmark program does not support that option for now since, as you also mentioned, it is not frequently used, but let me consider adding it too.

greensky00 commented 8 years ago

Keith,

In my understanding, we can summarize the points as follows:

(1) A WiredTiger checkpoint reduces overall space usage, but that is only a supplemental behavior; creating transactional snapshots is the fundamental task of a checkpoint, and accordingly it is unfair to use it like the “compaction” of Couchstore or ForestDB, since a “checkpoint” carries heavier overhead than a “compaction”.

(2) WiredTiger can put log files on a different disk, while other DB engines cannot. Restricting those additional (and also unique) features is not fair, as each DB engine uses disks according to its own rules and purposes.

Regarding (1), yes, I understand and agree with what you mean. However, we still cannot call it a fair comparison if the space used by WiredTiger's log files can grow substantially over time (even though they are automatically reclaimed at some point). Both Couchstore and ForestDB can also make their DB files grow much larger than the original working-set size by adjusting compaction parameters, and consequently their overall performance can be improved considerably as well. I don't know whether the latest WiredTiger still has the same space issue, so let me check it. I agree that a space-monitoring feature needs to be added to the benchmark program. Let me add it soon.

Regarding (2), I partly agree and partly disagree. Of course all DB engines should be configured for their best performance, but the configuration needs to be limited to “the given environment”. The major assumption of this benchmark program is that a DB module runs on a single node equipped with a single disk, so the DB engines should follow this rule even though they are capable of using additional hardware such as extra disks or even other machines over the network. Sure, we can show the superior performance of WiredTiger when it is configured to use a different disk for the WAL, but those results need to be reported separately from the results based on a single-disk environment.

keithbostic commented 8 years ago

However, we still cannot call it a fair comparison if the space used by WiredTiger's log files can grow substantially over time (even though they are automatically reclaimed at some point).

Since the benchmark doesn't concern itself with archival, the WiredTiger log files can be discarded; if WiredTiger approaches disk space limits, perform a single checkpoint and remove all log files other than the most recent.

Since compression decreases the amount of I/O, compression of the WiredTiger log files could be offered as an optional configuration.
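A minimal sketch of the checkpoint-and-discard approach suggested above, against the WiredTiger C API; it assumes log archival is enabled at open time so that WiredTiger itself can remove log files a checkpoint has made unnecessary (the config keyword was archive in releases of that era, remove in later ones):

```c++
#include <wiredtiger.h>

// Illustrative sketch: bound log-file space without weakening durability.
int open_with_log_archival(const char *home, WT_CONNECTION **connp) {
    return wiredtiger_open(home, NULL,
                           "create,log=(enabled=true,archive=true)", connp);
}

// Hypothetical hook the benchmark could call when disk usage nears its limit:
// once the checkpoint completes, log records older than the checkpoint are no
// longer needed for recovery and the corresponding files can be removed.
int reclaim_log_space(WT_CONNECTION *conn) {
    WT_SESSION *session;
    int ret = conn->open_session(conn, NULL, NULL, &session);
    if (ret != 0)
        return ret;
    ret = session->checkpoint(session, NULL);
    session->close(session, NULL);
    return ret;
}
```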

Of course all DB engines should be configured for their best performance, but the configuration needs to be limited to “the given environment”. The major assumption of this benchmark program is that a DB module runs on a single node equipped with a single disk, so the DB engines should follow this rule even though they are capable of using additional hardware such as extra disks or even other machines over the network.

A fair point, but the limitation unavoidably penalizes some engines more than others, and where that's the case, I think additional configuration options are strongly justified. (And I think we agree on that point.) Where we may disagree: I think the high-end server applications these engines target are rarely limited to a single disk, so why make the benchmark's default and only supported configuration a single node?

mdcallag commented 8 years ago

Looking at couch_bench.cc today, couchstore_set_flags(0x1) is called before the db is opened for the load, and the couchstore_open_db call doesn't use binfo->sync_write to set a flag, so FDB_DRB_ASYNC is used during the load. So wal_flush_before_commit is true and durability_opt is FDB_DRB_ASYNC during the load.

What is the behavior with and without periodic commit? I assume that a WAL sync is done on commit because of couchstore_set_flags(0x1), but I don't know which file structures are synced. Without periodic commit enabled there is only one commit -- at the end of the load -- so my question is really about what is done when periodic commit is enabled.


chiyoung commented 8 years ago

If we want to measure the read / write performance without creating periodic transactional snapshots, WiredTiger can be configured based on your suggestion (i.e., issue a single checkpoint when the disk space limit is approached). However, the performance of periodic transactional snapshot writes / reads should also be measured separately, as they represent typical use cases too (e.g., Couchbase secondary index use cases).

Regarding a single disk, each storage engine tested in this benchmark can benefit to a different degree from running on multiple disks. We still observe lots of NoSQL deployments on low-to-mid-end commodity machines in public cloud environments (e.g., EC2, Google Compute Engine). However, we plan to extend this benchmark framework to provide additional options for better utilization of high-end machines with multiple disks.

keithbostic commented 8 years ago

@chiyoung, I'm unclear on what you mean by "periodic transactional snapshot writes / reads", would you please describe the functionality you are testing in more detail?

WiredTiger supports standard durability without checkpoint being called.

chiyoung commented 8 years ago

Keith,

I meant typical snapshot isolation / consistency semantics. For example, when a given set of documents is updated in the primary database by a client, the same client may need to issue queries (point or range) against the secondary indexes, or keyword searches against the inverted indexes, and see query results that reflect those document updates. In this case, an immutable and consistent snapshot should be created upon receiving a query request, so that the client can issue various queries on top of that snapshot. Obviously, there are usually concurrent clients that may need different levels of snapshot isolation, consistency, and recoverability.

As I'm not familiar with WiredTiger checkpointing functionalities, I'm not sure if the above use cases can be served without creating a checkpoint in WiredTiger. Please correct me if I misunderstood.

keithbostic commented 8 years ago

I'm not sure if the above use cases can be served without creating a checkpoint in WiredTiger.

The above use cases do not require WiredTiger checkpoints.
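For illustration, a minimal sketch of such a consistent read in WiredTiger using an ordinary snapshot-isolation transaction, with no checkpoint involved; it assumes a table named table:test with string key and value formats:

```c++
#include <wiredtiger.h>

// Illustrative sketch: all reads inside the transaction observe one
// consistent snapshot, regardless of concurrent writers.
int snapshot_read(WT_CONNECTION *conn, const char *key) {
    WT_SESSION *session;
    WT_CURSOR *cursor;
    if (conn->open_session(conn, NULL, NULL, &session) != 0)
        return -1;

    session->begin_transaction(session, "isolation=snapshot");
    if (session->open_cursor(session, "table:test", NULL, NULL, &cursor) != 0) {
        session->rollback_transaction(session, NULL);
        session->close(session, NULL);
        return -1;
    }

    cursor->set_key(cursor, key);
    if (cursor->search(cursor) == 0) {
        const char *value;
        cursor->get_value(cursor, &value);
        // ... issue further point or range reads on the same snapshot ...
    }

    cursor->close(cursor);
    session->commit_transaction(session, NULL);  // read-only: ends the snapshot
    return session->close(session, NULL);
}
```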

greensky00 commented 8 years ago

Mark,

What a periodic commit with the FDB_DRB_ASYNC option does is reflect (flush) the WAL entries into the main index (i.e., the HB+trie) and then remove the WAL. Actually, this is done by default in ForestDB if the 'wal_flush_before_commit' option is enabled, so all of the code blocks related to periodic commit during population (i.e., the 'binfo->pop_commit' option) are no longer necessary; they were added when the 'wal_flush_before_commit' option was not yet exposed to ForestDB users.

What is the difference between FDB_DRB_ASYNC and FDB_DRB_NONE?

The only difference between FDB_DRB_ASYNC and FDB_DRB_NONE is whether fsync() is called. When fdb_commit() is invoked, WAL entries may be flushed if certain conditions are satisfied, and then fsync() is called if the durability option is FDB_DRB_NONE.
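For reference, a minimal sketch of how the two modes are selected through the public ForestDB API (using the fdb_config fields discussed above; the actual wrapper code in the benchmark may differ):

```c++
#include <libforestdb/forestdb.h>

// Illustrative sketch: FDB_DRB_NONE commits call fsync(); FDB_DRB_ASYNC
// commits skip it. wal_flush_before_commit moves WAL entries into the
// HB+trie index as part of fdb_commit(), which bounds WAL growth.
fdb_status open_db(const char *path, bool sync_writes,
                   fdb_file_handle **fhandle, fdb_kvs_handle **kvs) {
    fdb_config config = fdb_get_default_config();
    config.durability_opt = sync_writes ? FDB_DRB_NONE : FDB_DRB_ASYNC;
    config.wal_flush_before_commit = true;

    fdb_status s = fdb_open(fhandle, path, &config);
    if (s != FDB_RESULT_SUCCESS)
        return s;

    fdb_kvs_config kvs_config = fdb_get_default_kvs_config();
    return fdb_kvs_open_default(*fhandle, kvs, &kvs_config);
}

// After a batch of fdb_set() calls:
//     fdb_commit(fhandle, FDB_COMMIT_NORMAL);  // fsync() only with FDB_DRB_NONE
```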

There is a log for data and separate file structure(s) for the index. Are they handled differently?

I don't exactly understand what you mean by 'handled'; but in terms of file synchronization, they are handled (and synchronized) together.