Bump rocksdbjni from 6.25.3 to 7.5.3

Bumps rocksdbjni from 6.25.3 to 7.5.3.

Release notes

RocksDB 7.5.3

7.5.2 (08/02/2022)

Bug Fixes

Fix a bug starting in 7.4.0 in which some fsync operations might be skipped in a DB after any DropColumnFamily on that DB, until it is re-opened. This can lead to data loss on power loss. (For custom FileSystem implementations, this could lead to a call to FSDirectory::Fsync or FSDirectory::Close after the first FSDirectory::Close; also, valgrind could report a call to close() with fd=-1.)

7.5.1 (08/01/2022)

Bug Fixes

Fix a bug where rate_limiter_priority is not passed into PartitionedFilterBlockReader::GetFilterPartitionBlock.

7.5.0 (07/15/2022)

New Features

Mempurge option flag experimental_mempurge_threshold is now a ColumnFamilyOptions and can now be dynamically configured using SetOptions().

Support backward iteration when ReadOptions::iter_start_ts is set.

Provide support for ReadOptions.async_io with direct_io to improve Seek latency by using async IO to parallelize child iterator seek and doing asynchronous prefetching on sequential scans.

Added support for blob caching in order to cache frequently used blobs for BlobDB.

User can configure the new ColumnFamilyOptions blob_cache to enable/disable blob caching.

Either sharing the backend cache with the block cache or using a completely separate cache is supported.

A new abstraction interface called BlobSource for blob read logic gives all users access to blobs, whether they are in the blob cache, secondary cache, or (remote) storage. Blobs can be potentially read both while handling user reads (Get, MultiGet, or iterator) and during compaction (while dealing with compaction filters, Merges, or garbage collection) but eventually all blob reads go through Version::GetBlob or, for MultiGet, Version::MultiGetBlob (and then get dispatched to the interface -- BlobSource).

Add experimental tiered compaction feature AdvancedColumnFamilyOptions::preclude_last_level_data_seconds, which makes sure the new data inserted within preclude_last_level_data_seconds won't be placed on cold tier (the feature is not complete).

Public API changes

Add metadata related structs and functions in C API, including

rocksdb_get_column_family_metadata() and rocksdb_get_column_family_metadata_cf() to obtain rocksdb_column_family_metadata_t.

rocksdb_column_family_metadata_t and its get functions & destroy function.

rocksdb_level_metadata_t and its get functions & destroy function.

rocksdb_file_metadata_t and its get functions & destroy function.

Add suggest_compact_range() and suggest_compact_range_cf() to C API.

When using block cache strict capacity limit (LRUCache with strict_capacity_limit=true), DB operations now fail with Status code kAborted subcode kMemoryLimit (IsMemoryLimit()) instead of kIncomplete (IsIncomplete()) when the capacity limit is reached, because Incomplete can mean other specific things for some operations. In more detail, Cache::Insert() now returns the updated Status code and this usually propagates through RocksDB to the user on failure.

NewClockCache calls temporarily return an LRUCache (with similar characteristics as the desired ClockCache). This is because ClockCache is being replaced by a new version (the old one had unknown bugs) but this is still under development.

Add two functions, int ReserveThreads(int threads_to_be_reserved) and int ReleaseThreads(int threads_to_be_released), to the Env class. In the default implementation, both return 0. Any newly added Env subclass should implement these two functions to support thread reservation and release.

Removed Customizable support for RateLimiter and removed its CreateFromString() and Type() functions.

Bug Fixes

Fix a bug in which backup/checkpoint can include a WAL deleted by RocksDB.

Fix a bug where concurrent compactions might cause unnecessary further write stalling. In some cases, this might cause write rate to drop to minimum.

Fix a bug in Logger where, if dbname and db_log_dir are on different filesystems, creating dbname would fail with respect to the db_log_dir path, returning an error and failing to open the DB.

Fix a CPU and memory efficiency issue introduced by facebook/rocksdb#8336, which made InternalKeyComparator configurable as an unintended side effect.

Fix a bug where GenericRateLimiter could revert the bandwidth set dynamically using SetBytesPerSecond() when a user configures a structure enclosing it, e.g., using GetOptionsFromString() to configure an Options that references an existing RateLimiter object.

Behavior Change

In leveled compaction with dynamic levelling, the level multiplier is no longer adjusted due to an oversized L0. Instead, the compaction score is adjusted by increasing the level's size target by the incoming bytes from upper levels. This deprioritizes compactions from upper levels if more data from L0 is coming. This fixes some unnecessary full stalling caused by drastic changes of level targets, while not wasting write bandwidth on compaction while writes are overloaded.

For track_and_verify_wals_in_manifest, revert to the original behavior before #10087: syncing of live WAL file is not tracked, and we track only the synced sizes of closed WALs. (PR #10330).

WAL compression now computes/verifies checksum during compression/decompression.

Performance Improvements

Rather than doing a total sort of all files in a level, SortFileByOverlappingRatio() now only finds the top 50 files based on score. This can improve write throughput for use cases where data is loaded in increasing key order and there are a lot of files in one LSM-tree, where applying compaction results is the bottleneck.

In leveled compaction, L0->L1 trivial move will allow more than one file to be moved in one compaction. This would allow L0 files to be moved down faster when data is loaded in sequential order, making slowdown or stop condition harder to hit. Also seek L0->L1 trivial move when only some files qualify.

In leveled compaction, try to trivial move more than one file if possible, up to 4 files or max_compaction_bytes. This allows higher write throughput for some use cases where data is loaded in sequential order, where applying compaction results is the bottleneck.
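
The SortFileByOverlappingRatio() change above replaces a total sort with a top-50 selection. As a rough illustration of that general technique (not RocksDB's actual code), the sketch below keeps only the k best items with a bounded heap. The class name `TopKSelect`, the method `topK`, and the convention that items sorting first are the best are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: a generic top-k selection, not RocksDB's implementation.
final class TopKSelect {
    /**
     * Returns the k items that would come first under byPriority, in order,
     * without sorting the whole input (O(n log k) instead of O(n log n)).
     */
    static <T> List<T> topK(List<T> items, int k, Comparator<T> byPriority) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Reversed comparator: the head of the heap is the worst item kept so far.
        PriorityQueue<T> heap = new PriorityQueue<>(byPriority.reversed());
        for (T item : items) {
            if (heap.size() < k) {
                heap.add(item);
            } else if (byPriority.compare(item, heap.peek()) < 0) {
                heap.poll();    // drop the current worst
                heap.add(item); // keep the better candidate
            }
        }
        List<T> result = new ArrayList<>(heap);
        result.sort(byPriority); // only k items are sorted at the end
        return result;
    }

    public static void main(String[] args) {
        List<Integer> scores = Arrays.asList(5, 1, 4, 2, 3);
        System.out.println(topK(scores, 2, Comparator.<Integer>naturalOrder()));
    }
}
```

With many files in a level and a small k, the heap never holds more than k entries, which is where the saving over a full sort would come from.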

RocksDB 7.4.5

... (truncated)

Changelog

Sourced from rocksdbjni's changelog.

Rocksdb Change Log

Unreleased

Bug Fixes

Fixed a hang when an operation such as GetLiveFiles or CreateNewBackup is asked to trigger and wait for memtable flush on a read-only DB. Such indirect requests for memtable flush are now ignored on a read-only DB.

Fixed bug where FlushWAL(true /* sync */) (used by GetLiveFilesStorageInfo(), which is used by checkpoint and backup) could cause parallel writes at the tail of a WAL file to never be synced.

Fix periodic_task being unable to re-register the same task type, which may cause SetOptions() to fail to update periodic task intervals such as stats_dump_period_sec and stats_persist_period_sec.

Fixed a bug in the rocksdb.prefetched.bytes.discarded stat. It was counting the prefetch buffer size, rather than the actual number of bytes discarded from the buffer.

Fix a bug where the directory containing CURRENT can be left unsynced after CURRENT is updated to point to the latest MANIFEST, which risks losing the unsynced CURRENT update.

Update rocksdb.multiget.io.batch.size stat in non-async MultiGet as well.

Public API changes

Add rocksdb_column_family_handle_get_id, rocksdb_column_family_handle_get_name to get name, id of column family in C API

Add a new stat rocksdb.async.prefetch.abort.micros to measure time spent waiting for async prefetch reads to abort

Java API Changes

Add CompactionPriority.RoundRobin.

Revert to using the default metadata charge policy when creating an LRU cache via the Java API.
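
The CompactionPriority.RoundRobin value added above corresponds to a picking mode that cycles through files using a compact cursor. The following is a minimal conceptual sketch of such a cursor, not the RocksDB implementation: the class `RoundRobinCursor`, its methods, and the idea of identifying each file by its smallest key are hypothetical simplifications.

```java
import java.util.TreeSet;

// Illustrative only: a wrap-around cursor over files ordered by smallest key.
final class RoundRobinCursor {
    private final TreeSet<String> fileStartKeys = new TreeSet<>();
    private String cursor = null; // smallest key of the file picked last

    void addFile(String smallestKey) {
        fileStartKeys.add(smallestKey);
    }

    /** Returns the smallest key of the next file to pick, or null if there are no files. */
    String pickNext() {
        if (fileStartKeys.isEmpty()) {
            return null;
        }
        // First file strictly after the cursor, if any.
        String next = (cursor == null) ? null : fileStartKeys.higher(cursor);
        if (next == null) {
            next = fileStartKeys.first(); // wrap around to the beginning
        }
        cursor = next;
        return next;
    }

    public static void main(String[] args) {
        RoundRobinCursor picker = new RoundRobinCursor();
        picker.addFile("a");
        picker.addFile("f");
        picker.addFile("m");
        for (int i = 0; i < 4; i++) {
            System.out.println(picker.pickNext());
        }
    }
}
```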

Behavior Change

Right now, when the option migration tool (OptionChangeMigration()) migrates to FIFO compaction, it compacts all the data into one single SST file and moves it to L0. This might create a problem for some users: the giant file may soon be deleted to satisfy max_table_files_size, which might cause the DB to be almost empty. We change the behavior so that the files are cut to be smaller, but these files might not follow the data insertion order. With the change, after the migration, migrated data might not be dropped in insertion order by FIFO compaction.

New Features

RocksDB does internal auto prefetching if it notices 2 sequential reads and readahead_size is not specified. A new option, num_file_reads_for_auto_readahead, is added in BlockBasedTableOptions; it indicates after how many sequential reads internal auto prefetching should start (default is 2).
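
To make the threshold semantics concrete, here is a small conceptual sketch of counting sequential reads before enabling readahead. It is not RocksDB code: the class `SequentialReadTracker` and its method are hypothetical, and it assumes that a read is "sequential" when it starts exactly where the previous one ended and that prefetching begins on the read that reaches the threshold.

```java
// Illustrative only: decides when auto readahead would kick in.
final class SequentialReadTracker {
    private final int readsBeforeReadahead; // plays the role of num_file_reads_for_auto_readahead
    private long expectedNextOffset = -1;
    private int sequentialReads = 0;

    SequentialReadTracker(int readsBeforeReadahead) {
        this.readsBeforeReadahead = readsBeforeReadahead;
    }

    /** Records a read and returns true once enough sequential reads have been seen. */
    boolean onRead(long offset, int length) {
        if (offset == expectedNextOffset) {
            sequentialReads++;   // continues the previous read
        } else {
            sequentialReads = 1; // a new run starts with this read
        }
        expectedNextOffset = offset + length;
        return sequentialReads >= readsBeforeReadahead;
    }

    public static void main(String[] args) {
        SequentialReadTracker tracker = new SequentialReadTracker(2);
        System.out.println(tracker.onRead(0, 100));   // first read of a run
        System.out.println(tracker.onRead(100, 100)); // second sequential read
        System.out.println(tracker.onRead(500, 10));  // random read resets the run
    }
}
```

Under these assumptions, raising the threshold delays prefetching for workloads that mix short sequential runs with random reads.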

Performance Improvements

Iterator performance is improved for DeleteRange() users. Internally, the iterator will skip to the end of a range tombstone when possible, instead of looping through each key and checking individually whether it is range deleted.
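
The idea of jumping past a range tombstone instead of testing every covered key can be sketched with a sorted set. This is only an illustration under simplifying assumptions (a single tombstone covering [start, end), string keys, no sequence numbers); the class `RangeTombstoneSkip` and its method are hypothetical and do not mirror RocksDB internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Illustrative only: iterate keys while skipping one deleted range [start, end).
final class RangeTombstoneSkip {
    static List<String> visibleKeys(NavigableSet<String> keys, String start, String end) {
        List<String> visible = new ArrayList<>();
        String key = keys.isEmpty() ? null : keys.first();
        while (key != null) {
            boolean deleted = key.compareTo(start) >= 0 && key.compareTo(end) < 0;
            if (deleted) {
                // Seek directly to the first key at or after the tombstone's end,
                // rather than visiting every covered key one by one.
                key = keys.ceiling(end);
            } else {
                visible.add(key);
                key = keys.higher(key);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        NavigableSet<String> keys = new TreeSet<>(Arrays.asList("a", "b", "c", "d", "e"));
        System.out.println(visibleKeys(keys, "b", "d"));
    }
}
```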

7.6.0 (08/19/2022)

New Features

Added prepopulate_blob_cache to ColumnFamilyOptions. If enabled, warm/hot blobs that are already in memory are prepopulated into the blob cache at the time of flush. On a flush, blobs that are in memory (in memtables) get flushed to the device. If using Direct IO, additional IO is incurred to read these blobs back into memory again, which is avoided by enabling this option. This further helps if the workload exhibits high temporal locality, where most of the reads go to recently written data. It also helps with remote file systems, since reads there involve network traffic and higher latencies.

Support using secondary cache with the blob cache. When creating a blob cache, the user can set a secondary blob cache by configuring secondary_cache in LRUCacheOptions.

Charge memory usage of the blob cache when the backing cache of the blob cache and the block cache are different. If an operation reserving memory for the blob cache exceeds the available space left in the block cache at some point (i.e., causing a cache full under LRUCacheOptions::strict_capacity_limit = true), creation will fail with Status::MemoryLimit(). To opt in to this feature, enable charging CacheEntryRole::kBlobCache in BlockBasedTableOptions::cache_usage_options.

Improve subcompaction range partitioning so that it is likely to be more even. A more even distribution of subcompactions will improve compaction throughput for some workloads. All input files' index blocks are sampled for anchor key points, from which we pick positions to partition the input range. This introduces some CPU overhead in the compaction preparation phase if subcompaction is enabled, but it should be a small fraction of the CPU usage of the whole compaction process. This also brings a behavior change: the subcompaction number is much more likely to be maxed out than before.

Add CompactionPri::kRoundRobin, a compaction picking mode that cycles through all the files with a compact cursor in a round-robin manner. This feature is available since 7.5.

Provide support for subcompactions for user_defined_timestamp.

Added an option memtable_protection_bytes_per_key that turns on memtable per key-value checksum protection. Each memtable entry will be suffixed by a checksum that is computed during writes, and verified in reads/compaction. Detected corruption will be logged and a corruption status returned to the user.

Added a blob-specific cache priority level - bottom level. Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. The user can specify the new option low_pri_pool_ratio in LRUCacheOptions to configure the ratio of capacity reserved for low priority cache entries (and therefore the remaining ratio is the space reserved for the bottom level), or configure the new argument low_pri_pool_ratio in NewLRUCache() to achieve the same effect.
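
The per key-value checksum protection described above for memtable_protection_bytes_per_key can be pictured as follows. This is a conceptual sketch only: it uses java.util.zip.CRC32 as a stand-in checksum (the hash RocksDB actually uses and its entry layout are not shown here), and the class `ChecksummedEntry` is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative only: an entry that carries a checksum computed at write time.
final class ChecksummedEntry {
    final byte[] key;
    final byte[] value;
    final long checksum; // computed once when the entry is created

    private ChecksummedEntry(byte[] key, byte[] value, long checksum) {
        this.key = key;
        this.value = value;
        this.checksum = checksum;
    }

    static ChecksummedEntry of(byte[] key, byte[] value) {
        return new ChecksummedEntry(key, value, compute(key, value));
    }

    private static long compute(byte[] key, byte[] value) {
        CRC32 crc = new CRC32();
        // Include the key length so ("ab","c") and ("a","bc") hash differently.
        crc.update(ByteBuffer.allocate(4).putInt(key.length).array());
        crc.update(key, 0, key.length);
        crc.update(value, 0, value.length);
        return crc.getValue();
    }

    /** Recomputes the checksum on read and compares it with the stored one. */
    boolean verify() {
        return compute(key, value) == checksum;
    }

    public static void main(String[] args) {
        ChecksummedEntry entry = of("user:1".getBytes(StandardCharsets.UTF_8),
                                    "alice".getBytes(StandardCharsets.UTF_8));
        System.out.println(entry.verify()); // entry is intact
        entry.value[0] ^= 1;                // simulate an in-memory bit flip
        System.out.println(entry.verify()); // corruption is detected
    }
}
```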

Public API changes

Removed Customizable support for RateLimiter and removed its CreateFromString() and Type() functions.

CompactRangeOptions::exclusive_manual_compaction is now false by default. This ensures RocksDB does not introduce artificial parallelism limitations by default.

Tiered Storage: change bottommost_temperature to last_level_temperature. The old option name is kept only for migration; please use the new option. The behavior is changed to apply the temperature to the last-level SST files only.

Added a new experimental ReadOption flag called optimize_multiget_for_io, which when set attempts to reduce MultiGet latency by spawning coroutines for keys in multiple levels.

Bug Fixes

Fix a bug starting in 7.4.0 in which some fsync operations might be skipped in a DB after any DropColumnFamily on that DB, until it is re-opened. This can lead to data loss on power loss. (For custom FileSystem implementations, this could lead to a call to FSDirectory::Fsync or FSDirectory::Close after the first FSDirectory::Close; also, valgrind could report a call to close() with fd=-1.)

Fix a bug where GenericRateLimiter could revert the bandwidth set dynamically using SetBytesPerSecond() when a user configures a structure enclosing it, e.g., using GetOptionsFromString() to configure an Options that references an existing RateLimiter object.

Fix race conditions in GenericRateLimiter.

Fix a bug in FIFOCompactionPicker::PickTTLCompaction where calculating total_size might cause underflow.

Fix data race bug in hash linked list memtable. With this bug, read request might temporarily miss an old record in the memtable in a race condition to the hash bucket.

... (truncated)

Commits

540d5aa Bump version.h to 7.5.3
0e860cb Fix regression issue of too large score (#10518)
dcd4435 Update HISTORY and version.h for 7.5.2
b91628b Fix serious FSDirectory use-after-Close bug (missing fsync) (#10460)
be73bf1 Merge pull request #10459 from gitbw95/patch_7.5.1_1
35bc988 Update History.md and version.h for 7.5.1
7c138a6 Update passing rate_limiter_priority for a PartitionedFilterBlockReader funct...
354fa5f Fix seqno->time worker not scheduled with multi DB instances (#10383)
a58fae9 Make RateLimiter not Customizable (#10378)
70ba7cb Fix hang in MultiRead with O_DIRECT and io_uring (#10368)
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

VIDA-NYU / ache

Bump rocksdbjni from 6.25.3 to 7.5.3 #303
