facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org
GNU General Public License v2.0

No SST file generated after putting data to RocksDB, only log file #4097

Closed: huor closed this issue 3 years ago

huor commented 6 years ago

Expected behavior

Following the steps open rocksdb, put data, get data, close rocksdb, an SST file should be generated. However, it is not. The RocksDB version is 5.10.3.

Following the steps open rocksdb, put data, close rocksdb, then open rocksdb again, get data, close rocksdb, an SST file is generated as expected. But recovery when reopening RocksDB for the get can take a long time, especially if the volume of data put to RocksDB is relatively large.

Actual behavior

No SST file is generated after putting data to RocksDB. Only the log file is present.

$ ls -alt /tmp/data/
total 80
drwxr-xr-x   2 wheel    306 Jul  6 17:27 .
drwxrwxrwt  10 wheel   3944 Jul  6 17:27 ..
-rw-r--r--   1 wheel     28 Jul  6 17:27 000003.log
-rw-r--r--   1 wheel     16 Jul  6 17:27 CURRENT
-rw-r--r--   1 wheel     33 Jul  6 17:27 IDENTITY
-rw-r--r--   1 wheel      0 Jul  6 17:27 LOCK
-rw-r--r--   1 wheel  14982 Jul  6 17:27 LOG
-rw-r--r--   1 wheel     13 Jul  6 17:27 MANIFEST-000001
-rw-r--r--   1 wheel   4663 Jul  6 17:27 OPTIONS-000005

However, following the steps open rocksdb, put data, close rocksdb, open it again, get data, close rocksdb, the SST file is generated as expected.

$ ls -alt /tmp/data/
total 128
drwxr-xr-x   2 wheel    408 Jul  6 17:45 .
drwxrwxrwt  10 wheel   3944 Jul  6 17:45 ..
-rw-r--r--   1 wheel    951 Jul  6 17:45 000004.sst
-rw-r--r--   1 wheel      0 Jul  6 17:45 000006.log
-rw-r--r--   1 wheel     16 Jul  6 17:45 CURRENT
-rw-r--r--   1 wheel     33 Jul  6 17:45 IDENTITY
-rw-r--r--   1 wheel      0 Jul  6 17:45 LOCK
-rw-r--r--   1 wheel  16049 Jul  6 17:45 LOG
-rw-r--r--   1 wheel  14982 Jul  6 17:45 LOG.old.1530870336298085
-rw-r--r--   1 wheel     90 Jul  6 17:45 MANIFEST-000005
-rw-r--r--   1 wheel   4663 Jul  6 17:45 OPTIONS-000005
-rw-r--r--   1 wheel   4663 Jul  6 17:45 OPTIONS-000008

Steps to reproduce the behavior

  1. Here is the C++ code for reproduction. There is no SST file even if the data volume put to RocksDB is much larger, e.g., about 1 GB.

    // setup environment
    assert(system("rm -rf /tmp/data") == 0);
    assert(system("mkdir -p /tmp/data") == 0);
    
    // STEP 1. prepare physical location for rocksdb instance
    std::string path("/tmp/data");
    
    // STEP 2. open rocksdb instance with specific location
    rocksdb::DB *db;
    rocksdb::Options options;
    
    options.compaction_style = rocksdb::kCompactionStyleLevel;
    options.compression_per_level.resize(options.num_levels);
    for (int i = 0; i < options.num_levels; ++i) {
      options.compression_per_level[i] = rocksdb::kLZ4Compression;
    }
    
    rocksdb::BlockBasedTableOptions table_options;
    table_options.block_size = 32 * 1024;
    options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
    options.IncreaseParallelism();
    options.create_if_missing = true;
    
    rocksdb::Status s = rocksdb::DB::Open(options, path, &db);
    assert(s.ok() == true);
    
    // STEP 3. put key value to rocksdb
    std::string key = "key";
    std::string val = "val";
    rocksdb::Slice ks(key.c_str(), key.size());
    rocksdb::Slice vs(val.c_str(), val.size());
    rocksdb::WriteBatch writeBatch;
    writeBatch.Put(ks, vs);
    s = db->Write(rocksdb::WriteOptions(), &writeBatch);
    assert(s.ok() == true);
    
    // STEP 4. prepare snapshot for get from rocksdb
    rocksdb::ReadOptions readOptions;
    bool keepSnapshot = true;
    if (keepSnapshot) {
      readOptions.snapshot = db->GetSnapshot();
    }
    std::unique_ptr<rocksdb::Iterator> readIter;
    readIter.reset(db->NewIterator(readOptions));
    
    // STEP 5. get key value from rocksdb
    std::string value;
    s = db->Get(readOptions, ks, &value);
    assert(s.ok() == true);
    assert(value == "val");
    
    // STEP 6. release snapshot in rocksdb for get
    if (readOptions.snapshot != nullptr) {
      db->ReleaseSnapshot(readOptions.snapshot);
    }
    readIter.reset();
    
    // STEP 7. close rocksdb instance
    if (db != nullptr) {
      delete db;
      db = nullptr;
    }
  2. The SST file is generated if the steps are open rocksdb, put data, close rocksdb, then open rocksdb again, get data, close rocksdb.

    // setup environment
    assert(system("rm -rf /tmp/data") == 0);
    assert(system("mkdir -p /tmp/data") == 0);
    
    // STEP 1. prepare physical location for rocksdb instance
    std::string path("/tmp/data");
    
    // STEP 2. open rocksdb instance with specific location for put
    rocksdb::DB *db;
    rocksdb::Options options;
    
    options.compaction_style = rocksdb::kCompactionStyleLevel;
    options.compression_per_level.resize(options.num_levels);
    for (int i = 0; i < options.num_levels; ++i) {
      options.compression_per_level[i] = rocksdb::kLZ4Compression;
    }
    
    rocksdb::BlockBasedTableOptions table_options;
    table_options.block_size = 32 * 1024;
    options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
    options.IncreaseParallelism();
    options.create_if_missing = true;
    
    rocksdb::Status s = rocksdb::DB::Open(options, path, &db);
    assert(s.ok() == true);
    
    // STEP 3. put key value to rocksdb
    std::string key = "key";
    std::string val = "val";
    rocksdb::Slice ks(key.c_str(), key.size());
    rocksdb::Slice vs(val.c_str(), val.size());
    rocksdb::WriteBatch writeBatch;
    writeBatch.Put(ks, vs);
    s = db->Write(rocksdb::WriteOptions(), &writeBatch);
    assert(s.ok() == true);
    
    // STEP 4. close rocksdb instance for put
    if (db != nullptr) {
      delete db;
      db = nullptr;
    }
    
    // STEP 5. open rocksdb instance for get
    s = rocksdb::DB::Open(options, path, &db);
    assert(s.ok() == true);
    
    // STEP 6. prepare snapshot for get from rocksdb
    rocksdb::ReadOptions readOptions;
    bool keepSnapshot = true;
    if (keepSnapshot) {
      readOptions.snapshot = db->GetSnapshot();
    }
    std::unique_ptr<rocksdb::Iterator> readIter;
    readIter.reset(db->NewIterator(readOptions));
    
    // STEP 7. get key value from rocksdb
    std::string value;
    s = db->Get(readOptions, ks, &value);
    assert(s.ok() == true);
    assert(value == "val");
    
    // STEP 8. release snapshot in rocksdb for get
    if (readOptions.snapshot != nullptr) {
      db->ReleaseSnapshot(readOptions.snapshot);
    }
    readIter.reset();
    
    // STEP 9. close rocksdb instance
    if (db != nullptr) {
      delete db;
      db = nullptr;
    }
miasantreble commented 6 years ago

You can either do a manual flush by calling Flush(), or close the db by calling Close(). See https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h for all DB-related APIs.
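
For illustration, a minimal sketch of the manual-flush route (the stripped-down setup below is illustrative, not taken from the report above; it assumes the same /tmp/data path):

    #include <cassert>
    #include "rocksdb/db.h"
    #include "rocksdb/options.h"

    int main() {
      rocksdb::DB *db = nullptr;
      rocksdb::Options options;
      options.create_if_missing = true;

      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/data", &db);
      assert(s.ok());

      s = db->Put(rocksdb::WriteOptions(), "key", "val");
      assert(s.ok());

      // Force the memtable contents to be written out as an SST file now,
      // instead of waiting for write_buffer_size (64MB by default) to fill up.
      s = db->Flush(rocksdb::FlushOptions());
      assert(s.ok());

      delete db;  // close the instance
      return 0;
    }

After Flush() returns, an .sst file should appear in /tmp/data without reopening the db.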

You also mentioned that even with 1G written there is still no flush happening, which is unexpected: with the default write_buffer_size of 64MB, an automatic flush should be triggered well before that. Do you mind posting your OPTIONS file? We can check if there is any configuration that caused this.

huor commented 6 years ago

Thanks @miasantreble for the suggestion!

Say we have three solutions here:

  • Solution 1: open rocksdb, write flush, read, close rocksdb
  • Solution 2: open rocksdb, write without flush, close rocksdb, open rocksdb, read, close rocksdb
  • Solution 3: open rocksdb, write with flush, read, close rocksdb

I did some investigation with about 600M of raw data, which becomes about 300M of SST files after being written to RocksDB with my compression settings. Here are my findings:

Would you please shed some light on how to ensure data is persisted to SST files while still getting the best performance? Thanks in advance.

RocksDB is configured with write_buffer_size=67108864 and db_write_buffer_size=0 in my environment. Please find the detailed OPTIONS below for your reference.

$ cat /tmp/data/OPTIONS-000005
# This is a RocksDB option file.
#
# For detailed file format spec, please refer to the example file
# in examples/rocksdb_option_file_example.ini
#

[Version]
  rocksdb_version=5.10.3
  options_file_version=1.1

[DBOptions]
  allow_mmap_writes=false
  base_background_compactions=-1
  new_table_reader_for_compaction_inputs=false
  db_log_dir=
  wal_recovery_mode=kPointInTimeRecovery
  use_direct_reads=false
  write_thread_max_yield_usec=100
  max_manifest_file_size=18446744073709551615
  allow_2pc=false
  allow_fallocate=true
  fail_if_options_file_error=false
  allow_ingest_behind=false
  allow_mmap_reads=false
  skip_log_error_on_recovery=false
  recycle_log_file_num=0
  delete_obsolete_files_period_micros=21600000000
  compaction_readahead_size=0
  use_direct_io_for_flush_and_compaction=false
  log_file_time_to_roll=0
  create_missing_column_families=false
  advise_random_on_open=true
  max_log_file_size=0
  stats_dump_period_sec=600
  enable_thread_tracking=false
  use_adaptive_mutex=false
  create_if_missing=true
  is_fd_close_on_exec=true
  max_background_flushes=-1
  manifest_preallocation_size=4194304
  error_if_exists=false
  skip_stats_update_on_db_open=false
  max_open_files=-1
  random_access_max_buffer_size=1048576
  use_fsync=false
  max_background_jobs=16
  two_write_queues=false
  max_background_compactions=-1
  max_file_opening_threads=16
  table_cache_numshardbits=6
  keep_log_file_num=1000
  avoid_flush_during_shutdown=false
  db_write_buffer_size=0
  max_total_wal_size=0
  wal_dir=/tmp/data
  max_subcompactions=1
  WAL_size_limit_MB=0
  paranoid_checks=true
  allow_concurrent_memtable_write=true
  writable_file_max_buffer_size=1048576
  WAL_ttl_seconds=0
  delayed_write_rate=16777216
  bytes_per_sync=0
  wal_bytes_per_sync=0
  enable_pipelined_write=false
  enable_write_thread_adaptive_yield=true
  write_thread_slow_yield_usec=3
  access_hint_on_compaction_start=NORMAL
  info_log_level=INFO_LEVEL
  dump_malloc_stats=false
  avoid_flush_during_recovery=false
  preserve_deletes=false
  manual_wal_flush=false

[CFOptions "default"]
  report_bg_io_stats=false
  inplace_update_support=false
  max_compaction_bytes=1677721600
  disable_auto_compactions=false
  write_buffer_size=67108864
  bloom_locality=0
  max_bytes_for_level_multiplier=10.000000
  compaction_filter_factory=nullptr
  optimize_filters_for_hits=false
  target_file_size_base=67108864
  max_write_buffer_number_to_maintain=0
  hard_pending_compaction_bytes_limit=274877906944
  paranoid_file_checks=false
  memtable_prefix_bloom_size_ratio=0.000000
  force_consistency_checks=false
  max_write_buffer_number=2
  max_bytes_for_level_multiplier_additional=1:1:1:1:1:1:1
  level0_slowdown_writes_trigger=20
  level_compaction_dynamic_level_bytes=false
  compaction_options_fifo={allow_compaction=false;ttl=0;max_table_files_size=1073741824;}
  inplace_update_num_locks=10000
  level0_file_num_compaction_trigger=4
  compression=kSnappyCompression
  level0_stop_writes_trigger=36
  num_levels=7
  table_factory=BlockBasedTable
  compression_per_level=kLZ4Compression:kLZ4Compression:kLZ4Compression:kLZ4Compression:kLZ4Compression:kLZ4Compression:kLZ4Compression
  target_file_size_multiplier=1
  min_write_buffer_number_to_merge=1
  arena_block_size=8388608
  max_successive_merges=0
  memtable_huge_page_size=0
  compaction_pri=kByCompensatedSize
  soft_pending_compaction_bytes_limit=68719476736
  max_bytes_for_level_base=268435456
  comparator=leveldb.BytewiseComparator
  max_sequential_skip_in_iterations=8
  bottommost_compression=kDisableCompressionOption
  prefix_extractor=nullptr
  memtable_insert_with_hint_prefix_extractor=nullptr
  memtable_factory=SkipListFactory
  compaction_filter=nullptr
  compaction_options_universal={allow_trivial_move=false;stop_style=kCompactionStopStyleTotalSize;min_merge_width=2;compression_size_percent=-1;max_size_amplification_percent=200;max_merge_width=4294967295;size_ratio=1;}
  merge_operator=nullptr
  compaction_style=kCompactionStyleLevel

[TableOptions/BlockBasedTable "default"]
  format_version=2
  whole_key_filtering=true
  verify_compression=false
  partition_filters=false
  index_block_restart_interval=1
  block_size_deviation=10
  block_size=32768
  pin_l0_filter_and_index_blocks_in_cache=false
  block_restart_interval=16
  filter_policy=nullptr
  metadata_block_size=4096
  no_block_cache=false
  checksum=kCRC32c
  read_amp_bytes_per_bit=8589934592
  cache_index_and_filter_blocks=false
  index_type=kBinarySearch
  hash_index_allow_collision=true
  cache_index_and_filter_blocks_with_high_priority=false
  flush_block_policy_factory=FlushBlockBySizePolicyFactory
szstonelee commented 4 years ago

I have the same issue. When calling db->Put(), no SST files are generated in the folder. I can only list LOG, MANIFEST, CURRENT, IDENTITY, LOCK, and OPTIONS files.

It only happens in a Linux VM (Ubuntu 18.04, using Multipass) on my Mac. When I ran the application directly on macOS 10.15, I saw the SST files.

I tried with and without the WAL, and with sync and no sync on each Put().

riversand963 commented 4 years ago

> I have the same issue. When calling db->Put(), no SST files are generated in the folder. I can only list LOG, MANIFEST, CURRENT, IDENTITY, LOCK, and OPTIONS files.
>
> It only happens in a Linux VM (Ubuntu 18.04, using Multipass) on my Mac. When I ran the application directly on macOS 10.15, I saw the SST files.
>
> I tried with and without the WAL, and with sync and no sync on each Put(). The results were the same in the VM and on macOS.

What is the value of avoid_flush_during_shutdown in your test? If you reopen the db, can you find all the data previously written?
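
A sketch of the two things being asked about here, i.e. the option value and the reopen-and-read check (the path and key are taken from the original report; the rest is illustrative):

    #include <cassert>
    #include <string>
    #include "rocksdb/db.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      // Default is false; setting it to true skips the flush that RocksDB
      // would otherwise do on shutdown for data not persisted elsewhere.
      options.avoid_flush_during_shutdown = false;

      rocksdb::DB *db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/data", &db);
      assert(s.ok());

      // Reopen check: data written before the previous close should still be
      // readable here (recovered from the WAL if it was never flushed to SST).
      std::string value;
      s = db->Get(rocksdb::ReadOptions(), "key", &value);
      assert(s.ok() && value == "val");

      delete db;
      return 0;
    }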

szstonelee commented 4 years ago

@riversand963 Thank you for the quick reply.

I tried again.

As avoid_flush_during_shutdown is not explicitly set in my application, the default value should be false.

I tried with the WAL enabled and disabled (the write option sync is true for each Put). The results are the same: only LOG files (every LOG record shows 0.0 for write and compaction), and no SST files.
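
Roughly, the write-option combinations tried look like this (a sketch, not the exact code from the application):

    rocksdb::WriteOptions wo;
    wo.sync = true;         // sync on/off was toggled here
    wo.disableWAL = false;  // WAL on/off was toggled here (true disables the WAL)
    rocksdb::Status s = db->Put(wo, "key", "val");
    assert(s.ok());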

There are two test setups: one is the Ubuntu VM on the Mac, the other is Docker inside the Ubuntu VM. The results are the same as above. The Ubuntu VM is run with Multipass.

Only when I run my application on macOS do I see the SST files immediately. I do not have a bare-metal Ubuntu machine, so I cannot tell what would happen there.

BTW:

  1. If I run my application for a long time, continually inserting new keys into RocksDB, it goes OOM after a couple of minutes in the VM environment, and I can hear a sound like an HDD spinning (but my Mac has no HDD, only an SSD).

  2. Even if I safely close the database by deleting the database object, there are no SST files at all.

szstonelee commented 4 years ago

@riversand963 I found it is something related to jemalloc. When I changed the memory allocation library from jemalloc to libc, the .sst files started to show up in the RocksDB folder for the Linux VM on macOS. Because the macOS build does not use jemalloc in my Makefile, the problem did not occur on macOS, as I described above.

riversand963 commented 3 years ago

@szstonelee Can you provide more details on why jemalloc would trigger this issue?

szstonelee commented 3 years ago

@riversand963 I do not know why. I only know that in the Multipass Ubuntu VM running on my host macOS, when my code in RedRock (https://github.com/szstonelee/RedRock) calls the RocksDB API with jemalloc, it does not generate any SST files. But with nothing else changed, only switching to libc (by modifying src/makefile), the SST files show up. It may be an issue related to Multipass, macOS, jemalloc, or RocksDB, but I do not know which one is the cause, so I reported the bug here.

riversand963 commented 3 years ago

@szstonelee I see. Do you plan to continue the investigation and share more info?

szstonelee commented 3 years ago

Sorry, right now I have no time to investigate the issue. In the future, if I find something new or fix the bug, I will let you know.

riversand963 commented 3 years ago

There are a number of variables here, and based on the currently available information, I do not have a good theory about the cause. Since nobody is actively investigating, I'll close this for now. Feel free to reopen if it still affects you.

IloveKanade commented 2 years ago

> Thanks @miasantreble for the suggestion!
>
> Say we have three solutions here:
>
> • Solution 1: open rocksdb, write flush, read, close rocksdb
> • Solution 2: open rocksdb, write without flush, close rocksdb, open rocksdb, read, close rocksdb
> • Solution 3: open rocksdb, write with flush, read, close rocksdb

Hi, has this problem been resolved? The same thing happened in my environment: I exceeded write_buffer_size, but still no SST file was generated.