facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org
GNU General Public License v2.0

IO error: While pread offset 18446744073709547520 len 8192: /000227.sst: Invalid argument @newest version when using those table options #9220

Closed dongdongwcpp closed 2 years ago

dongdongwcpp commented 2 years ago

Expected behavior

RocksDB writes succeed.

Actual behavior

In rocksdb 6.24, buildType Release: IO error: While pread offset 18446744073709547520 len 8192: /tmp/rocksdb_simple_example/000227.sst: Invalid argument

In rocksdb 6.24, buildType Debug: rocksdb/table/block_based/block.h:629: virtual rocksdb::IndexValue rocksdb::IndexBlockIter::value() const: Assertion `Valid()' failed.

Steps to reproduce the behavior

Initialize RocksDB as follows; the combination of these five table_options reproduces the failure.

#include <cassert>
#include <cstdio>
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/statistics.h"
#include "cache/cache_entry_roles.h"
#include "cache/lru_cache.h"
#include "db/column_family.h"
#include "db/internal_stats.h"

#include "rocksdb/slice.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"
#include <iostream>
using namespace std;

using namespace ROCKSDB_NAMESPACE;

std::string kDBPath = "/tmp/rocksdb_simple_example";
int main(int argc, char* argv[]) {
  DB* db;
  Options options;  
  rocksdb::BlockBasedTableOptions table_options;
  // The combination of these five table_options triggers the failure
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));
  table_options.partition_filters = true;
  table_options.prepopulate_block_cache = rocksdb::BlockBasedTableOptions::PrepopulateBlockCache::kFlushOnly;
  table_options.cache_index_and_filter_blocks = true;
  table_options.index_type = rocksdb::BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch;

  options.table_factory.reset(NewBlockBasedTableFactory(table_options));

  // create the DB if it's not already present
  options.create_if_missing = true;

  // open DB
  Status s = DB::Open(options, kDBPath, &db);
  assert(s.ok());
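  WriteOptions wop;
  wop.disableWAL = true;
  // Put 1 MiB values in a loop. The loop bound is an assumption: the snippet
  // as posted used `i` without showing the enclosing loop.
  string value;
  value.assign(1024 * 1024, 'a');
  for (int i = 0; i < 1000; i++) {
    s = db->Put(wop, "key11" + std::to_string(i), value);
    assert(s.ok());
  }
  delete db;
  return 0;
}
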
hx235 commented 2 years ago

@dongdongwcpp Thanks for the repro code! It seems a flush is also needed to trigger this assertion: either put s = db->Put(wop, "key11" + std::to_string(i), value); in a big loop over i, or call something like db->Flush(FlushOptions()) after any number of Puts (even a single Put can trigger the assertion).

I also verified that removing any one of partition_filters + index_type or cache_index_and_filter_blocks avoids the assertion. So I suspect one of them is not compatible with the others.

===== Below are some debugging/impl details for my teammates - feel free to ignore it @dongdongwcpp :) =====

@akankshamahajan15 Does the "block cache pre-population" feature currently support partitioned filters? Adding table_options.partition_filters = true; table_options.index_type = rocksdb::BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch; to TEST_F(DBBlockCacheTest, WarmCacheWithBlocksDuringFlush) on the 6.24.fb branch exposes the same error, and a slightly different one on the latest upstream/main. I believe both errors are related to rocksdb::PartitionedFilterBlockReader::CacheDependencies(rocksdb::ReadOptions const&, bool).

I did some initial debugging on 6.24.fb branch:

(cc @ltamasi FYI)

ltamasi commented 2 years ago

Thanks for the deep dive @hx235 ! Based on your analysis above, the offending block is definitely the top-level index of the partitioned filters, for two reasons: 1) that's the only "index-like" block when it comes to partitioned filters; you can't iterate over a filter partition and 2) that's the block that is handled by GetOrReadFilterBlock when partitioned filters are in use.

BTW, the branch in GetOrReadFilterBlock you mentioned kicks in when the partitioned filter reader has direct access to the top-level index, i.e. when a) cache_index_and_filter_blocks is false, or b) cache_index_and_filter_blocks is true and the top-level index is pinned in the cache. If filter_block_ here refers to a zero-sized block, that shouldn't happen and probably points to an issue with the preloading code.

ltamasi commented 2 years ago

Based on my cursory reading of the cache pre-populating logic, the issue seems to be in BlockBasedTableBuilder::InsertBlockInCacheHelper: namely, I think we're missing a special case for the top-level index of partitioned filters (which is conceptually a quote-unquote filter block but format-wise is a regular key-value Block).

akankshamahajan15 commented 2 years ago

Thanks @hx235 and @ltamasi for the detailed explanation. I will look into it.

pdillinger commented 2 years ago

This surely would have been caught if prepopulate_block_cache was added to db_stress test. (Reviewers should have noticed that.)

akankshamahajan15 commented 2 years ago

> This surely would have been caught if prepopulate_block_cache was added to db_stress test. (Reviewers should have noticed that.)

Yes. After fixing it, I will add it to db_stress as well and keep this in mind for future implementations.