Closed yuslepukhin closed 6 years ago
Taking a look ...
I think AlignedBuffer is not used properly in a couple of places. The notion of the class encapsulation is that it knows its size and would restrict operations based on what it knows. The example includes Read() method. It would never read more than the buffer has and will return the actual amount. The recent changes violate that principle.
Another example of improper use is at https://github.com/facebook/rocksdb/blob/master/util/file_reader_writer.h#L214
There FilePrefetchBuffer has a member bufferlen which is a duplicate of what AlignedBuffer has. The code updates FilePrefertchBuffer member but not the actual buffer. They never agree on each other because when the data is written directly into the AlignedBuffer the buffer is never told its length by calling Size(size_t) method. The end result is that the AlignedBuffer does not know its proper length and can not operate properly.
I haven't been able to reproduce the crash/access_violation on Linux, running on even a much large db and larger number of ops. I have also been running a few stress tests over the past few days without an issue. Adding more tests and tweaks to see if I can repro this, and also trying to setup a windows environment.
@sagar0 I have a 100% repro case. I suggest that you simply run under the debugger and put a conditional breakpoint (or an assert) for the condition outlined in the above scenario. I am lucky to repro because the beyond the buffer access falls beyond the mapped memory region on windows but in a different setup the memory can be mapped and thus no immediate impact.
@sagar0 The main issue is that the buffer access is driven from the outside and the internal knowledge of the buffer is neither consulted nor enforced, thus anyone can supply arbitrary offset and data size to copy.
@sagar0 I think this is exactly the problem
@yuslepukhin I investigated more into the issue yesterday and am currently working on a fix.
I am going to PR my proposal.
Created #4105 to remove external tracking of AlignedBuffer size. Do you think we'd need more than that?
@sagar0 I think, what needs to be done is the actual size to be copied to be enforced by std::min/max functions based on the current data size and buffer capacity. And then return the actual size copied. This way, the caller can assert() to see if his expectations are met.
@sagar0 Also there are many spaces in encryption code (!) where the buffer size is not set. Again, once copying is enforced, it is going to break
I missed making changes in the env_encryption code as I was solely concentrating on file_reader_writer APIs. I'll update env_encryption code as well.
@sagar0 So providing tests pass and encryption is addressed, the only thing is to return the number of bytes copied from Allocate() so the caller could verify it meets the expections.
@sagar0 With regards to encryption code. It a lot of times simply memmove data into the buffer. One can simply call Append() and it would take care of the size and everything else.
There are multiple different problems we found while investigating this issue. Let me summarize our discussion based on my understanding so far:
@sagar0 This is spot on! Thank you for attending to this.
Expected behavior
db_bench does not core dump
Actual behavior
db_bench when running seekrandom produces access violation (jemalloc build)
In the below code the current size and the capacity of the buffer is 12288, however, the memcpy attempts to copy bufstart_ + copy_offset (12288), copy_len(12288) from the old buffer resulting in out of buffer access.
Steps to reproduce the behavior
--benchmarks=fillrandom --mmap_read=0 --statistics=1 --histogram=1 --num=10000000 --threads=1 --prefix_size=0 --keys_per_prefix=40 --value_size=800 --block_size=4096 --cache_size=1048576 --bloom_bits=10 --memtable_bloom_size_ratio=0 --cache_numshardbits=6 --open_files=500000 --verify_checksum=1 --db=d:\mnt\RandomSeekNext_10M --sync=0 --disable_wal=1 --compression_type=snappy --stats_interval=1000000 --compression_ratio=0.5 --write_buffer_size=67108864 --target_file_size_base=67108864 --max_write_buffer_number=3 --max_background_compactions=20 --max_background_flushes=4 --enable_pipelined_write=false --allow_concurrent_memtable_write=false --enable_write_thread_adaptive_yield=true --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=8 --level0_stop_writes_trigger=12 --num_levels=6 --delete_obsolete_files_period_micros=300000000 --min_level_to_compress=2 --max_compaction_bytes=0 --stats_per_interval=1 --max_bytes_for_level_base=10485760 --use_existing_db=0 --use_direct_reads=1 --use_direct_io_for_flush_and_compaction=1 --compaction_readahead_size=3145728 --writable_file_max_buffer_size=1048576 -new_table_reader_for_compaction_inputs=1
Then issue seekrandom scenario (adjust as approp)
--benchmarks=seekrandom --mmap_read=0 --statistics=1 --histogram=1 --num=10000000 --threads=32 --prefix_size=0 --keys_per_prefix=40 --value_size=800 --block_size=4096 --cache_size=1048576 --bloom_bits=10 --memtable_bloom_size_ratio=0 --cache_numshardbits=6 --open_files=500000 --verify_checksum=1 --db=d:\mnt\RandomSeekNext_10M --sync=0 --disable_wal=1 --compression_type=snappy --seek_nexts=40 --stats_interval=1000000 --compression_ratio=0.5 --write_buffer_size=67108864 --target_file_size_base=67108864 --max_write_buffer_number=3 --max_background_compactions=20 --max_background_flushes=4 --enable_pipelined_write=false --allow_concurrent_memtable_write=false --enable_write_thread_adaptive_yield=true --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=8 --level0_stop_writes_trigger=12 --num_levels=6 --delete_obsolete_files_period_micros=300000000 --min_level_to_compress=2 --max_compaction_bytes=0 --stats_per_interval=1 --max_bytes_for_level_base=10485760 --use_existing_db=1 --use_direct_reads=1 --use_direct_io_for_flush_and_compaction=1 --duration=600 --compaction_readahead_size=3145728 --writable_file_max_buffer_size=1048576 --new_table_reader_for_compaction_inputs=1