facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org
GNU General Public License v2.0
28.23k stars 6.26k forks source link

Failing unit tests while running makecheck #11392

Open jehangiriqbal opened 1 year ago

jehangiriqbal commented 1 year ago

Summary

I am creating this issue because I stumbled upon an issue in RocksDB release.

Some unit tests are failing (complete list below) on on Amazon Linux EC2 instance of type i3.8xlarge (the steps to reproduce the issue and sample docker file is attached). The tests are failing because they are using std::chrono::steady_clock to measure durations in nanoseconds. Consecutive calls to steady_clock::now() are returning the same time point frequently and failing the assertion. I would suggest to remove those tests or modify the tests to make sure that they can take up some time on fastest of machines and time reported is never zero.

I am currently running RocksDB make check on an AmazonEC2 instance with the following configuration

Expected behavior

All tests should pass

Actual behavior

I run into the failures for the following test cases, complete buildlog.txt is here

[ RUN      ] EnvTest.GenerateRawUniqueIdTrackEnvDetailsOnly
env/env_test.cc:2962: Failure
Expected equality of these values:
Inids.size()
    Which is: 7500
  kThreads * kIdsPerThrea
    Which is: 8000
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
  what():  env/env_test.cc:2962: Failurelibthread_db.so.1".
Expected equality of these values:
  ids.size()
    Which is: 7500_
  kThreads * kIdsPerThread
    Which is: 8000
Received signal 6 (Aborted)

[ RUN      ] DBTest2.TestPerfContextIterCpuTime
db/db_test2.cc:2518: Failure
Expected: (get_perf_context()->iter_next_cpu_nanos) > (0), actual: 0 vs 0
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
  what():  db/db_test2.cc:2518: Failure
Expected: (get_perf_context()->iter_next_cpu_nanos) > (0), actual: 0 vs 0
Received signal 6 (Aborted)

[ RUN      ] EventListenerTest.OnFileOperationTest
db/listener_test.cc:1166: Failure
Expected: (info.duration.count()) > (0), actual: 0 vs 0
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
  what():  db/listener_test.cc:1166: Failure
Expected: (info.duration.count()) > (0), actual: 0 vs 0
Received signal 6 (Aborted)

[ RUN      ] EventListenerTest.OnBlobFileOperationTest
db/listener_test.cc:1166: Failure
Expected: (info.duration.count()) > (0), actual: 0 vs 0
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
  what():  db/listener_test.cc:1166: Failure
Expected: (info.duration.count()) > (0), actual: 0 vs 0
Received signal 6 (Aborted)

[ RUN      ] EventListenerTest.ReadManifestAndWALOnRecovery
db/listener_test.cc:1166: Failure
Expected: (info.duration.count()) > (0), actual: 0 vs 0
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
  what():  db/listener_test.cc:1166: Failure
Expected: (info.duration.count()) > (0), actual: 0 vs 0
Received signal 6 (Aborted)

[ RUN ] PerfContextTest.CPUTimer
db/perf_context_test.cc:950: Failure
Expected: (next_count) > (count), actual: 27385 vs 27385
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
what(): db/perf_context_test.cc:950: Failure
Expected: (next_count) > (count), actual: 27385 vs 27385
Received signal 6 (Aborted)

[ RUN      ] PerfContextTest.MergeOperatorTime
db/perf_context_test.cc:703: Failure
Expected: (get_perf_context()->merge_operator_time_nanos) > (0), actual: 0 vs 0
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
  what():  db/perf_context_test.cc:703: Failure
Expected: (get_perf_context()->merge_operator_time_nanos) > (0), actual: 0 vs 0
Received signal 6 (Aborted)

[ RUN      ] EnvTest.GenerateRawUniqueIdTrackEnvDetailsOnly
env/env_test.cc:2962: Failure
Expected equality of these values:
Inids.size()
    Which is: 7500
  kThreads * kIdsPerThrea
    Which is: 8000
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
  what():  env/env_test.cc:2962: Failurelibthread_db.so.1".
Expected equality of these values:
  ids.size()
    Which is: 7500_
  kThreads * kIdsPerThread
    Which is: 8000
Received signal 6 (Aborted)

Steps to reproduce the behavior

The tests failures can be reproduced using the following docker file and using an Amazon Linux EC2 instance of type i3.8xlarge

Dockerfile

FROM amazonlinux:2

RUN yum -y groupinstall "Development Tools"
RUN yum -y install which

RUN cd / && git clone --depth 1 https://github.com/facebook/rocksdb.git
RUN cd /rocksdb && make -j 32 all

WORKDIR /rocksdb
CMD make J=32 check TEST_TMPDIR=/tmp/rocksdb SKIP_FORMAT_BUCK_CHECKS=1

Investigation

On investigation, we found that there is a single cause of failures for all these tests. The tests are using std::chrono::steady_clock to measure durations in nanoseconds. Consecutive calls to steady_clock::now() are returning the same time point frequently and failing the assertion.

System configuration

System configuration

Platform: Amazon Linux Instance type: i3.8xlarge kernel: 5.10.176-157.645.amzn2.x86_64 g++ -v output:

Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/7/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,objc,obj-c++,fortran,ada,go,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla —enable-shared —enable-threads=posix —enable-checking=release —enable-multilib —with-system-zlib —enable-__cxa_atexit —disable-libunwind-exceptions —enable-gnu-unique-object —enable-linker-build-id —with-gcc-major-version-only —with-linker-hash-style=gnu —enable-plugin —enable-initfini-array —with-isl —enable-libmpx —enable-libsanitizer —enable-gnu-indirect-function —enable-libcilkrts —enable-libatomic —enable-libquadmath —enable-libitm —with-tune=generic —with-arch_32=x86-64 —build=x86_64-redhat-linux
Thread model: posix
gcc version 7.3.1 20180712 (Red Hat 7.3.1-15) (GCC)
cz2h commented 1 year ago

Hi, I want to contribute to this issue but failed to reproduce the issues.

I followed the steps provided except the kernel version of my ec2 instance is "5.10.179-171.711.amzn2.x86_64".

Can someone else please confirm the issues? Thanks.

jehangiriqbal commented 1 year ago

Hi, thanks for looking into this issue

I feel that on your machine the behaviour of std::chrono::steady_clock::now is correct for consecutive calls. Please confirm it using the following code

repro.cpp


#include <iostream>
#include <iomanip>
#include <chrono>

int main()
{
    const size_t REPEAT_COUNT = 10'000'000;
    size_t count = 0;
    size_t count_zeros = 0;
    size_t total = 0;
    std::cout << std::fixed << std::setprecision(9) << std::left;
    for (auto i = 0; i < REPEAT_COUNT; ++i) {
        // record start time
        auto start = std::chrono::steady_clock::now();
        // do no work
        ++count;
        // record end time
        auto end = std::chrono::steady_clock::now();

        std::chrono::nanoseconds diff = end - start;
        if (diff.count() == 0) {
            ++count_zeros;
        } else {
            total += diff.count();
        }
    }
    std::cout << "steady_clock Total " << total << " ns - Average: "
              << (static_cast<double>(total) / count) << " ns\n";
    std::cout << "Got 0ns between steady_clock::now() calls "
              << count_zeros << " out of " << count << "\n";
}

Command to execute


g++ -std=c++17 repro.cpp -o repro
./repro

Expected output
steady_clock Total 545533735 ns - Average: 54.553373500 ns
Got 0ns between steady_clock::now() calls 0 out of 10000000

Incorrect output
steady_clock Total 545533735 ns - Average: 54.553373500 ns
Got 0ns between steady_clock::now() calls <any number other than 0> out of 10000000

Also, can you please check the clock type on the machine using this command

cat /sys/devices/system/clocksource/clocksource0/current_clocksource
cz2h commented 1 year ago

Hi, I am now able to reproduce this issue on a EC2 I3.8XLARGE instance now. Can someone assign this problem to me?

dnanuti commented 1 year ago

Also encountered failure of:

[ RUN ] PerfContextTest.DBMutexLockCounter
db/perf_context_test.cc:615: Failure
Expected: (get_perf_context()->db_mutex_lock_nanos) > (0), actual: 0 vs 0
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
  what():  db/perf_context_test.cc:615: Failure
Expected: (get_perf_context()->db_mutex_lock_nanos) > (0), actual: 0 vs 0 

Same setup as above.

cz2h commented 1 year ago

I think this type of issue could be common in existing unit tests. Can someone help look into this PR and comment on proposed fixs?