Some test failures in 0.7.0 when compiled gcc -O2

musicinmybrain commented 5 months ago

I was working on updating the Fedora package to 0.7.0 and encountered a couple of test failures:

97% tests passed, 2 tests failed out of 65
Total Test time (real) =   0.76 sec
The following tests FAILED:
     41 - ryml-test-scalar_literal (Failed)
     42 - ryml-test-scalar_folded (Subprocess aborted)

I was able to reproduce these in a git checkout on Fedora 40 (x86_64) by setting the same build flags and options as the RPM build; here I encountered a third test with failures,

99% tests passed, 3 tests failed out of 1644

Total Test time (real) =  21.90 sec

The following tests FAILED:
         41 - ryml-test-scalar_literal (Failed)
         42 - ryml-test-scalar_folded (Subprocess aborted)
        1543 - ryml-test-suite-W4TN-in_yaml-events (Failed)

To reproduce in a checkout, I imitated the RPM build. This uses the default options from the %cmake RPM macro, plus:

-GNinja – because it is faster and has no real disadvantages
-DRYML_CXX_STANDARD=14 – for compatibility with the current packaged gtest, which no longer supports C++11; changing this to C++17 makes no difference
-DRYML_BUILD_TESTS:BOOL=ON
-DRYML_TEST_FUZZ:BOOL=OFF – so that we do not have to include the contents of https://github.com/biojppm/rapidyaml-data in the source RPM

That looks like:

gh repo clone biojppm/rapidyaml
cd rapidyaml
git checkout v0.7.0
git submodule update --init --recursive

CFLAGS="${CFLAGS:--O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64 -march=x86-64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer }" ; export CFLAGS ;
CXXFLAGS="${CXXFLAGS:--O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64 -march=x86-64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer }" ; export CXXFLAGS ;
FFLAGS="${FFLAGS:--O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64 -march=x86-64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -I/usr/lib64/gfortran/modules }" ; export FFLAGS ; 
FCFLAGS="${FCFLAGS:--O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64 -march=x86-64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -I/usr/lib64/gfortran/modules }" ; export FCFLAGS ; 
VALAFLAGS="${VALAFLAGS:--g}" ; export VALAFLAGS ;
RUSTFLAGS="${RUSTFLAGS:--Copt-level=3 -Cdebuginfo=2 -Ccodegen-units=1 -Cstrip=none -Cforce-frame-pointers=yes --cap-lints=warn}" ; export RUSTFLAGS ; 
LDFLAGS="${LDFLAGS:--Wl,-z,relro -Wl,--as-needed  -Wl,-z,pack-relative-relocs -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld-errors -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -Wl,--build-id=sha1  }" ; export LDFLAGS ;
LT_SYS_LIBRARY_PATH="${LT_SYS_LIBRARY_PATH:-/usr/lib64:}" ; export LT_SYS_LIBRARY_PATH ;
CC="${CC:-gcc}" ; export CC ;
CXX="${CXX:-g++}" ; export CXX

/usr/bin/cmake \
        -S "." \
        -B "redhat-linux-build" \
        -DCMAKE_C_FLAGS_RELEASE:STRING="-DNDEBUG" \
        -DCMAKE_CXX_FLAGS_RELEASE:STRING="-DNDEBUG" \
        -DCMAKE_Fortran_FLAGS_RELEASE:STRING="-DNDEBUG" \
        -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON \
        -DCMAKE_INSTALL_DO_STRIP:BOOL=OFF \
        -DCMAKE_INSTALL_PREFIX:PATH=/usr \
        -DINCLUDE_INSTALL_DIR:PATH=/usr/include \
        -DLIB_INSTALL_DIR:PATH=/usr/lib64 \
        -DSYSCONF_INSTALL_DIR:PATH=/etc \
        -DSHARE_INSTALL_PREFIX:PATH=/usr/share \
        -DLIB_SUFFIX=64 \
        -DBUILD_SHARED_LIBS:BOOL=ON \
        -GNinja \
        -DRYML_CXX_STANDARD=14 \
        -DRYML_BUILD_TESTS:BOOL=ON \
        -DRYML_TEST_FUZZ:BOOL=OFF

/usr/bin/cmake --build "redhat-linux-build" -j16 --verbose
/usr/bin/cmake --build "redhat-linux-build" -j16 --verbose --target ryml-test-run-verbose

By trial and error, I was able to minimize that to the following:

gh repo clone biojppm/rapidyaml
cd rapidyaml
git checkout v0.7.0
git submodule update --init --recursive

CXXFLAGS='-O2' /usr/bin/cmake -S . -B build \
        -DCMAKE_CXX_FLAGS_RELEASE:STRING="-DNDEBUG" \
        -DRYML_BUILD_TESTS:BOOL=ON

/usr/bin/cmake --build build -j16 --verbose
/usr/bin/cmake --build build -j16 --verbose --target ryml-test-run-verbose

In short, in order to reproduce this, it seems to suffice to set -O2 -NDEBUG.

musicinmybrain commented 5 months ago

I would conjecture that perhaps something somewhere is relying on undefined behavior, and as a result the optimizer ends up generating incorrect code – but an assertion happens to impede the optimization that reveals the problem. This would explain why the tests seem to pass when NDEBUG is not defined.

biojppm commented 5 months ago

Thanks for the detailed repro, which is helpful.

I was able to reproduce the problem with gcc. But not with clang; didn't try with visual studio.

FYI during the parser refactor I've had similar trouble with the visual studio 32 bit optimizer; none of the sanitizers or static analyzers were reporting any problem there or anywhere else. Like I noted there, I was only able to work around the problem, which I did by moving the problematic code to a separate function.

The outward signs are similar to that problem; if I enable the logs in the parser the failure goes away.

Also, using -O3 makes the failure go away as well.

Fun times ahead.

musicinmybrain commented 5 months ago

Thanks for the detailed repro, which is helpful.

You’re welcome. It took a few days of intermittently poking at this for me to get from “it breaks in my environment, but I can’t tell you why” to something useful enough to report.

I was able to reproduce the problem with gcc (but not with clang; didn't try with visual studio).

I’m glad to hear it! At least we have consistency.

Fun times ahead.

Yes, I always dread reporting bugs like this one because I know how hard they can be to investigate.

I’ll keep Fedora Rawhide at 0.6.0 for now. Please let me know if there’s anything I can do to help (other than actually finding the root cause!)

biojppm commented 4 months ago

Well, the dread for these bugs is real, and pervasive. It is a laborious process to trace them to their source.

This command will reproduce all failures:

cd ~/proj/rapidyaml ;  \
( export BDIR=440_o2 ; \
  CXXFLAGS='-O2' cmake -S . -B $BDIR -DCMAKE_CXX_FLAGS_RELEASE:STRING="-DNDEBUG" -DRYML_BUILD_TESTS:BOOL=ON -DRYML_DBG:BOOL=OFF ; \
  pairs=( \
    "ryml-test-scalar_literal,--gtest_filter=*/YmlT*ix/7" \
    "ryml-test-scalar_folded,--gtest_filter=*/YmlT*ix/0" \
    "ryml-test-suite,W4TN-in_yaml /home/jpmag/proj/extern/yaml-test-suite/W4TN in.yaml --gtest_filter=_/unix_in_arena.1_compare_emitted_events_to_ref_events/0" \
  ) ; \
  for pair in "${pairs[@]}" ; do \
    IFS=, read tgt filt <<< "$pair"  ; \
    echo "$tgt -- $filt" ; \
    cmake --build $BDIR -j --verbose --target $tgt && \
    ./$BDIR/test/$tgt $filt ; \
  done \
)

I'll keep recording relevant info here.

biojppm commented 4 months ago

Well, the first two failures seem to be similar, and related to the parsing code: the parser seems to be slurping multiple docs into a single scalar. The error I linked to above is related to scalar filtering (ie happens when the scalar has already been slurped and is being filtered), and at first glance seems to be different from these two.

Here are the failed assertions for the first two failures:

scalar_literal

/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:162: Failure
Expected equality of these values:
  actualch.val()
    Which is: 'hello0
there
---
|
 hello1
 there
---
|
  hello2
  there
---
|
ciao3
qua
---
|
    ciao4
    qua
---
|
      ciao5
      qua
---
- |
 hello6
 there
- |
 ciao7
 qua
---
foo: |
 hello8
 there
bar: |
 ciao9
 qua
'
  expectedch.val
    Which is: 'hello0
there
'
Google Test trace:
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:158: 1
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:90: 0
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:89: 0
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:212: comparing: iteration based on the ref children
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:179: 0
/home/usr/proj/rapidyaml/test/test_lib/test_group.cpp:71: comparing parsed tree to ref tree
/home/usr/proj/rapidyaml/test/./test_lib/test_group.def.hpp:13: unix style
/home/usr/proj/rapidyaml/test/test_scalar_literal.cpp:1396: case

scalar_folded

/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:162: Failure
Expected equality of these values:
  actualch.val()
    Which is: 'hello there --- >
 hello
 there
--- >
  hello
  there
--- > ciao qua --- >
    ciao
    qua
--- >
      ciao
      qua
--- - >
 hello
 there
- >
 ciao
 qua
--- foo: >
 hello
 there
bar: >
 ciao
 qua
'
  expectedch.val
    Which is: 'hello there
'
Google Test trace:
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:158: 1
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:90: 0
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:89: 0
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:212: comparing: iteration based on the ref children
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:179: 0
/home/usr/proj/rapidyaml/test/test_lib/test_group.cpp:71: comparing parsed tree to ref tree
/home/usr/proj/rapidyaml/test/./test_lib/test_group.def.hpp:13: unix style
/home/usr/proj/rapidyaml/test/test_scalar_folded.cpp:1858: case

musicinmybrain commented 4 months ago

I’m able to get some output from Clang’s static analyzers like this:

$ CXXFLAGS='-O2 -g' CXX=clang++ scan-build cmake -S . -B build -DCMAKE_CXX_FLAGS_RELEASE:STRING="-DNDEBUG" -DRYML_BUILD_TESTS:BOOL=ON
$ scan-build cmake --build build -j16 --verbose

I am attaching the output directory as scan-build-2024-06-21-132823-3551348-1.zip. It’s based on the latest master, 31d8205d18d749f0410c42b1aff1c3ab3c14494c. Hopefully there is something helpful in there.

musicinmybrain commented 4 months ago

Just for completeness, I tried the clang-analyzer/scan-build build with -Og -g and got exactly the same results. Then I tried omitting -DCMAKE_CXX_FLAGS_RELEASE:STRING="-DNDEBUG" and still got the same results again. The analyzer is (as one might hope) basically insensitive to these build flags.

biojppm commented 4 months ago

Thanks. I ran under scan-build and got the same results. I am looking at these findings; but from the outset none of them seem relevant to the problems above.

biojppm commented 4 months ago

The scan-build warnings are now fixed (see branch fix/440), but the failures remain. Proceeding.

biojppm commented 4 months ago

Found (the location of) the problem. While scanning a block literal scalar, there is this code in _scan_block():

https://github.com/biojppm/rapidyaml/blob/e83730943f5c81bec99145b6b286b1988e8e84c9/src/c4/yml/parse_engine.def.hpp#L2085

Under -O2 this check is failing for lc.rem consisting of ---, which should not happen. Also, adding a print of lc.rem in the line just prior to this check will also cause the check to become correct.

Diving into _is_doc_token(), we have originally:

inline bool _is_doc_token(csubstr s) // FAILS!
{
    if(s.len >= 3)
    {
        if(s.str[0] == '-')
            return _is_doc_begin_token(s);
        else if(s.str[0] == '.')
            return _is_doc_end_token(s);
    }
    return false;
}

If the calls are explicitly inlined, the problem goes away:

inline bool _is_doc_token(csubstr s) // WORKS!
{
    if(s.len >= 3)
    {
        if(s.str[0] == '-')
            return (s.str[1] == '-' && s.str[2] == '-')
                && (s.len == 3 || (s.str[3] == ' ' _RYML_WITH_TAB_TOKENS(|| s.str[3] == '\t')));
        else if(s.str[0] == '.')
            return (s.str[1] == '.' && s.str[2] == '.')
                && (s.len == 3 || (s.str[3] == ' ' _RYML_WITH_TAB_TOKENS(|| s.str[3] == '\t')));
    }
    return false;
}

(note that a redundant check for s.len >= 3 was removed in the inner conditions, but I tested keeping it as well, and this still succeeds). FWIW, the callee functions look like this:

inline bool _is_doc_begin_token(csubstr s)
{
    RYML_ASSERT(s.begins_with('-'));
    RYML_ASSERT(!s.ends_with("\n"));
    RYML_ASSERT(!s.ends_with("\r"));
    return (s.len >= 3 && s.str[1] == '-' && s.str[2] == '-')
        && (s.len == 3 || (s.str[3] == ' ' _RYML_WITH_TAB_TOKENS(|| s.str[3] == '\t')));
}

inline bool _is_doc_end_token(csubstr s)
{
    RYML_ASSERT(s.begins_with('.'));
    RYML_ASSERT(!s.ends_with("\n"));
    RYML_ASSERT(!s.ends_with("\r"));
    return (s.len >= 3 && s.str[1] == '.' && s.str[2] == '.')
        && (s.len == 3 || (s.str[3] == ' ' _RYML_WITH_TAB_TOKENS(|| s.str[3] == '\t')));
}

So much for the trigger or, more precisely, the manifestation of the problem. What about the source of the problem?

I'm not really seeing it.

Inside these functions, I see nothing that would trigger UB. My working theory is this: something may be leading the optimizer to make erroneous assumptions about lc.rem (which is the string passed into the function). Although I'd like to, I'm on the fence whether I should spend time investigating this.

biojppm commented 4 months ago

Looking further into lc.rem, it is set just above:

https://github.com/biojppm/rapidyaml/blob/e83730943f5c81bec99145b6b286b1988e8e84c9/src/c4/yml/parse_engine.def.hpp#L2065

... and even when failing, the subsequent print correctly says blck: peeking at [3]~~~---~~~.

Anyway, looking at that function:

    void reset_with_next_line(substr buf, size_t offset)
    {
        RYML_ASSERT(offset <= buf.len);
        char const* C4_RESTRICT b = &buf[offset];
        char const* C4_RESTRICT e = b;
        // get the current line stripped of newline chars
        while(e < buf.end() && (*e != '\n' && *e != '\r'))
            ++e;
        RYML_ASSERT(e >= b);
        const substr stripped_ = buf.sub(offset, static_cast<size_t>(e - b));
        // advance pos to include the first line ending
        if(e != buf.end() && *e == '\r')
            ++e;
        if(e != buf.end() && *e == '\n')
            ++e;
        RYML_ASSERT(e >= b);
        const substr full_ = buf.sub(offset, static_cast<size_t>(e - b));
        reset(full_, stripped_);
    }

Maybe? In general, the function could be improved. In particular, looking at e - b which is signed, the optimizer could be having problems with the conversion to unsigned, and thereby taking (unwarranted) conclusions as to the length of the resulting substr. Maybe? Otherwise, I don't see much else going on in here.

To test, I've reinstated the problematic version of _is_doc_token() and I've changed to this:

    void reset_with_next_line(substr buf, size_t offset)
    {
        RYML_ASSERT(offset <= buf.len);
        size_t e = offset;
        // get the current line stripped of newline chars
        while(e < buf.len && (buf.str[e] != '\n' && buf.str[e] != '\r'))
            ++e;
        RYML_ASSERT(e >= offset);
        const substr stripped_ = buf.range(offset, e);
        // advance pos to include the first line ending
        if(e < buf.len && buf.str[e] == '\r')
            ++e;
        if(e < buf.len && buf.str[e] == '\n')
            ++e;
        const substr full_ = buf.range(offset, e);
        reset(full_, stripped_);
    }

which is arguably cleaner. However, this has no impact on the failure; ie it manifests with both versions.

However, looking just below, there are comments mentioning gcc optimizer shenanigans with the same class:

https://github.com/biojppm/rapidyaml/blob/e83730943f5c81bec99145b6b286b1988e8e84c9/src/c4/yml/parser_state.hpp#L89

So this may be related indeed.

biojppm commented 4 months ago

Also, an important note: prints were enabled only to _scan_block() by doing this:

modified   src/c4/yml/detail/parser_dbg.hpp
@@ -63,7 +63,8 @@

 //-----------------------------------------------------------------------------

-#ifdef RYML_DBG
+#if defined(RYML_DBG) || 1
+#include <cstdio>

 #include <c4/dump.hpp>
 namespace c4 {
modified   src/c4/yml/parse_engine.def.hpp
@@ -1998,6 +1998,10 @@ typename ParseEngine<EventHandler>::ScannedScalar ParseEngine<EventHandler>::_sc

 //-----------------------------------------------------------------------------
+#undef _c4dbgp
+#undef _c4dbgpf
+#define _c4dbgp(msg)        _dbg_printf("{}:{}: "   msg "\n", __FILE__, __LINE__                )
+#define _c4dbgpf(fmt, ...)  _dbg_printf("{}:{}: "   fmt "\n", __FILE__, __LINE__, ## __VA_ARGS__)
 template<class EventHandler>
 void ParseEngine<EventHandler>::_scan_block(ScannedBlock *C4_RESTRICT sb, size_t indref)
 {
@@ -2222,6 +2226,11 @@ void ParseEngine<EventHandler>::_scan_block(ScannedBlock *C4_RESTRICT sb, size_t
     sb->chomp = chomp;
 }

+#undef _c4dbgp
+#undef _c4dbgpf
+#define _c4dbgp(fmt)
+#define _c4dbgpf(fmt, ...)
+

 //-----------------------------------------------------------------------------
 //-----------------------------------------------------------------------------

Like noted above, despite all other prints in the function, adding a print just before the call to _is_doc_token() was causing the problem to go away.

biojppm commented 4 months ago

@musicinmybrain The fixes in the linked PR address the failures for me. Please try the current master; feel free to reopen if it still fails.

musicinmybrain commented 4 months ago

This did work for me with https://github.com/biojppm/rapidyaml/commit/619ad08b7f60d998fc5ceabcf96a13bb34eb6fe1 and https://github.com/biojppm/c4core/commit/a95b20e8c02f4d8eb4425a85792d0f6f81007441. Thanks!

Are you planning new releases of c4core and rapidyaml any time soon?

biojppm commented 4 months ago

Yes, releases will be coming soon.

musicinmybrain commented 3 months ago

Yes, releases will be coming soon.

(Polite ping regarding these pending releases.)

biojppm commented 3 months ago

Ping received. Unfortunately, in the interim github retired some nodejs version needed to use the git checkout action; this meant that I had to spend a lot of time working around this problem, first in c4core, and now I need to do the same for rapidyaml. In the process, I failed to make the release before going on vacation. Now I'm back, and will get back on it.

Also, be aware that I intend to fix some of the bugs reported in the interim, before making the release. So current ETA is ~2-3 weeks, if things go well.

musicinmybrain commented 3 months ago

Ping received. Unfortunately, in the interim github retired some nodejs version needed to use the git checkout action; this meant that I had to spend a lot of time working around this problem, first in c4core, and now I need to do the same for rapidyaml. In the process, I failed to make the release before going on vacation. Now I'm back, and will get back on it.

Also, be aware that I intend to fix some of the bugs reported in the interim, before making the release. So current ETA is ~2-3 weeks, if things go well.

Sounds good to me! I’ll try out the new release when it’s ready.

biojppm commented 3 months ago

@musicinmybrain Release 0.7.1 now available

musicinmybrain commented 3 months ago

@musicinmybrain Release 0.7.1 now available

I tested the latest versions of everything, and the results look much better! Unfortunately, I’m running into quite a few regressions on ppc64le only:

The following tests FAILED:
      1 - ryml-test-quickstart (Failed)
      7 - ryml-test-parse_engine_6_qmrk (Failed)
      8 - ryml-test-parse_engine_7_seqimap (SEGFAULT)
     16 - ryml-test-tree (SEGFAULT)
     18 - ryml-test-emit (Failed)
     19 - ryml-test-style (Failed)
     22 - ryml-test-json (Failed)
     24 - ryml-test-merge (Failed)
     28 - ryml-test-seq (SEGFAULT)
     31 - ryml-test-map (SEGFAULT)
     38 - ryml-test-scalar_null (Failed)
     53 - ryml-test-anchor (Failed)
     55 - ryml-test-number (Failed)

builder-live.log

(I am not sure if I tested 0.7.0 thoroughly on non-x86_64 architectures or not.)

biojppm / rapidyaml

Some test failures in 0.7.0 when compiled gcc -O2 #440

scalar_literal

scalar_folded