Closed musicinmybrain closed 4 months ago
I would conjecture that perhaps something somewhere is relying on undefined behavior, and as a result the optimizer ends up generating incorrect code – but an assertion happens to impede the optimization that reveals the problem. This would explain why the tests seem to pass when NDEBUG
is not defined.
Thanks for the detailed repro, which is helpful.
I was able to reproduce the problem with gcc. But not with clang; didn't try with visual studio.
FYI during the parser refactor I've had similar trouble with the visual studio 32 bit optimizer; none of the sanitizers or static analyzers were reporting any problem there or anywhere else. Like I noted there, I was only able to work around the problem, which I did by moving the problematic code to a separate function.
The outward signs are similar to that problem; if I enable the logs in the parser the failure goes away.
Also, using -O3
makes the failure go away as well.
Fun times ahead.
Thanks for the detailed repro, which is helpful.
You’re welcome. It took a few days of intermittently poking at this for me to get from “it breaks in my environment, but I can’t tell you why” to something useful enough to report.
I was able to reproduce the problem with gcc (but not with clang; didn't try with visual studio).
I’m glad to hear it! At least we have consistency.
Fun times ahead.
Yes, I always dread reporting bugs like this one because I know how hard they can be to investigate.
I’ll keep Fedora Rawhide at 0.6.0 for now. Please let me know if there’s anything I can do to help (other than actually finding the root cause!)
Well, the dread for these bugs is real, and pervasive. It is a laborious process to trace them to their source.
This command will reproduce all failures:
cd ~/proj/rapidyaml ; \
( export BDIR=440_o2 ; \
CXXFLAGS='-O2' cmake -S . -B $BDIR -DCMAKE_CXX_FLAGS_RELEASE:STRING="-DNDEBUG" -DRYML_BUILD_TESTS:BOOL=ON -DRYML_DBG:BOOL=OFF ; \
pairs=( \
"ryml-test-scalar_literal,--gtest_filter=*/YmlT*ix/7" \
"ryml-test-scalar_folded,--gtest_filter=*/YmlT*ix/0" \
"ryml-test-suite,W4TN-in_yaml /home/jpmag/proj/extern/yaml-test-suite/W4TN in.yaml --gtest_filter=_/unix_in_arena.1_compare_emitted_events_to_ref_events/0" \
) ; \
for pair in "${pairs[@]}" ; do \
IFS=, read tgt filt <<< "$pair" ; \
echo "$tgt -- $filt" ; \
cmake --build $BDIR -j --verbose --target $tgt && \
./$BDIR/test/$tgt $filt ; \
done \
)
I'll keep recording relevant info here.
Well, the first two failures seem to be similar, and related to the parsing code: the parser seems to be slurping multiple docs into a single scalar. The error I linked to above is related to scalar filtering (ie happens when the scalar has already been slurped and is being filtered), and at first glance seems to be different from these two.
Here are the failed assertions for the first two failures:
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:162: Failure
Expected equality of these values:
actualch.val()
Which is: 'hello0
there
---
|
hello1
there
---
|
hello2
there
---
|
ciao3
qua
---
|
ciao4
qua
---
|
ciao5
qua
---
- |
hello6
there
- |
ciao7
qua
---
foo: |
hello8
there
bar: |
ciao9
qua
'
expectedch.val
Which is: 'hello0
there
'
Google Test trace:
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:158: 1
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:90: 0
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:89: 0
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:212: comparing: iteration based on the ref children
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:179: 0
/home/usr/proj/rapidyaml/test/test_lib/test_group.cpp:71: comparing parsed tree to ref tree
/home/usr/proj/rapidyaml/test/./test_lib/test_group.def.hpp:13: unix style
/home/usr/proj/rapidyaml/test/test_scalar_literal.cpp:1396: case
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:162: Failure
Expected equality of these values:
actualch.val()
Which is: 'hello there --- >
hello
there
--- >
hello
there
--- > ciao qua --- >
ciao
qua
--- >
ciao
qua
--- - >
hello
there
- >
ciao
qua
--- foo: >
hello
there
bar: >
ciao
qua
'
expectedch.val
Which is: 'hello there
'
Google Test trace:
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:158: 1
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:90: 0
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:89: 0
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:212: comparing: iteration based on the ref children
/home/usr/proj/rapidyaml/test/test_lib/test_case_node.cpp:179: 0
/home/usr/proj/rapidyaml/test/test_lib/test_group.cpp:71: comparing parsed tree to ref tree
/home/usr/proj/rapidyaml/test/./test_lib/test_group.def.hpp:13: unix style
/home/usr/proj/rapidyaml/test/test_scalar_folded.cpp:1858: case
I’m able to get some output from Clang’s static analyzers like this:
$ CXXFLAGS='-O2 -g' CXX=clang++ scan-build cmake -S . -B build -DCMAKE_CXX_FLAGS_RELEASE:STRING="-DNDEBUG" -DRYML_BUILD_TESTS:BOOL=ON
$ scan-build cmake --build build -j16 --verbose
I am attaching the output directory as scan-build-2024-06-21-132823-3551348-1.zip. It’s based on the latest master
, 31d8205d18d749f0410c42b1aff1c3ab3c14494c. Hopefully there is something helpful in there.
Just for completeness, I tried the clang-analyzer/scan-build build with -Og -g
and got exactly the same results. Then I tried omitting -DCMAKE_CXX_FLAGS_RELEASE:STRING="-DNDEBUG"
and still got the same results again. The analyzer is (as one might hope) basically insensitive to these build flags.
Thanks. I ran under scan-build and got the same results. I am looking at these findings; but from the outset none of them seem relevant to the problems above.
The scan-build warnings are now fixed (see branch fix/440), but the failures remain. Proceeding.
Found (the location of) the problem. While scanning a block literal scalar, there is this code in _scan_block()
:
Under -O2
this check is failing for lc.rem
consisting of ---
, which should not happen. Also, adding a print of lc.rem
in the line just prior to this check will also cause the check to become correct.
Diving into _is_doc_token()
, we have originally:
inline bool _is_doc_token(csubstr s) // FAILS!
{
if(s.len >= 3)
{
if(s.str[0] == '-')
return _is_doc_begin_token(s);
else if(s.str[0] == '.')
return _is_doc_end_token(s);
}
return false;
}
If the calls are explicitly inlined, the problem goes away:
inline bool _is_doc_token(csubstr s) // WORKS!
{
if(s.len >= 3)
{
if(s.str[0] == '-')
return (s.str[1] == '-' && s.str[2] == '-')
&& (s.len == 3 || (s.str[3] == ' ' _RYML_WITH_TAB_TOKENS(|| s.str[3] == '\t')));
else if(s.str[0] == '.')
return (s.str[1] == '.' && s.str[2] == '.')
&& (s.len == 3 || (s.str[3] == ' ' _RYML_WITH_TAB_TOKENS(|| s.str[3] == '\t')));
}
return false;
}
(note that a redundant check for s.len >= 3
was removed in the inner conditions, but I tested keeping it as well, and this still succeeds). FWIW, the callee functions look like this:
inline bool _is_doc_begin_token(csubstr s)
{
RYML_ASSERT(s.begins_with('-'));
RYML_ASSERT(!s.ends_with("\n"));
RYML_ASSERT(!s.ends_with("\r"));
return (s.len >= 3 && s.str[1] == '-' && s.str[2] == '-')
&& (s.len == 3 || (s.str[3] == ' ' _RYML_WITH_TAB_TOKENS(|| s.str[3] == '\t')));
}
inline bool _is_doc_end_token(csubstr s)
{
RYML_ASSERT(s.begins_with('.'));
RYML_ASSERT(!s.ends_with("\n"));
RYML_ASSERT(!s.ends_with("\r"));
return (s.len >= 3 && s.str[1] == '.' && s.str[2] == '.')
&& (s.len == 3 || (s.str[3] == ' ' _RYML_WITH_TAB_TOKENS(|| s.str[3] == '\t')));
}
So much for the trigger or, more precisely, the manifestation of the problem. What about the source of the problem?
I'm not really seeing it.
Inside these functions, I see nothing that would trigger UB. My working theory is this: something may be leading the optimizer to make erroneous assumptions about lc.rem
(which is the string passed into the function). Although I'd like to, I'm on the fence whether I should spend time investigating this.
Looking further into lc.rem
, it is set just above:
... and even when failing, the subsequent print correctly says blck: peeking at [3]~~~---~~~
.
Anyway, looking at that function:
void reset_with_next_line(substr buf, size_t offset)
{
RYML_ASSERT(offset <= buf.len);
char const* C4_RESTRICT b = &buf[offset];
char const* C4_RESTRICT e = b;
// get the current line stripped of newline chars
while(e < buf.end() && (*e != '\n' && *e != '\r'))
++e;
RYML_ASSERT(e >= b);
const substr stripped_ = buf.sub(offset, static_cast<size_t>(e - b));
// advance pos to include the first line ending
if(e != buf.end() && *e == '\r')
++e;
if(e != buf.end() && *e == '\n')
++e;
RYML_ASSERT(e >= b);
const substr full_ = buf.sub(offset, static_cast<size_t>(e - b));
reset(full_, stripped_);
}
Maybe? In general, the function could be improved. In particular, looking at e - b
which is signed, the optimizer could be having problems with the conversion to unsigned, and thereby taking (unwarranted) conclusions as to the length of the resulting substr. Maybe? Otherwise, I don't see much else going on in here.
To test, I've reinstated the problematic version of _is_doc_token()
and I've changed to this:
void reset_with_next_line(substr buf, size_t offset)
{
RYML_ASSERT(offset <= buf.len);
size_t e = offset;
// get the current line stripped of newline chars
while(e < buf.len && (buf.str[e] != '\n' && buf.str[e] != '\r'))
++e;
RYML_ASSERT(e >= offset);
const substr stripped_ = buf.range(offset, e);
// advance pos to include the first line ending
if(e < buf.len && buf.str[e] == '\r')
++e;
if(e < buf.len && buf.str[e] == '\n')
++e;
const substr full_ = buf.range(offset, e);
reset(full_, stripped_);
}
which is arguably cleaner. However, this has no impact on the failure; ie it manifests with both versions.
However, looking just below, there are comments mentioning gcc optimizer shenanigans with the same class:
So this may be related indeed.
Also, an important note: prints were enabled only to _scan_block()
by doing this:
modified src/c4/yml/detail/parser_dbg.hpp
@@ -63,7 +63,8 @@
//-----------------------------------------------------------------------------
-#ifdef RYML_DBG
+#if defined(RYML_DBG) || 1
+#include <cstdio>
#include <c4/dump.hpp>
namespace c4 {
modified src/c4/yml/parse_engine.def.hpp
@@ -1998,6 +1998,10 @@ typename ParseEngine<EventHandler>::ScannedScalar ParseEngine<EventHandler>::_sc
//-----------------------------------------------------------------------------
+#undef _c4dbgp
+#undef _c4dbgpf
+#define _c4dbgp(msg) _dbg_printf("{}:{}: " msg "\n", __FILE__, __LINE__ )
+#define _c4dbgpf(fmt, ...) _dbg_printf("{}:{}: " fmt "\n", __FILE__, __LINE__, ## __VA_ARGS__)
template<class EventHandler>
void ParseEngine<EventHandler>::_scan_block(ScannedBlock *C4_RESTRICT sb, size_t indref)
{
@@ -2222,6 +2226,11 @@ void ParseEngine<EventHandler>::_scan_block(ScannedBlock *C4_RESTRICT sb, size_t
sb->chomp = chomp;
}
+#undef _c4dbgp
+#undef _c4dbgpf
+#define _c4dbgp(fmt)
+#define _c4dbgpf(fmt, ...)
+
//-----------------------------------------------------------------------------
//-----------------------------------------------------------------------------
Like noted above, despite all other prints in the function, adding a print just before the call to _is_doc_token()
was causing the problem to go away.
@musicinmybrain The fixes in the linked PR address the failures for me. Please try the current master; feel free to reopen if it still fails.
This did work for me with https://github.com/biojppm/rapidyaml/commit/619ad08b7f60d998fc5ceabcf96a13bb34eb6fe1 and https://github.com/biojppm/c4core/commit/a95b20e8c02f4d8eb4425a85792d0f6f81007441. Thanks!
Are you planning new releases of c4core
and rapidyaml
any time soon?
Yes, releases will be coming soon.
Yes, releases will be coming soon.
(Polite ping regarding these pending releases.)
Ping received. Unfortunately, in the interim github retired some nodejs version needed to use the git checkout action; this meant that I had to spend a lot of time working around this problem, first in c4core, and now I need to do the same for rapidyaml. In the process, I failed to make the release before going on vacation. Now I'm back, and will get back on it.
Also, be aware that I intend to fix some of the bugs reported in the interim, before making the release. So current ETA is ~2-3 weeks, if things go well.
Ping received. Unfortunately, in the interim github retired some nodejs version needed to use the git checkout action; this meant that I had to spend a lot of time working around this problem, first in c4core, and now I need to do the same for rapidyaml. In the process, I failed to make the release before going on vacation. Now I'm back, and will get back on it.
Also, be aware that I intend to fix some of the bugs reported in the interim, before making the release. So current ETA is ~2-3 weeks, if things go well.
Sounds good to me! I’ll try out the new release when it’s ready.
@musicinmybrain Release 0.7.1 now available
@musicinmybrain Release 0.7.1 now available
I tested the latest versions of everything, and the results look much better! Unfortunately, I’m running into quite a few regressions on ppc64le
only:
The following tests FAILED:
1 - ryml-test-quickstart (Failed)
7 - ryml-test-parse_engine_6_qmrk (Failed)
8 - ryml-test-parse_engine_7_seqimap (SEGFAULT)
16 - ryml-test-tree (SEGFAULT)
18 - ryml-test-emit (Failed)
19 - ryml-test-style (Failed)
22 - ryml-test-json (Failed)
24 - ryml-test-merge (Failed)
28 - ryml-test-seq (SEGFAULT)
31 - ryml-test-map (SEGFAULT)
38 - ryml-test-scalar_null (Failed)
53 - ryml-test-anchor (Failed)
55 - ryml-test-number (Failed)
(I am not sure if I tested 0.7.0 thoroughly on non-x86_64
architectures or not.)
I was working on updating the Fedora package to 0.7.0 and encountered a couple of test failures:
I was able to reproduce these in a git checkout on Fedora 40 (
x86_64
) by setting the same build flags and options as the RPM build; here I encountered a third test with failures,To reproduce in a checkout, I imitated the RPM build. This uses the default options from the
%cmake
RPM macro, plus:-GNinja
– because it is faster and has no real disadvantages-DRYML_CXX_STANDARD=14
– for compatibility with the current packaged gtest, which no longer supports C++11; changing this to C++17 makes no difference-DRYML_BUILD_TESTS:BOOL=ON
-DRYML_TEST_FUZZ:BOOL=OFF
– so that we do not have to include the contents of https://github.com/biojppm/rapidyaml-data in the source RPMThat looks like:
By trial and error, I was able to minimize that to the following:
In short, in order to reproduce this, it seems to suffice to set
-O2 -NDEBUG
.