biojppm / rapidyaml

Rapid YAML - a library to parse and emit YAML, and do it fast.
MIT License
563 stars 95 forks source link

0.7.1 tests failing in ppc64le with g++14 -O2 #464

Closed biojppm closed 2 weeks ago

biojppm commented 3 weeks ago

TL;DR full symptoms, gathered below:

Original report

v0.7.1 is showing quite a few regressions on ppc64le only:

The following tests FAILED:
      1 - ryml-test-quickstart (Failed)
      7 - ryml-test-parse_engine_6_qmrk (Failed)
      8 - ryml-test-parse_engine_7_seqimap (SEGFAULT)
     16 - ryml-test-tree (SEGFAULT)
     18 - ryml-test-emit (Failed)
     19 - ryml-test-style (Failed)
     22 - ryml-test-json (Failed)
     24 - ryml-test-merge (Failed)
     28 - ryml-test-seq (SEGFAULT)
     31 - ryml-test-map (SEGFAULT)
     38 - ryml-test-scalar_null (Failed)
     53 - ryml-test-anchor (Failed)
     55 - ryml-test-number (Failed)

builder-live.log

(I am not sure if I tested 0.7.0 thoroughly on non-x86_64 architectures or not.)

Originally posted by @musicinmybrain in https://github.com/biojppm/rapidyaml/issues/440#issuecomment-2294036666

biojppm commented 3 weeks ago

@musicinmybrain Thanks for reporting.

From glancing at the logs I see this is also compiled with -O2. I am guessing this is the same as #440, but now with the ppc64le frontend. ppc64le is tested in the CI of this project, and it succeeds with 0.7.1, but not with -O2; only with -O0 and -O3, as these are the cmake presets. (Of course, if it didn't succeed, it would not be released).

Two significant questions for reflection:

  1. what specific flag/flags triggered by -O2 and not by either of -O3 or -O0 causes the problem?
  2. what is the specific UB that these flags rely on, and which triggers the wrong codegen by the frontend?

Having an answer to 2 would help in revising/improving the code. However, while the answer to 1 can be found with some effort (given the constraints with this architecture), I'm not sure that the answer to 2 can be found. But possibly the answer to 1 will significantly narrow the search space.

Also, while we were hypothesizing in #440 that the cause for the problem is UB, it could also be that gcc has a bug in its frontend, and the bug manifests only with the flags from 1.


Chasing this bug will be significantly harder than #440, as I don't have access to ppc64le hardware, and any attempt from my side at diagnosing and fixing the problem must involve a CI roundtrip. Do you have access to ppc64le?

musicinmybrain commented 3 weeks ago

Chasing this bug will be significantly harder than #440, as I don't have access to ppc64le hardware, and any attempt from my side at diagnosing and fixing the problem must involve a CI roundtrip. Do you have access to ppc64le?

I don’t have general-purpose interactive shell access to real ppc64le hardware.

I can do as many experimental RPM package builds in COPR as I want using real pp64le hardware – see https://copr.fedorainfracloud.org/coprs/music/rapidyaml/packages/ – and so can anyone who makes a FAS account and is willing to express everything as an RPM package. (As a Fedora packager, I can also do “scratch builds” in real Fedora infrastructure, but this doesn’t help for testing coordinated updates of multiple packages like this – we use COPR for that!)

I can reproduce the error in an emulated ppc64le mock chroot on my own machine via qemu-user-static. This does allow me to play around interactively, including choosing to enable network access and build things directly from git checkouts:

mock -r fedora-rawhide-ppc64le --enable-network --shell

Something like that is probably the most promising avenue if you want to try to play with this interactively yourself.

I’m happy to run tests with the tools at my disposal.

The flags corresponding to each of the optimization levels are documented at https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html. I wonder if this happens with -O1

musicinmybrain commented 3 weeks ago

For Fedora’s purposes, I could increase the optimization level to -O3 on ppc64le as a workaround. We should respect the distribution’s default compiler flags in general, but can override them with a good justification. For -O3, that would usually be a benchmark showing a significant gain, but this would be a good justification too. Of course, that’s a “magical” fix that could stop working at any time based on compiler implementation details, so I would view it as a temporary workaround.

I just confirmed that tests pass with -O1 for me. I’m checking -O3 now.

If there is a single flag that triggers test failures between -O1 and -O2, we could probably find it by bisecting. (There are many fewer flags added from -O2 to -O3 than from -O1 to -O2, but the flag that makes things work again is probably less revealing than the one that breaks them in the first place.) If triggering the problem requires several optimization flags acting together – which would not be surprising – it might be harder to understand exactly which flags are involved.

musicinmybrain commented 3 weeks ago

This actually fails with -O3 for me, too:

builder-live-ppc64le-O3.log

It’s possible that some of the other default flags in Fedora (including enabling LTO) are affecting this.

biojppm commented 3 weeks ago

I am trying to create a Minimal Verifiable Example. Looking at the logs you originally posted, I see quickstart fails:

image

That's convenient to have a MVE!

So I edited the quickstart example to run only the function where this happens, sample_json(), and tweaked the rarearchs workflow to try with -O0... all the way to -O3, with/without -g -DNDEBUG. (Branch is fix/464_ub_ppc64le).

Unfortunately none of these fail :-/.

Can you try to get it to fail on your end? Relevant commands should be:

cd samples/add_subdirectory
mkdir -p build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_CXX_FLAGS_RELEASE:STRING="...."  # fill with appropriate options here
cmake --build build --config Release --target run
musicinmybrain commented 3 weeks ago

So I started by trying to imitate Fedora’s build flags as closely as I reasonably could, with the idea that if I could reproduce this with samples/add_subdirectory, I would

$ mock -r fedora-rawhide-ppc64le --clean
$ mock -r fedora-rawhide-ppc64le -i git-core cmake gcc-c++ 'cmake(gtest)'
$ mock -r fedora-rawhide-ppc64le --shell --enable-network
<mock-chroot> sh-5.2# git clone https://github.com/biojppm/rapidyaml.git
<mock-chroot> sh-5.2# cd rapidyaml/
<mock-chroot> sh-5.2# git checkout fix/464_ub_ppc64le
<mock-chroot> sh-5.2# git submodule update --init --recursive
<mock-chroot> sh-5.2# cd samples/add_subdirectory
<mock-chroot> sh-5.2# CXXFLAGS="-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64 -mcpu=power8 -mtune=power8 -fasynchronous-unwind-tables -fstack-clash-protection" cmake -S . -B build -DRYML_CXX_STANDARD=14
<mock-chroot> sh-5.2# cmake --build build -j16 --target run
[ 95%] Linking CXX executable ryml-quickstart
/builddir/rapidyaml/ext/c4core/src/c4/charconv.hpp:197:3: warning: type ‘c4::RealFormat_e’ violates the C++ One Definition Rule [-Wodr]
  197 | } RealFormat_e;
      |   ^
/builddir/rapidyaml/ext/c4core/src/c4/charconv.hpp:209:3: note: an enum with different value name is defined in another translation unit
  209 | } RealFormat_e;
      |   ^
/builddir/rapidyaml/ext/c4core/src/c4/charconv.hpp:190:5: note: name ‘FTOA_FLOAT’ is defined as 32-bit while another translation unit defines it as 8-bit
  190 |     FTOA_FLOAT = static_cast<std::underlying_type<std::chars_format>::type>(std::chars_format::fixed),
      |     ^
/builddir/rapidyaml/ext/c4core/src/c4/charconv.hpp:202:5: note: mismatching definition
  202 |     FTOA_FLOAT = 'f',
      |     ^
[ 95%] Built target ryml-quickstart
[100%] running: /builddir/rapidyaml/samples/add_subdirectory/build/ryml-quickstart
WTF1:
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
gmake[3]: *** [CMakeFiles/run.dir/build.make:71: CMakeFiles/run] Segmentation fault (core dumped)

I haven’t had a chance to try to minimize the compiler flags yet, but that looked interesting enough to go ahead and report it!

musicinmybrain commented 3 weeks ago

The warning persists when I drop -DRYML_CXX_STANDARD=14, but goes away when I drop -flto=auto -ffat-lto-objects. That make sense, because the compiler can’t detect ODR violations if we aren’t asking it to consider multiple translation units.

I’ve reduced the compiler flags to

<mock-chroot> sh-5.2# CXXFLAGS="-O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong -m64 -mcpu=power8 -mtune=power8 " cmake -S . -B build

while still reproducing the segfault.

I assume I can reduce them further, but it takes a while to compile the example under emulation.

musicinmybrain commented 3 weeks ago

Same segfault down to CXXFLAGS="-O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong"; dropping -fstack-protector-strong gets me a test failure instead of a segfault:

[100%] running: /builddir/rapidyaml/samples/add_subdirectory/build/ryml-quickstart
WTF1:
{"": "","": "","": "","": ""}/builddir/rapidyaml/samples/quickstart.cpp:4293: ERROR: ryml::emitrs_json<std::string>(tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
/builddir/rapidyaml/samples/quickstart.cpp:4294: OK! ryml::emitrs_json<std::string>(json_tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
WTF2:
{"": "","": "","": "","": ""}/builddir/rapidyaml/samples/quickstart.cpp:4299: ERROR: ss.str() == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
Completed 3 checks.
ERROR: 2/3 checks failed.
musicinmybrain commented 3 weeks ago

If I restore the LTO flags (-flto=auto -ffat-lto-objects), then the ODR violation warning reappears. If I also add std=c++11 so that the example is compiled as C++11 rather than the unspecified default (C++17 in practice), then I can get the test failures without the ODR violation warning. This makes sense because the different definitions of FTOA_FLOAT were due to different C++ standards in different translation units. However, it also demonstrates that the ODR violation is not the cause of the test failure.

<mock-chroot> sh-5.2# CXXFLAGS="-O2 -flto=auto -ffat-lto-objects  -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -std=c++11" cmake -S . -B build
[ 95%] Linking CXX executable ryml-quickstart
/usr/bin/cmake -E cmake_link_script CMakeFiles/ryml-quickstart.dir/link.txt --verbose=1
/usr/bin/c++ -O2 -flto=auto -ffat-lto-objects  -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -std=c++11 "CMakeFiles/ryml-quickstart.dir/builddir/rapidyaml/samples/quickstart.cpp.o" -o ryml-quickstart  ryml/libryml.a 
gmake[3]: Leaving directory '/builddir/rapidyaml/samples/add_subdirectory/build'
[ 95%] Built target ryml-quickstart
/usr/bin/gmake  -f CMakeFiles/run.dir/build.make CMakeFiles/run.dir/depend
gmake[3]: Entering directory '/builddir/rapidyaml/samples/add_subdirectory/build'
cd /builddir/rapidyaml/samples/add_subdirectory/build && /usr/bin/cmake -E cmake_depends "Unix Makefiles" /builddir/rapidyaml/samples/add_subdirectory /builddir/rapidyaml/samples/add_subdirectory /builddir/rapidyaml/samples/add_subdirectory/build /builddir/rapidyaml/samples/add_subdirectory/build /builddir/rapidyaml/samples/add_subdirectory/build/CMakeFiles/run.dir/DependInfo.cmake "--color="
gmake[3]: Leaving directory '/builddir/rapidyaml/samples/add_subdirectory/build'
/usr/bin/gmake  -f CMakeFiles/run.dir/build.make CMakeFiles/run.dir/build
gmake[3]: Entering directory '/builddir/rapidyaml/samples/add_subdirectory/build'
[100%] running: /builddir/rapidyaml/samples/add_subdirectory/build/ryml-quickstart
./ryml-quickstart
WTF1:
{"": "","": "","": "","": ""}/builddir/rapidyaml/samples/quickstart.cpp:4293: ERROR: ryml::emitrs_json<std::string>(tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
/builddir/rapidyaml/samples/quickstart.cpp:4294: OK! ryml::emitrs_json<std::string>(json_tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
WTF2:
{"": "","": "","": "","": ""}/builddir/rapidyaml/samples/quickstart.cpp:4299: ERROR: ss.str() == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
Completed 3 checks.
ERROR: 2/3 checks failed.
biojppm commented 3 weeks ago

Thanks for the investigation and the repro -- this is invaluable. That is exactly the same error as in your logs, so this means we have a reliable way to reproduce on a local machine.

I did not know about mock, and I am only somewhat familiar with qemu, but now there's a very good reason to pick these up. I'll report back when I'm able to reproduce on my end.

biojppm commented 3 weeks ago

As for the ODR problem, let's track it in the linked issue. Thanks for reporting that one too!

biojppm commented 3 weeks ago

Cannot reproduce it in ubuntu using powerpc64le-linux-gnu-g++ together with qemu-ppc64le-static...

Here's what I did, which succeeds:

set -xe ; \
cd ~/proj/rapidyaml/samples/add_subdirectory/ ; \
sys=powerpc64le ; \
cxxflags="-O2 -flto=auto -ffat-lto-objects  -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -std=c++11 -DNDEBUG" ; \
bdir=build/$sys ; \
mkdir -p $bdir ; \
cmake -S . -B $bdir -D CMAKE_BUILD_TYPE=Release \
      -DCMAKE_CXX_FLAGS_RELEASE:STRING="$cxxflags" \
      --toolchain ../../.github/toolchains/linux-gnu-$sys.cmake  ;  \
cmake --build $bdir --target run -j --verbose

Also, adding -fstack-protector-strong did not cause any segfault...

(BTW, I've tweaked the quickstart to make compilation faster; it now goes under 5s in my machine)

biojppm commented 3 weeks ago

I tried further now, using a different ubuntu distro. No luck, could not reproduce it with any of those flags.

If the compiler flags work in one OS, but not in another OS, it is starting to look like something else in the environment, like compiler/linker version.

I don't have easy access to any RedHat distro to use mock to try to reproduce. Is there a docker image that you recommend? Or another easy way?

FWIW in the previous comment I used ubuntu22.04, which has

root@bach:/src# powerpc64le-linux-gnu-g++ --version
powerpc64le-linux-gnu-g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

root@bach:/src# qemu-ppc64le-static --version
qemu-ppc64le version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.22)

now I used ubuntu18.04 which has

root@bach:/rapidyaml# powerpc64le-linux-gnu-g++ --version
powerpc64le-linux-gnu-g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

root@bach:/rapidyaml# qemu-ppc64le-static --version
qemu-ppc64le version 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.42)

This does not confirm any version related suspicion.

FWIW, the prior setup was simply

$ apt-get install -y \
                 gcc-powerpc64le-linux-gnu \
                 g++-powerpc64le-linux-gnu \
                 qemu \
                 qemu-system \
                 qemu-user-static

and then running the command in the previous comment.

biojppm commented 3 weeks ago

I also tried running all the tests you listed as failing. None of them failed...

( set -xe ; \
cxxflags="-O2 -flto=auto -ffat-lto-objects  -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -std=c++11 -DNDEBUG" ; \
cd /rapidyaml/ ; \
sys=powerpc64le ; \
bdir=build/$sys ; \
mkdir -p $bdir ; \
cmake -S . -B $bdir \
  -D CMAKE_BUILD_TYPE=Release \
  -DRYML_BUILD_TESTS=ON \
  -DGIT=/usr/bin/git    \\
  -DCMAKE_CXX_FLAGS_RELEASE:STRING="$cxxflags"      \
   --toolchain .github/toolchains/linux-gnu-$sys.cmake  ;  \
for t in quickstart parse_engine_6_qmrk parse_engine_7_seqimap tree emit json merge seq map scalar_null anchor number ; do \
cmake --build $bdir -j --verbose --target ryml-test-$t \
  && qemu-ppc64le-static $bdir/test/ryml-test-$t \
  || echo "\n\n\nFAILED\n\n" ; \
done )
musicinmybrain commented 3 weeks ago

Hmm, maybe this one could be toolchain-specific after all.

I reduced the command line to CXXFLAGS="-O2 -g -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3" cmake -S . -B build.

There are official Fedora docker images at https://hub.docker.com/_/fedora/, but I do not have any insight into combining docker with qemu. I suppose you might be able to run mock and qemu-user-static inside a Docker container.

You could also use a VM, either installing from an ISO (https://fedoraproject.org/workstation/download) or using one of the Cloud Base Images (https://fedoraproject.org/cloud/download). You could use a ppc64le VM if you know a way to run the whole VM under qemu, or you could use mock and qemu-user-static inside the VM.

All of my tests for this particular issue have been in a Fedora Rawhide chroot, with GCC 14.1.1 and binutils 2.42.90. I’ll try the same process from https://github.com/biojppm/rapidyaml/issues/464#issuecomment-2297446452 on some other releases and see if the issue can be reproduced there. Fedora 41 and 40 are both temporarily ahead of Rawhide with GCC 14.2.1; F41 has binutils 2.43 and F40 has binutils 2.41. Fedora 39 has GCC 13.3.1 and binutils 2.40. I can also try some enterprise distros, like CentOS Stream 10 with GCC 14.1.1 and binutils 2.41, Alma Linux 9 with GCC 11.4.1 and binutils 2.35.2, and Rocky Linux 8 with GCC 8.5.0 and binutils 2.30.

(I could try clang, too, but I’ll look at the various GCC’s first.)

musicinmybrain commented 3 weeks ago

Reproduced on:

Failed to reproduce on:

It’s looking very likely that this is a regression specific to GCC 14. Whether it’s a compiler bug / miscompilation – which is very possible – or perhaps a problem in rapidyaml revealed by a new optimization, I couldn’t say.

biojppm commented 2 weeks ago

Made some progress: was able to reproduce with a gcc14.2 cross-compiler, and unable to reproduce with gcc13.2. Using just -O2 was enough to trigger the problem.

I did so by getting a docker image with a gcc14(ubuntu24.10) and gcc13(ubuntu24.04) cross-compiler:

#img=ubuntu:24.04 # uses gcc13.2
img=ubuntu:24.10 # uses gcc14.2
docker pull $img ; docker run --rm -it --network host -v ~/proj:/proj $img

... then inside the container:

apt-get update ; apt-get install -y \
    g{++,cc}-powerpc64le-linux-gnu \
    build-essential \
    cmake \
    git \
    qemu-user-static
# define a function to try with different compiler flags
function try464()
{
   cxxflags="$1"
   cmkflags="$2"
   ( set -xe ; \
   cd /proj/rapidyaml/samples/add_subdirectory/ ; \
   sys=powerpc64le ; \
   bdir=build/$sys-docker ; \
   mkdir -p $bdir ; \
   cmake -S . -B $bdir \
         --toolchain ../../.github/toolchains/linux-gnu-$sys.cmake \
         -DCMAKE_BUILD_TYPE=Release \
         -DGIT=git \
         $cmkflags \
         -DCMAKE_CXX_FLAGS_RELEASE:STRING="$cxxflags" \
         ;  \
   cmake --build $bdir --target run -j --verbose )
}

... and finally:

$ try464 "-O2"                 # fails with gcc14, succeeds with gcc13
$ try464 "-O2" "-DRYML_DBG=ON" # succeeds
$ try464 "-O1"                 # succeeds
$ try464 "-O3"                 # succeeds

FWIW, here's the error. Notice it's missing the keys:

[100%] running: /proj/rapidyaml/samples/add_subdirectory/build/powerpc64le-docker/ryml-quickstart
qemu-ppc64le-static /proj/rapidyaml/samples/add_subdirectory/build/powerpc64le-docker/ryml-quickstart
WTF1:
{"": "a deer, a female deer","": "a drop of golden sun","": "a name, I call myself","": "a long long way to go"}
/proj/rapidyaml/samples/quickstart.cpp:4293: ERROR: ryml::emitrs_json<std::string>(tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
/proj/rapidyaml/samples/quickstart.cpp:4294: OK! ryml::emitrs_json<std::string>(json_tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
WTF2:
{"": "a deer, a female deer","": "a drop of golden sun","": "a name, I call myself","": "a long long way to go"}
/proj/rapidyaml/samples/quickstart.cpp:4299: ERROR: ss.str() == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
Completed 3 checks.
ERROR: 2/3 checks failed.
make[3]: *** [CMakeFiles/run.dir/build.make:71: CMakeFiles/run] Error 2

EDIT: @musicinmybrain I looked again the error you posted above; notice how it's different:

./ryml-quickstart
WTF1:
{"": "","": "","": "","": ""}
/builddir/rapidyaml/samples/quickstart.cpp:4293: ERROR: ryml::emitrs_json<std::string>(tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
/builddir/rapidyaml/samples/quickstart.cpp:4294: OK! ryml::emitrs_json<std::string>(json_tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
WTF2:
{"": "","": "","": "","": ""}
/builddir/rapidyaml/samples/quickstart.cpp:4299: ERROR: ss.str() == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
Completed 3 checks.
ERROR: 2/3 checks failed.
musicinmybrain commented 2 weeks ago

Nice! At least the association with GCC 14 seems to be consistent, and we have a cross-distro reproduction.

biojppm commented 2 weeks ago

A further finding: this seems to be related with the compiler frontend. I tried with both gcc14.1 and 14.2 for x64, and was not able to reproduce.

biojppm commented 2 weeks ago

I narrowed down the quickstart error to this function:

template<class EventHandler>
csubstr ParseEngine<EventHandler>::_maybe_filter_key_scalar_dquot(ScannedScalar const& C4_RESTRICT sc)
{
    csubstr maybe_filtered = sc.scalar;
    if(sc.needs_filter)
    {
        if(m_options.scalar_filtering())
        {
            maybe_filtered = _filter_scalar_dquot(sc.scalar);
        }
        else
        {
            _c4dbgp("dquo scalar left unfiltered");
            m_evt_handler->mark_key_scalar_unfiltered();
        }
    }
    else
    {
        _c4dbgp("dquo scalar doesn't need filtering");
    }
    return maybe_filtered;
}

When this function is called with sc.needs_filter set to false, it is returning an empty string in maybe_filtered, even though the input string at sc.scalar is non-empty. This was causing the empty keys/values in the paste above.

As with #440, enabling RYML_DBG (or adding any manual prints inside the function) made the problem go away. There were also other weird behaviours I observed during debugging, and I can tell more about them if you wish.

But eventually I tried the following:

 template<class EventHandler>
 csubstr ParseEngine<EventHandler>::_maybe_filter_key_scalar_dquot(ScannedScalar const& C4_RESTRICT sc)
 {
-    csubstr maybe_filtered = sc.scalar;
     if(sc.needs_filter)
     {
         if(m_options.scalar_filtering())
         {
-            maybe_filtered = _filter_scalar_dquot(sc.scalar);
+            return _filter_scalar_dquot(sc.scalar);
         }
         else
         {
             _c4dbgp("dquo scalar left unfiltered");
             m_evt_handler->mark_key_scalar_unfiltered();
         }
     }
     else
     {
         _c4dbgp("dquo scalar doesn't need filtering");
     }
-    return maybe_filtered;
+    return sc.scalar;
 }

Which enabled the quickstart test to succeed.

After that, I did the same to every other _maybe_filter_*() function, and ran the test suite. Doing so enabled every test to pass 🤨.

I'm pushing the changeset now; please try on your end, and let me know what happens.

musicinmybrain commented 2 weeks ago

I repeated the exact steps in https://github.com/biojppm/rapidyaml/issues/464#issuecomment-2297446452, including the “unminimized” CXXFLAGS etc., and all seems well: the ODR violation warning is absent, and the test passes:

[100%] running: /builddir/rapidyaml/samples/add_subdirectory/build/ryml-quickstart
WTF1:
{"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"}
/builddir/rapidyaml/samples/quickstart.cpp:4293: OK! ryml::emitrs_json<std::string>(tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
/builddir/rapidyaml/samples/quickstart.cpp:4294: OK! ryml::emitrs_json<std::string>(json_tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
WTF2:
{"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"}
/builddir/rapidyaml/samples/quickstart.cpp:4299: OK! ss.str() == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
Completed 3 checks.
SUCCESS!
WTF1:
{"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"}
/builddir/rapidyaml/samples/quickstart.cpp:4293: OK! ryml::emitrs_json<std::string>(tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
/builddir/rapidyaml/samples/quickstart.cpp:4294: OK! ryml::emitrs_json<std::string>(json_tree) == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
WTF2:
{"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"}
/builddir/rapidyaml/samples/quickstart.cpp:4299: OK! ss.str() == R"({"doe": "a deer, a female deer","ray": "a drop of golden sun","me": "a name, I call myself","far": "a long long way to go"})"
Completed 3 checks.
SUCCESS!
[100%] Built target run

Next I’ll try applying https://github.com/biojppm/rapidyaml/compare/master...fix/464_ub_ppc64le as a patch to 0.7.1, and double-check that the full test suite passes on all architectures.

musicinmybrain commented 2 weeks ago

Next I’ll try applying master...fix/464_ub_ppc64le as a patch to 0.7.1, and double-check that the full test suite passes on all architectures.

That worked. I also managed a successful test-build of jsonnet, which is the only thing in Fedora that currently depends on rapidyaml. It would be nice to really understand the root cause, but at least everything seems to be working.

Thanks for taking the time to investigate all of these reports. Are you planning a release with the fixes, or should I patch 0.7.1?

biojppm commented 2 weeks ago

Don't patch, I'll cleanup this branch and make a new release.

I'm somewhat bothered by this situation (eg, this issue, and #440, and the previous problem referred therein).

The functions I now had to change look innocuous, and their call sites also look innocuous. It is remarkable that the optimizer is choking on this code, and it is even more remarkable that -O3 works while -O2 doesn't.

Even if UB is the ultimate reason, I find it hard to surmise how that is so. So do the many static analysis tools, and none of them point at anything suspicious.

Given that this issue is so specific (ie only gcc14+ppc64le+-O2), having a test in the CI does not make much sense (or otherwise the already gigantic size of this project's CI would balloon). So I'm just going to do the changes and merge them in.

musicinmybrain commented 2 weeks ago

If this turns out to be a bug/regression in the ppc64le frontend for GCC, it would not be the first one I’ve seen, e.g. https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=7dd3b2b09cbeb6712ec680a0445cb0ad41070423 which showed up in the tests for https://github.com/shibatch/sleef. Hard to characterize exactly what’s happening and distill it into something reportable, though.

musicinmybrain commented 2 weeks ago

The functions I now had to change look innocuous, and their call sites also look innocuous. It is remarkable that the optimizer is choking on this code, and it is even more remarkable that -O3 works while -O2 doesn't.

I remember that you said -O3 worked in your CI. It didn’t work in my local tests, though.

biojppm commented 2 weeks ago

Oh, right, the segfault. Well, I did try with -O3 only, but that was working.

I guess I'll try again with the full flags.

biojppm commented 2 weeks ago

Did you try -O3 with this changeset? If not, can you try it?

biojppm commented 2 weeks ago

I just tried again on the tip of the branch with gcc 14.2. All of these succeed:

try464 "-O2"
try464 "-O3"
try464 "-O2 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong"
try464 "-O3 -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong"
biojppm commented 2 weeks ago

Since #467 is a proven fix to at least some of the problems here, I'm merging it now; doing so will close the issue. Feel free to reopen if the segfault persists.

musicinmybrain commented 2 weeks ago

Did you try -O3 with this changeset? If not, can you try it?

I just now tried https://github.com/biojppm/rapidyaml/issues/464#issuecomment-2297446452 again, with the following differences:

All the tests passed.

biojppm commented 2 weeks ago

@musicinmybrain v0.7.2 is now available. Thanks for your help in chasing this issue.

musicinmybrain commented 2 weeks ago

@musicinmybrain v0.7.2 is now available. Thanks for your help in chasing this issue.

Thank you! This seems to be working well for me, so I’m going to update the rapidyaml package in Fedora from 0.6.0 to 0.7.2 in Fedora Rawhide (future Fedora 42) and in the upcoming Fedora 41. This will also be the initial version (and probably the only version due to ABI compatibility) of rapidyaml in EPEL10.