ColinPitrat / caprice32

An emulator of the Amstrad CPC 8bit home computer range.
GNU General Public License v2.0
146 stars 32 forks source link

Bus error on NetBSD 9.1/sparc64 #190

Open connorwit opened 3 years ago

connorwit commented 3 years ago

I compiled caprice32 4.6.0 on NetBSD 9.1/sparc64. When i try to run it i get a bus error.

Unfortunately i don't get any detailed point of error in the console or with gdb. What can i do to produce some debug output or gdb backtrace to possibly find the cause of the error?

Regards, Connor

ColinPitrat commented 3 years ago

You can build with:

make clean # to clean previous build
make debug

Then run with -v:

./cap32 -v | tee cap32.log

This should provide some useful information. I'm not sure why you don't get anything useful from gdb. Are you getting a core?

connorwit commented 3 years ago

Yes, I can provide a core if you need, but first to the debug problems: I have set the following variables and tried to compile caprice32:

DEBUG=TRUE
WITHOUT_DL=TRUE

It errored out with:

src/cap32.h:367:43: error: 'gcc_struct' attribute directive ignored [-Werror=attributes]
       } __attribute__((packed, gcc_struct));
                                           ^
cc1plus: all warnings being treated as errors

So i had remove -Werror from the DEBUG_FLAGS: DEBUG_FLAGS = -Werror -g -O0 -DDEBUG

Now it get a bus error when the testing framework is active and cannot complete the debug build:

[----------] 4 tests from argParseTest
[ RUN      ] argParseTest.cfgFileArgsSwitch
[       OK ] argParseTest.cfgFileArgsSwitch (0 ms)
[ RUN      ] argParseTest.replaceCap32KeysKeywords
[       OK ] argParseTest.replaceCap32KeysKeywords (0 ms)
[ RUN      ] argParseTest.replaceCap32KeysRepeatedKeywords
[       OK ] argParseTest.replaceCap32KeysRepeatedKeywords (0 ms)
[ RUN      ] argParseTest.replaceCap32KeysNoKeyword
[       OK ] argParseTest.replaceCap32KeysNoKeyword (0 ms)
[----------] 4 tests from argParseTest (1 ms total)

[----------] 4 tests from Zip
[ RUN      ] Zip.DirOnFileWithNoMatchingEntry
gmake: *** [makefile:272: unit_test] Bus error (core dumped)
*** Error code 2

Any idea what to do now?

ColinPitrat commented 3 years ago

The test doesn't complete but the binary is built, so you can try running caprice with debug logs, although I'm not sure this will help much:

./cap32 -v | tee cap32.log

Because the test fails on Zip.DirOnFileWithNoMatchingEntry which really doesn't do much (only call dir from the zip.cc file) this may be relatively easy to investigate. For example, you can add logs (with std::cout << message << std::endl) at regular interval in the method and rebuild to test it.

As the test run in random order, you may need to run it manually to run just the test that fails:

./test_runner --gtest_filter=Zip.DirOnFileWithNoMatchingEntry

But because launching cap32 alone shouldn't call the zip::dir function unless you provide a zip file in argument, it's likely that there are other issues elsewhere in the code. There may be some assumptions in some places that are violated on sparc64. I'd be interested to dig this more. I highly suspect the issue is unaligned access.

connorwit commented 3 years ago

Attached is a cap32.log of the debug binary and a .core of that run.

cap32-buserror.zip

ColinPitrat commented 3 years ago

I won't be able to open the core on a non-sparc64 machine. But having the core and a debug binary you can do:

$ gdb ./cap32 core
gdb> bt

Which should give you the exact line where it crashes. Unfortunately, if it's unaligned memory access it's likely to occur all over the code and will be very hard to fix.

connorwit commented 3 years ago

Result of the first debug is:

Core was generated by `cap32'.
Program terminated with signal SIGBUS, Bus error.
#0  0x000000000011dcc8 in prerender_sync_half () at src/crtc.cpp:817
817        *RendPos = dwVal;
[Current thread is 1 (process 1)]
(gdb) bt
#0  0x000000000011dcc8 in prerender_sync_half () at src/crtc.cpp:817
#1  0x000000000011e8f8 in crtc_cycle (repeat_count=1) at src/crtc.cpp:1066
#2  0x000000000019c518 in z80_execute () at src/z80.cpp:1004
#3  0x0000000000116e64 in cap32_main (argc=<optimized out>, argv=<optimized out>) at src/cap32.cpp:2286
#4  0x00000000001066bc in ___start ()
#5  0x000000004040178c in _rtld_start () from /usr/libexec/ld.elf_so
ColinPitrat commented 3 years ago

Weird, I would expect this access to be aligned as RendPos is a dword*. Maybe it's not unaligned access in the end?

Can you try the following in gdb (from the same core):

(gdb) p RendPos
connorwit commented 3 years ago
Core was generated by `cap32'.
Program terminated with signal SIGBUS, Bus error.
#0  0x000000000011dcc8 in prerender_sync_half () at src/crtc.cpp:817
817        *RendPos = dwVal;
[Current thread is 1 (process 1)]
(gdb) p RendPos
$1 = (dword *) 0x35c5d6 <RendBuff+6>
ColinPitrat commented 3 years ago

Ah indeed, RendPos is not the allocated thing, RendBuff is and it's byte[]. And we can point at word boundaries instead of dword boundaries in the half pixel case. Setting scr_style=1 should avoid this particular one, but is unlikely to make it work fully. I really suspect many other places will do unaligned access.

Fixing all of them (or even some) would potentially be interesting as they can have a performance impact even on other architectures. It's likely to be a lot of work and will not go well if I have to ask you to test and report for every single one of them. So I'd need a way to debug it myself.

ColinPitrat commented 3 years ago

-fsanitize=alignment should do it. This requires to add -lubsan when linking.

ColinPitrat commented 3 years ago

Bad news is that it doesn't detect any issue when running the Zip test case, so it may not be a silver bullet ...

connorwit commented 3 years ago

You could set up a qemu sparc64 with NetBSD, but i am not sure if this will help, since it has no working gpu emulation at the moment, only sparc 32bit has one. I used the caprice32 from pkgsrc. Packages for sparc64 are available at: https://pkg.zia.io/pub/pkgsrc/packages/NetBSD/sparc64/ I am unsure about other big endian systems and emulations.

connorwit commented 3 years ago

I have set scr_style=1, but the error jumps to another section of crtc.cpp:

#0  0x000000000011dc68 in prerender_sync () at src/crtc.cpp:805
805        *RendPos = dwVal;
ColinPitrat commented 3 years ago

So here is what it requires to verify alignment:

$ git diff makefile
diff --git a/makefile b/makefile
index aa60480..e604402 100644
--- a/makefile
+++ b/makefile
@@ -107,7 +107,7 @@ TEST_OBJECTS:=$(TEST_DEPENDS:.d=.o)

 WARNINGS = -Wall -Wextra -Wzero-as-null-pointer-constant -Wformat=2 -Wold-style-cast -Wmissing-include-dirs -Wlogical-op -Woverloaded-virtual -Wpointer-arith -Wredundant-decls
 COMMON_CFLAGS += $(CFLAGS) -std=c++17 $(IPATHS)
-DEBUG_FLAGS = -Werror -g -O0 -DDEBUG
+DEBUG_FLAGS = -Werror -g -O0 -DDEBUG -fsanitize=alignment
 RELEASE_FLAGS = -O2 -funroll-loops -ffast-math -fomit-frame-pointer -fno-strength-reduce -finline-functions -s
 BUILD_FLAGS = $(RELEASE_FLAGS)

@@ -117,6 +117,7 @@ ifndef DEBUG
 ifeq ($(LAST_BUILD_IN_DEBUG), 1)
 FORCED_DEBUG = 1
 DEBUG = 1
+LIBS += -lubsan
 endif
 endif

If the alignment of a buffer is the only issue, it can be fixed with something like:

-byte RendBuff[800];
+byte RendBuff[800] __attribute__((aligned (32)));

Unfortunately, it seems most of the issues won't be fixable (at least not easily).

PSG

For example, there's one in psg.cpp that comes from RegisterAY:

   union {
      unsigned char Index[16];
      struct {
         unsigned char TonALo, TonAHi;
         unsigned char TonBLo, TonBHi;
         unsigned char TonCLo, TonCHi;
         unsigned char Noise;
         unsigned char Mixer;
         unsigned char AmplitudeA, AmplitudeB, AmplitudeC;
         unsigned char EnvelopeLo, EnvelopeHi;
         unsigned char EnvType;
         unsigned char PortA;
         unsigned char PortB;
      };  
      struct {
         unsigned short TonA;
         unsigned short TonB;
         unsigned short TonC;
         unsigned char _noise, _mixer, _ampa, _ampb, _ampc;
         unsigned short Envelope;
         unsigned char _envtype, _porta, portb;
      } __attribute__((packed, gcc_struct));
   } RegisterAY;

Accessing Envelope as an unsigned short is, by design, unaligned. We could align it by reordering values or adding an unused char after ampc but this would mess with the register indexes, so all accesses through Index would need to be modified to map the register to a different offset. This is not unfeasible but adds requires some significant refactoring with unknown impact on performances (probably low though, registered are not used that often).

CRTC

More problematic is an issue in crtc.cpp:

byte RendBuff[800];
byte *RendWid, *RendOut;

(...)

   *RendPos = dwVal;

(...)

         dword val = (HorzPos & 0xf0) >> PosShift;

(...)

            RendPos = reinterpret_cast<dword *>(&RendBuff[val]);

Adding some log around the val variable shows that it can take any value (typically 12, then 9 (causing unaligned access), then 6, 5, 3, 2, 2, 1 ...). We cannot just do a few accesses at byte level to align and then continue with dword. I can try to ensure val is always aligned and see what it gives, but it's unlikely to be fine ...

ColinPitrat commented 3 years ago

OK, I gave a try at it on the aligned branch. It seems the situation is not as bad as I initially thought.

You can try it with the following:

git checkout aligned
make
./cap32

If it doesn't work, you can retry the make clean && make debug ; ./cap32 -v

connorwit commented 3 years ago

Preliminary result is that cap32 starts up without a bus error. When scr_style=1 i get a black screen with white fps and % readings in the upper left, if i set scr_style=8 i get a red screen with white ftp an % writings. It is very slow and didn't reach a visible CPC tartup screen. Also when i close cap32 i get a bus error, but i guess thats negligable for now. I have to analyze it a bit, maybe my OpenGL deactivation didn't work properly.

ColinPitrat commented 3 years ago

Ah, I didn't realize sparc64 was big endian! It's not surprising that it fails. Caprice32 assumes little endianness in many places.

You can try:

git pull
make unit_test
./test_runner --gtest_filter=CrtcTest.NewDtCombinedAccess

I'd expect this new CrtcTest.NewDtCombinedAccess to fail because of the endianness. I'll look into the extent of things that would need to be fixed to work on a big endian architecture.

ColinPitrat commented 3 years ago

I fixed the obvious places where little endianness is assumed. I would be surprised if it's enough to work, but let me know at least if the unit tests pass.

connorwit commented 3 years ago

I have a problem compiling the new tests:

makefile:76: Notice: APP_PATH not specified.  Will look for cap32.cfg debug-style.  See `README.md` for details. 
g++ -c -O2 -funroll-loops -ffast-math -fomit-frame-pointer -fno-strength-reduce -finline-functions -s -fPIC -DHASH=\"b626b7318852beacd8910c98e0e58d4f6e70a908\"  -std=c++17 -Isrc/ -Isrc/gui/includes `pkg-config --cflags freetype2` `sdl2-config --cflags` `pkg-config --cflags libpng` -Igoogletest/googletest//include -Igoogletest/googletest/ -Igoogletest/googlemock//include -Igoogletest/googlemock/ -o obj/linux/test/keyboard.o test/keyboard.cpp
test/keyboard.cpp:2:10: fatal error: filesystem: No such file or directory
 #include <filesystem>
          ^~~~~~~~~~~~

The only "filesystem" i found on my NetBSD didn't have the required declaration:

test/keyboard.cpp: In member function 'virtual void KeyboardTest_parseAllLayoutFiles_Test::TestBody()':
test/keyboard.cpp:8:34: error: 'std::filesystem' has not been declared
   for (const auto & entry : std::filesystem::directory_iterator("./resources"))
                                  ^~~~~~~~~~
ColinPitrat commented 3 years ago

That's weird, this test is not new so if you managed to compile tests before, it should still work. This is a standard header for C++ introduced in C++17: https://en.cppreference.com/w/cpp/header/filesystem

Did you change something in your system? Anyway, even if the tests don't work you can try the emulator itself to see if it's better with the fix on unions.

You can also skip this particular test by just removing the file test/keyboard.cpp

What gcc version are you using? (what's the output of gcc -v?)

connorwit commented 3 years ago

Before i was compiling the cacprice32 4.6.0 from the package distribution of NetBSD. test/keyboard.cpp is not present there, when it retrieves the original source file.


bash-5.0# gcc -v 
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/lto-wrapper
Target: sparc64--netbsd
Configured with: /usr/src/tools/gcc/../../external/gpl3/gcc/dist/configure --target=sparc64--netbsd --enable-long-long --enable-threads --with-bugurl=http://www.NetBSD.org/Misc/send-pr.html --with-pkgversion='NetBSD nb4 20200810' --with-system-zlib --without-isl --enable-__cxa_atexit --enable-libstdcxx-time=rt --enable-libstdcxx-threads --with-diagnostics-color=auto-if-env --with-default-libstdcxx-abi=new --with-mpc-lib=/var/obj/mknative/sparc64/usr/src/external/lgpl3/mpc/lib/libmpc --with-mpfr-lib=/var/obj/mknative/sparc64/usr/src/external/lgpl3/mpfr/lib/libmpfr --with-gmp-lib=/var/obj/mknative/sparc64/usr/src/external/lgpl3/gmp/lib/libgmp --with-mpc-include=/usr/src/external/lgpl3/mpc/dist/src --with-mpfr-include=/usr/src/external/lgpl3/mpfr/dist/src --with-gmp-include=/usr/src/external/lgpl3/gmp/lib/libgmp/arch/sparc64 --enable-tls --disable-multilib --disable-libstdcxx-pch --build=sparc64--netbsd --host=sparc64--netbsd --with-sysroot=/var/obj/mknative/sparc64/usr/src/destdir.sparc64
Thread model: posix
gcc version 7.5.0 (nb4 20200810) 
connorwit commented 3 years ago

Attached is a run of the test suite without keyboard.cpp:

20210401-cap32-testresult.txt

connorwit commented 3 years ago

We are getting somewhere. Attached is a picture with OpenGL enabled. It is blurred, the colors are wrong and it is very slow (the machine has no GL-acceleration). It is also very very slow reacting to keyboard presses. 2021-04-01-220744_768x540_scrot

ColinPitrat commented 3 years ago

Awesome, that's some good progress. Both the issue with colors and with characters are a byte order issue. What value of scr_style are you using? You can try various ones to see if performances improve. You can also rebuild in release with make clean ; make which should improve performances significantly.

I setup a build with a big endian arch on Travis and I do get the same test failure for Zip (https://travis-ci.org/github/ColinPitrat/caprice32/jobs/765533669). I don't get the one for the configuration but that's not a big deal, it's just a test that verifies that writing to a read-only file fails. Not sure why it's not the case in your case, either that the write to the read-only file succeeds (which would be very weird) or that the error is not properly surfaced by the API either. Could be NetBSD specific.

I will probably not have a look at this issue in the coming days but I won't forget it.

connorwit commented 3 years ago

Probably writing to read-only file succeeds because i'm running stuff on these machines as root :)

ColinPitrat commented 3 years ago

Oh yeah, that explains :-)