TigerVNC / tigervnc

High performance, multi-platform VNC client and server
https://tigervnc.org
GNU General Public License v2.0
5.25k stars 956 forks source link

Tigervnc-1.14.0 crashes with SIGBUS on SPARC architecture #1859

Open SadClouds opened 4 weeks ago

SadClouds commented 4 weeks ago

This is a regression since 1.13.1 below is a backtrace

Program received signal SIGBUS, Bus error.
rfb::EncodeManager::checkSolidTile<unsigned char*> (this=0x407d51b8, pb=0x407dc008, colourValue=0xffffffffffffc73c "", r=...) at /opt/netbsd/pkg.objects/net/tigervnc/work.ultra10/tigervnc-1.14.0/common/rfb/EncodeManager.cxx:1081

(gdb) bt
#0  rfb::EncodeManager::checkSolidTile<unsigned char*> (this=0x407d51b8, pb=0x407dc008, colourValue=0xffffffffffffc73c "", r=...) at /opt/netbsd/pkg.objects/net/tigervnc/work.ultra10/tigervnc-1.14.0/common/rfb/EncodeManager.cxx:1081
#1  rfb::EncodeManager::findSolidRect (this=this@entry=0x407d51b8, rect=..., changed=changed@entry=0xffffffffffffc908, pb=pb@entry=0x407dc008) at /opt/netbsd/pkg.objects/net/tigervnc/work.ultra10/tigervnc-1.14.0/common/rfb/EncodeManager.cxx:686
#2  0x0000000000311330 in rfb::EncodeManager::writeSolidRects (this=this@entry=0x407d51b8, changed=changed@entry=0xffffffffffffc908, pb=pb@entry=0x407dc008) at /opt/netbsd/pkg.objects/net/tigervnc/work.ultra10/tigervnc-1.14.0/common/rfb/EncodeManager.cxx:658
#3  0x0000000000313664 in rfb::EncodeManager::doUpdate (this=this@entry=0x407d51b8, allowLossy=allowLossy@entry=true, changed_=..., copied=..., copyDelta=..., pb=pb@entry=0x407dc008, renderedCursor=0x40720418)
    at /opt/netbsd/pkg.objects/net/tigervnc/work.ultra10/tigervnc-1.14.0/common/rfb/EncodeManager.cxx:360
#4  0x000000000031376c in rfb::EncodeManager::writeUpdate (this=this@entry=0x407d51b8, ui=..., pb=0x407dc008, renderedCursor=renderedCursor@entry=0x40720418) at /opt/netbsd/pkg.objects/net/tigervnc/work.ultra10/tigervnc-1.14.0/common/rfb/EncodeManager.cxx:284
#5  0x000000000030c518 in rfb::VNCSConnectionST::writeDataUpdate (this=this@entry=0x407d4e00) at /opt/netbsd/pkg.objects/net/tigervnc/work.ultra10/tigervnc-1.14.0/common/rfb/VNCSConnectionST.cxx:1033
#6  0x000000000030c980 in rfb::VNCSConnectionST::writeFramebufferUpdate (this=this@entry=0x407d4e00) at /opt/netbsd/pkg.objects/net/tigervnc/work.ultra10/tigervnc-1.14.0/common/rfb/VNCSConnectionST.cxx:914
#7  0x000000000030ca58 in rfb::VNCSConnectionST::processMessages (this=0x407d4e00) at /opt/netbsd/pkg.objects/net/tigervnc/work.ultra10/tigervnc-1.14.0/common/rfb/VNCSConnectionST.cxx:192
#8  0x00000000002f24bc in XserverDesktop::handleSocketEvent (this=0x0, this@entry=0x407dc000, fd=fd@entry=7, sockserv=0x40720300, read=read@entry=true, write=write@entry=false) at XserverDesktop.cc:363
#9  0x00000000002f2550 in XserverDesktop::handleSocketEvent (this=0x407dc000, fd=<optimized out>, read=<optimized out>, write=<optimized out>) at XserverDesktop.cc:315
#10 0x00000000002d0538 in HandleNotifyFd (fd=<optimized out>, xevents=<optimized out>, data=0x40bfc520) at connection.c:818
#11 0x00000000002d52a8 in ospoll_wait (ospoll=0x405854c0, timeout=<optimized out>) at ospoll.c:681
#12 0x00000000002cd858 in WaitForSomething (are_ready=<optimized out>) at WaitFor.c:208
#13 0x000000000027c1fc in Dispatch () at dispatch.c:421
#14 0x0000000000280ea0 in dix_main (argc=<optimized out>, argv=0xffffffffffffd2f8, envp=<optimized out>) at main.c:276
#15 0x000000000017d7fc in ___start ()
#16 0x000000004060178c in _rtld_start () from /usr/libexec/ld.elf_so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
CendioOssman commented 4 weeks ago

Thank you for your report.

Do you think you would be able to do a git bisect to narrow down the commit that caused this issue?

SadClouds commented 4 weeks ago

The issue appears to be on this line (whichever git commit that is) tigervnc-1.14.0/common/rfb/EncodeManager.cxx:1081

SIGBUS is a misaligned load/store. I'm not a C++ programmer, but normally casting a pointer to a different type and dereferencing it, causes these types of issues.

CendioOssman commented 4 weeks ago

That code is unfortunately very complex, so the crashing line is likely not where the issue originates. And we don't have any Sparc machines here, so we have a hard time tracking this down without someone helping to pinpoint where things went wrong.

SadClouds commented 4 weeks ago

Which is why I'm trying to help. The SPARC machine I have is very slow, it takes about half a day to build one VNC package, so bisecting many different versions and building them could take days and I don't have the capacity for this. The backtrace points exactly where the alignment issue occurred. I don't use C++ much, hence after looking at it quickly, it is not obvious to me why data access in this template is misaligned. It may be a better approach for developers to review the code around this line and suggest possible fixes. I'm happy to test them and report.

The problem originates on line 1075 buffer = (const T*)pb->getBuffer(r, &stride);

where getBuffer() returns a pointer, which is then cast to pointer to object T. The pointer returned is not aligned properly for object T. This looks like bad C++ programming.

SadClouds commented 3 weeks ago

I added some debug statements

template<class T>
inline bool EncodeManager::checkSolidTile(const Rect& r,
                                          const T colourValue,
                                          const PixelBuffer *pb)
{
  int w, h;
  const T* buffer;
  int stride, pad;

  w = r.width();
  h = r.height();

  buffer = (const T*)pb->getBuffer(r, &stride);
  vlog.error("buffer=%p", buffer);
  pad = stride - w;

  while (h--) {
    int w_ = w;
    while (w_--) {
      vlog.error("buffer=%p, h=%d, w=%d", buffer, h, w_);
      if (*buffer != colourValue)
        return false;
      buffer++;
      vlog.error("buffer++=%p", buffer);
    }
    buffer += pad;
  }

  return true;
}
Wed Oct 30 09:12:18 2024
 VNCSConnST:  Server default pixel format depth 24 (32bpp) big-endian rgb888
 VNCSConnST:  Client pixel format depth 24 (32bpp) little-endian rgb888
 EncodeManager: buffer=0x46e00400
 EncodeManager: buffer=0x46e00400, h=15, w=15
 EncodeManager: buffer=0x46e00440
 EncodeManager: buffer=0x46e00440, h=15, w=15
 EncodeManager: buffer=0x46e00480
...
 EncodeManager: buffer=0x47259080, h=15, w=15
 EncodeManager: buffer=0x472590c0
 EncodeManager: buffer=0x472590c0, h=15, w=8
 EncodeManager: buffer=0x47259124
 EncodeManager: buffer=0x47259124, h=15, w=15
(EE) Bus error at address 0x47259124
(EE) 
Fatal server error:
(EE) Caught signal 10 (Bus error). Server aborting
(EE) 
X connection to :1 broken (explicit kill or server shutdown).
Killing Xvnc process ID 7712
Xvnc process ID 7712 already killed

It seems the difference between successive buffer pointer addresses is 64 bytes, the last one which causes SIGBUS is 0x47259124 - 0x472590c0 = 100 bytes. Either way, looks like the buffer pointer may be pointing at a bogus address.

CendioOssman commented 3 weeks ago

0x47259124 seems aligned well enough for 32-bit pixels. Any idea why it is still giving us SIGBUS?

SadClouds commented 3 weeks ago

Because pointers on sparc64 must be aligned on 8 byte boundaries

gdb -p $(pgrep Xvnc)
...
Program received signal SIGBUS, Bus error.
...
(gdb) print buffer
$9 = (unsigned char * const *) 0x47259124
(gdb) print sizeof(buffer)
$10 = 8
(gdb) print 0x47259124 % 8
$11 = 4

So

if (*buffer != colourValue)

is trying to dereference a pointer which is only aligned on 4 byte boundary. Hence this code is broken in the functions which allocate and return the address for that buffer. That pointer must be aligned on 8 byte boundary, i.e. buffer % 8 must be 0.

CendioOssman commented 3 weeks ago

That is something I don't think we can easily support. The graphics stack expects packed pixels. And we need to be able to access individual pixels.

And are you sure about that requirement? Every hit I get states that sparc64 has the normal alignment requirements of alignment matching the type size.

SadClouds commented 3 weeks ago

I got my previous gdb statement a bit wrong. The buffer variable seems to be a pointer to pointer, this is where I think the problem is

(gdb) p colourValue
$14 = (unsigned char * const) 0xffffffffffffc72c ""
(gdb) p buffer
$15 = (unsigned char * const *) 0x47259124
(gdb) p *buffer
$16 = (unsigned char * const) 0x45444f0045444f <error: Cannot access memory at address 0x45444f0045444f>
(gdb) p 0x45444f0045444f % 8
$17 = 7

So *buffer results in a data type unsigned char * which is a pointer and must be aligned on 8 bytes. The line *buffer != colourValue seems to compare two pointers, but *buffer results misaligned pointer load, causing SIGBUS. On sparc64 pointers are 8 bytes hence the alignment issue.

SadClouds commented 3 weeks ago

You have the same issue on x86, but instead of SIGBUS, CPU will instead use misaligned load, which could be less efficient, i.e. instead of one load, it may issue two loads.