NCAR / VAPOR

VAPOR is the Visualization and Analysis Platform for Ocean, Atmosphere, and Solar Researchers
https://www.vapor.ucar.edu/
BSD 3-Clause "New" or "Revised" License

Vapor crashing on NOAA's HERA Cluster #2961

Closed sgpearse closed 11 months ago

sgpearse commented 2 years ago

Gyorgy Fekete and Ka Yee Wong from NOAA report that Vapor is crashing on the HERA GPU cluster, with the following error:

$ ./VAPOR3-3.5.0-Linux/bin/vapor
qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.

Available platform plugins are: xcb.

Aborted

There is a good deal of discussion about this error on the Qt forums.

Two possible causes:

1) Running on a headless server
2) A missing libxcb library

Follow-up questions:

1) Is there a GLX forwarding application like TurboVNC or TigerVNC running when you start Vapor?
2) Can you run Vapor with the environment variable QT_DEBUG_PLUGINS=1?
3) Is there a libxcb.so library installed on HERA? When I run ldd on Vapor on our graphics cluster, Casper, it finds libxcb.so.1, as shown below. If ldd is run on Vapor on HERA, is an equivalent found?

Summarized output of ldd run on Casper's version of Vapor

bash-4.2$ ldd /glade/u/apps/dav/opt/vapor/3.5.0/bin/vapor
        libxcb.so.1 => /lib64/libxcb.so.1 (0x00002b2847c02000)

Full output of ldd run on Casper's installation of Vapor:

bash-4.2$ ldd /glade/u/apps/dav/opt/vapor/3.5.0/bin/vapor
        linux-vdso.so.1 =>  (0x00007ffc511a9000)
        libdlfaker.so => /usr/local/lib64/libdlfaker.so (0x00002b28451d9000)
        libvglfaker.so => /usr/local/lib64/libvglfaker.so (0x00002b28453dc000)
        libc.so.6 => /lib64/libc.so.6 (0x00002b284569f000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002b2845a6d000)
        libGL.so.1 => /lib64/libGL.so.1 (0x00002b2845c71000)
        libturbojpeg.so.0 => /usr/local/lib64/libturbojpeg.so.0 (0x00002b2845f11000)
        libX11.so.6 => /lib64/libX11.so.6 (0x00002b28461c3000)
        libXext.so.6 => /lib64/libXext.so.6 (0x00002b2846501000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b2846713000)
        libssl.so.10 => /lib64/libssl.so.10 (0x00002b284692f000)
        libcrypto.so.10 => /lib64/libcrypto.so.10 (0x00002b2846ba1000)
        libstdc++.so.6 => /glade/u/apps/dav/opt/gnu/9.1.0/lib64/libstdc++.so.6 (0x00002b2847004000)
        libm.so.6 => /lib64/libm.so.6 (0x00002b28473dd000)
        /lib64/ld-linux-x86-64.so.2 (0x00002b2844fb5000)
        libGLX.so.0 => /lib64/libGLX.so.0 (0x00002b28476df000)
        libGLdispatch.so.0 => /lib64/libGLdispatch.so.0 (0x00002b2847911000)
        libxcb.so.1 => /lib64/libxcb.so.1 (0x00002b2847c02000)
        libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x00002b2847e2a000)
        libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00002b2848077000)
        libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00002b2848360000)
        libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00002b2848564000)
        libz.so.1 => /lib64/libz.so.1 (0x00002b2848797000)
        libgcc_s.so.1 => /glade/u/apps/dav/opt/gnu/9.1.0/lib64/libgcc_s.so.1 (0x00002b28489ad000)
        libXau.so.6 => /lib64/libXau.so.6 (0x00002b2848bc5000)
        libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x00002b2848dc9000)
        libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00002b2848fd9000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x00002b28491dd000)
        libselinux.so.1 => /lib64/libselinux.so.1 (0x00002b28493f7000)
        libpcre.so.1 => /lib64/libpcre.so.1 (0x00002b284961e000)

gyorgy-fekete commented 2 years ago

Thanks for starting this, Scott. I'll answer the question this weekend. We don't have turbo or tiger vnc, but we do use x2go and MATE for logging in on a front-end node.

We port forward X11 traffic for "normal" graphics stuff. A GPU node has 8 Tesla P-100's with 16GB of device memory each.

I'll do more looking tomorrow... Cheers, George

gyorgy-fekete commented 2 years ago

Thank you for these questions. I'll do my best to answer them:

1. Is there a glx forwarding application like TurboVNC or TigerVNC being run when you are starting Vapor?

There is an x2go server, and an x2go client on the localhost. No [Turbo,Tiger]VNC.

2. Can you run Vapor with the environment variable QT_DEBUG_PLUGINS=1 ?

QFactoryLoader::QFactoryLoader() checking directory path "/tds_scratch1/SYSADMIN/nesccmgmt/Gyorgy.Fekete/vapor/VAPOR3-3.5.0-Linux/lib/platforms" ...
QFactoryLoader::QFactoryLoader() looking at "/tds_scratch1/SYSADMIN/nesccmgmt/Gyorgy.Fekete/vapor/VAPOR3-3.5.0-Linux/lib/platforms/libqxcb.so"
Found metadata in lib /tds_scratch1/SYSADMIN/nesccmgmt/Gyorgy.Fekete/vapor/VAPOR3-3.5.0-Linux/lib/platforms/libqxcb.so, metadata=
{
    "IID": "org.qt-project.Qt.QPA.QPlatformIntegrationFactoryInterface.5.3",
    "MetaData": {
        "Keys": [
            "xcb"
        ]
    },
    "archreq": 0,
    "className": "QXcbIntegrationPlugin",
    "debug": false,
    "version": 331008
}

Got keys from plugin meta data ("xcb")
Cannot load library /tds_scratch1/SYSADMIN/nesccmgmt/Gyorgy.Fekete/vapor/VAPOR3-3.5.0-Linux/lib/platforms/libqxcb.so: (libxkbcommon-x11.so.0: cannot open shared object file: No such file or directory)
QLibraryPrivate::loadPlugin failed on "/tds_scratch1/SYSADMIN/nesccmgmt/Gyorgy.Fekete/vapor/VAPOR3-3.5.0-Linux/lib/platforms/libqxcb.so" : "Cannot load library /tds_scratch1/SYSADMIN/nesccmgmt/Gyorgy.Fekete/vapor/VAPOR3-3.5.0-Linux/lib/platforms/libqxcb.so: (libxkbcommon-x11.so.0: cannot open shared object file: No such file or directory)"
qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.

Available platform plugins are: xcb.
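
The debug output above points at the actual failure: libqxcb.so cannot be loaded because libxkbcommon-x11.so.0 is not found. Running ldd on the plugin itself, rather than on bin/vapor, should list every unresolved dependency in one pass. A minimal sketch, assuming the plugin sits under lib/platforms as in the log above; `missing_libs` is a hypothetical helper name, not part of Vapor or Qt:

```shell
# Print the names of shared libraries that ldd reports as "not found".
# Reads ldd output on stdin so it can sit at the end of a pipeline.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage, from inside the unpacked VAPOR3-3.5.0-Linux directory:
#   ldd ./lib/platforms/libqxcb.so | missing_libs
```

Any library this prints would need to be installed on the node or made visible through LD_LIBRARY_PATH before the xcb plugin can load.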

3. Is there a libxcb.so library installed on HERA?

lrwxrwxrwx 1 root root     15 Nov  9 16:00 /lib64/libxcb.so -> libxcb.so.1.1.0
lrwxrwxrwx 1 root root     15 Nov  9 15:59 /lib64/libxcb.so.1 -> libxcb.so.1.1.0
-rwxr-xr-x 1 root root 165976 Oct 30  2018 /lib64/libxcb.so.1.1.0

but ldd on vapor shows this:

$ ldd ./bin/vapor
    linux-vdso.so.1 =>  (0x00007ffd787da000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fa815bf6000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fa815fc4000)

Cheers, George

gyorgy-fekete commented 2 years ago

For what it's worth, here is an excerpt from glxinfo:

GLX version: 1.2
GLX extensions:
    GLX_ARB_get_proc_address, GLX_ARB_multisample, GLX_EXT_import_context, 
    GLX_EXT_visual_info, GLX_EXT_visual_rating, GLX_SGIX_fbconfig
OpenGL vendor string: Mesa project: www.mesa3d.org
OpenGL renderer string: Mesa GLX Indirect
OpenGL version string: 1.2 (1.5 Mesa 6.4.2)
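
The renderer string "Mesa GLX Indirect" suggests that OpenGL calls are going through the X server of the remote session rather than being rendered directly on the node's GPUs, which would also explain the very old reported version. A small sketch for checking this on any node; `is_indirect_glx` is a hypothetical helper name:

```shell
# Succeeds if glxinfo output (on stdin) reports indirect rendering, either
# through the "direct rendering: No" line or an "Indirect" renderer string.
is_indirect_glx() {
    grep -Eq '^direct rendering: No|^OpenGL renderer string: .*Indirect'
}

# Usage:
#   glxinfo | is_indirect_glx && echo "indirect rendering detected"
```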

I have a suspicion that this version may be a wee bit long in the tooth, but I could be wrong. It is not possible to replace the system-wide GL libraries without a big review, but if you tell me that I can use an alternate bunch of GL libraries in a private prefix location, I can certainly build them.

Cheers, George

gyorgy-fekete commented 2 years ago

Hi Vapor Team! This is not directly related to the Hera cluster issue, but I tried to install Vapor 3.5.0 on a machine under my control (a virtual CentOS 7 box under Parallels, free of firewall rules) and ran into different problems.

When I start up Vapor on the virtual disk, I do not get the xcb message I am getting on Hera; I get this:

$ ./VAPOR3-3.5.0-Linux/bin/vapor
qt.network.ssl: QSslSocket: cannot resolve OPENSSL_init_ssl
qt.network.ssl: QSslSocket: cannot resolve OPENSSL_init_crypto
qt.network.ssl: QSslSocket: cannot resolve ASN1_STRING_get0_data
qt.network.ssl: QSslSocket: cannot resolve EVP_CIPHER_CTX_reset
qt.network.ssl: QSslSocket: cannot resolve RSA_bits
qt.network.ssl: QSslSocket: cannot resolve OPENSSL_sk_new_null
qt.network.ssl: QSslSocket: cannot resolve OPENSSL_sk_push
qt.network.ssl: QSslSocket: cannot resolve OPENSSL_sk_free
qt.network.ssl: QSslSocket: cannot resolve OPENSSL_sk_num
qt.network.ssl: QSslSocket: cannot resolve OPENSSL_sk_pop_free
qt.network.ssl: QSslSocket: cannot resolve OPENSSL_sk_value
qt.network.ssl: QSslSocket: cannot resolve DH_get0_pqg
qt.network.ssl: QSslSocket: cannot resolve SSL_CTX_set_options
qt.network.ssl: QSslSocket: cannot resolve SSL_CTX_set_ciphersuites
qt.network.ssl: QSslSocket: cannot resolve SSL_set_psk_use_session_callback
qt.network.ssl: QSslSocket: cannot resolve SSL_get_client_random
qt.network.ssl: QSslSocket: cannot resolve SSL_SESSION_get_master_key
qt.network.ssl: QSslSocket: cannot resolve SSL_session_reused
qt.network.ssl: QSslSocket: cannot resolve SSL_set_options
qt.network.ssl: QSslSocket: cannot resolve TLS_method
qt.network.ssl: QSslSocket: cannot resolve TLS_client_method
qt.network.ssl: QSslSocket: cannot resolve TLS_server_method
qt.network.ssl: QSslSocket: cannot resolve X509_up_ref
qt.network.ssl: QSslSocket: cannot resolve X509_STORE_CTX_get0_chain
qt.network.ssl: QSslSocket: cannot resolve X509_getm_notBefore
qt.network.ssl: QSslSocket: cannot resolve X509_getm_notAfter
qt.network.ssl: QSslSocket: cannot resolve X509_get_version
qt.network.ssl: QSslSocket: cannot resolve X509_STORE_set_ex_data
qt.network.ssl: QSslSocket: cannot resolve X509_STORE_get_ex_data
qt.network.ssl: QSslSocket: cannot resolve OpenSSL_version_num
qt.network.ssl: QSslSocket: cannot resolve OpenSSL_version
qt.network.ssl: Incompatible version of OpenSSL
qt.network.ssl: QSslSocket::connectToHostEncrypted: TLS initialization failed
qt.network.ssl: QSslSocket::connectToHostEncrypted: TLS initialization failed

I think I can fix this with a few updates...

What I recommend and wish for is a section in the docs about "must haves" on the platform targeted for Vapor installation. I have not looked into the sources yet; the above experience is based on the centos shell archive (VAPOR3-3.5.0-CentOS7.sh)

If there is already a section on this, and I did not find it because I am lazy :) then I apologise

Best, George

sgpearse commented 2 years ago

One quick note:

What I recommend and wish for is a section in the docs about "must haves" on the platform targeted for Vapor installation. I have not looked into the sources yet; the above experience is based on the centos shell archive (VAPOR3-3.5.0-CentOS7.sh)

You have a special case that requires GL Forwarding. Last year we had a few exercises trying to get TurboVNC to work. There are other clients that do this like TigerVNC and MuVNC.

HERA 100% must have a GLX forwarding VNC server running, and the user needs to run the appropriate VNC client software too. You won't be able to use any visualization software on HERA without GL forwarding.

gyorgy-fekete commented 2 years ago

PROGRESS: I found out that my CentOS box (not HERA) had OpenSSL 1.0.2k.

I got 1.1.1c (not even the latest one) from openssl.org/source and built myself a new one.

That seems to have fixed the "qt.network.ssl: QSslSocket: cannot resolve OPENSSL_init_ssl" ... issue! (wooHoo!)
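
The unresolved symbols in the earlier log (OPENSSL_init_ssl, TLS_method and so on) belong to the OpenSSL 1.1.0 and later API, which is consistent with 1.0.2k failing and 1.1.1c working. A quick check, sketched with a hypothetical helper `openssl_is_1_1_or_newer` that parses the string printed by `openssl version`:

```shell
# Succeeds if the given `openssl version` string reports 1.1.0 or newer.
# Example inputs: "OpenSSL 1.0.2k-fips ..." (fails), "OpenSSL 1.1.1c ..." (succeeds)
openssl_is_1_1_or_newer() {
    ver=$(printf '%s\n' "$1" | awk '{ print $2 }')   # e.g. 1.0.2k-fips
    major=${ver%%.*}                                  # e.g. 1
    rest=${ver#*.}                                    # e.g. 0.2k-fips
    minor=${rest%%.*}                                 # e.g. 0
    [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 1 ]; }
}

# Usage:
#   openssl_is_1_1_or_newer "$(openssl version)" || echo "OpenSSL older than 1.1.0"
```

Note that this only inspects the `openssl` command on the PATH; the library Qt actually loads at run time may be a different copy.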

Now moving on to the GL thing... Is TurboVNC / VirtualGL the recommended coupling?

Cheers, George

sgpearse commented 2 years ago

That seems to have fixed the "qt.network.ssl: QSslSocket: cannot resolve OPENSSL_init_ssl" ... issue! (wooHoo!)

Thank you for reporting this. It's been a hard one to track because it's not completely reproducible on any given system. You're not the first to run into it. I'll make a new issue here that I think we'll probably address before our 3.6 release.

Now moving on to the GL thing... Is TurboVNC / VirtualGL the recommended coupling?

VirtualGL is imperative. No getting around that one.

TurboVNC is the client I've exclusively used for almost a decade. I've tried muVNC and TigerVNC but never became an adopter. TurboVNC keeps on surviving our HPC team's efforts to replace it. While I've never administrated it, I'd endorse it because it's reliably stood the test of time and it's free.
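
In practice that pairing usually means launching Vapor through VirtualGL's `vglrun` wrapper from a terminal inside the VNC session. vglrun preloads VirtualGL's interposer libraries, which is presumably why libdlfaker.so and libvglfaker.so appear near the top of the Casper ldd output earlier in this thread. A small sketch, assuming VirtualGL is installed on the GPU node; `vgl_preloaded` is a hypothetical helper name:

```shell
# Succeeds if the given LD_PRELOAD value contains VirtualGL's libvglfaker.
vgl_preloaded() {
    case "$1" in
        *libvglfaker*) return 0 ;;
        *)             return 1 ;;
    esac
}

# Usage, from a terminal inside the VNC session:
#   vglrun ./VAPOR3-3.5.0-Linux/bin/vapor
#   vglrun sh -c 'echo "$LD_PRELOAD"'    # should list the faker libraries
```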

sgpearse commented 11 months ago

Closing; reopen in case the libxcb issue arises with currently supported operating systems.