checkpoint-restore / criu

Criu check fails on NI Linux RT #1864

Open ImreSzebelledi opened 2 years ago

ImreSzebelledi commented 2 years ago

Description

Hello! I am trying to get CRIU working on a Linux distribution released by National Instruments for an industrial controller (NI-3172), but I am running into difficulties. I have built the kernel with the required flags, but `criu check` still reports errors, even when run with root privileges. Thank you in advance for the help!

Steps to reproduce the issue:

  1. Build https://github.com/ni/linux.git -- branch: nilrt/21.8/5.15
  2. Install criu with dependencies
  3. Run criu check

Output of `criu --version`:

```
Version: 3.12
```

Output of `criu check --all`:

```
Error (criu/libnetlink.c:55): -95 reported by netlink: Operation not supported
Error (criu/net.c:3225): Unable to create a veth pair: -95
Warn  (criu/net.c:3247): NSID isn't reported for network links
Error (criu/arch/x86/kerndat.c:189): Continue after SIGSTOP.. Urr what?

[4]+  Stopped                 criu check --all
```

Additional environment details:

Kernel version: Linux NI-IC-3172-01E67A74 5.15.26-rt34-g2360492e22b4 #1 SMP PREEMPT_RT Tue May 3 11:22:29 CEST 2022 x86_64 GNU/Linux

adrianreber commented 2 years ago

This is indeed an unusual error. Can you try CRIU 3.16? Not sure that helps, but maybe.

Not sure if CRIU works on the RT kernel. Can you run criu check --all -v4? Can you also try without RT?

mihalicyn commented 2 years ago

He-he. Very interesting... failure similar to static/bridge fail from https://github.com/checkpoint-restore/criu/pull/1862

mihalicyn commented 2 years ago

@ImreSzebelledi can you post your kernel build config and lsmod output?

ImreSzebelledi commented 2 years ago

Thank you very much for the suggestions!

Unfortunately, I am having trouble installing version 3.16 for various reasons. This distro has no proper package manager (only opkg), so I got 3.12 working by copying the prebuilt files from CentOS 7. I tried to do the same from Ubuntu 22 for 3.16, but I got various errors because it has glibc dependencies that are not available on NI Linux RT. At that point I gave up copying and started trying to build 3.16 from source on NI Linux RT, but I am still having difficulty building protobuf on it first... sigh. In the meantime, I have attached the kernel config and the lsmod output: kernelconfig.txt.

lsmod:

admin@NI-IC-3172-01E67A74:~# lsmod
Module                  Size  Used by
tmp421                 16384  0
g_ether                16384  0
u_ether                24576  1 g_ether
libcomposite           61440  1 g_ether
udc_core               61440  2 u_ether,libcomposite
hid_logitech_hidpp     40960  0
mousedev               20480  0
ipv6                  454656  25
hid_logitech_dj        28672  0
x86_pkg_temp_thermal   16384  0
coretemp               16384  0
aesni_intel           376832  0
i2c_i801               28672  0
crypto_simd            16384  1 aesni_intel
i915                 1978368  5
i2c_smbus              16384  1 i2c_i801
intel_gtt              20480  1 i915
lpc_ich                28672  0
mfd_core               16384  1 lpc_ich
drm_kms_helper        253952  1 i915
syscopyarea            16384  1 drm_kms_helper
sysfillrect            16384  1 drm_kms_helper
sysimgblt              16384  1 drm_kms_helper
fb_sys_fops            16384  1 drm_kms_helper
ttm                    65536  1 i915
agpgart                36864  1 ttm
video                  49152  1 i915
drm                   471040  7 drm_kms_helper,i915,ttm
backlight              20480  4 video,drm_kms_helper,i915,drm
button                 16384  0
igb                   176128  0
e1000e                184320  0
i2c_algo_bit           16384  2 igb,i915
admin@NI-IC-3172-01E67A74:~#

mihalicyn commented 2 years ago

# CONFIG_VETH is not set

This is the direct reason for:

Error (criu/libnetlink.c:55): -95 reported by netlink: Operation not supported
Error (criu/net.c:3225): Unable to create a veth pair: -95
Warn  (criu/net.c:3247): NSID isn't reported for network links

VETH support is not required, but I recommend compiling it as a module. It's a small, well-tested and fully safe module.
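To confirm how an option such as CONFIG_VETH is set before rebuilding, a saved plain-text kernel config (like the attached kernelconfig.txt) can be scanned directly. The following is only a minimal sketch: `config_option_state` and the `config_state` enum are hypothetical names made up for illustration, not part of CRIU or the kernel, and the helper assumes an uncompressed config file.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical result type (not from CRIU): how one option is configured. */
enum config_state {
    CONFIG_STATE_ERROR = -1,  /* config file could not be opened */
    CONFIG_STATE_UNSET = 0,   /* "# CONFIG_X is not set" or option absent */
    CONFIG_STATE_MODULE = 1,  /* CONFIG_X=m */
    CONFIG_STATE_BUILTIN = 2  /* CONFIG_X=y */
};

/*
 * Hypothetical helper: scan a plain-text kernel config at "path" for
 * "option" (for example "CONFIG_VETH") and report how it is set.
 */
enum config_state config_option_state(const char *path, const char *option)
{
    char line[512];
    size_t len = strlen(option);
    enum config_state state = CONFIG_STATE_UNSET;
    FILE *f = fopen(path, "r");

    if (!f)
        return CONFIG_STATE_ERROR;

    while (fgets(line, sizeof(line), f)) {
        /* Require "OPTION=" so CONFIG_VETH does not match CONFIG_VETH_FOO. */
        if (strncmp(line, option, len) != 0 || line[len] != '=')
            continue;
        if (line[len + 1] == 'y')
            state = CONFIG_STATE_BUILTIN;
        else if (line[len + 1] == 'm')
            state = CONFIG_STATE_MODULE;
        break;
    }

    fclose(f);
    return state;
}
```

Calling `config_option_state("kernelconfig.txt", "CONFIG_VETH")` on a config containing the line `# CONFIG_VETH is not set` falls through to `CONFIG_STATE_UNSET`, since commented-out lines never match the `OPTION=` prefix.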

Error (criu/arch/x86/kerndat.c:189): Continue after SIGSTOP.. Urr what?

This is really strange:

static int kdat_x86_has_ptrace_fpu_xsave_bug_child(void *arg)
{
    if (ptrace(PTRACE_TRACEME, 0, 0, 0)) {
        pr_perror("%d: ptrace(PTRACE_TRACEME) failed", getpid());
        _exit(1);
    }

    if (kill(getpid(), SIGSTOP))
        pr_perror("%d: failed to kill myself", getpid());

    pr_err("Continue after SIGSTOP.. Urr what?\n");
    _exit(1);
}

Looks like a real kernel issue (likely related to CONFIG_PREEMPT_RT=y). Honestly, this particular error does not prevent CRIU from working. It would be best to create a minimal reproducer and report the issue to the https://github.com/ni/linux.git kernel maintainers.

Possible minimal reproducer:

#define _GNU_SOURCE
#include <sched.h>          /* clone() and CLONE_* constants */
#include <signal.h>         /* kill(), SIGSTOP, SIGKILL, SIGCHLD */
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define ARRAY_SIZE(x)       (sizeof(x) / sizeof((x)[0]))
#define PAGE_SIZE 4096

static int bug_child(void *arg)
{
    if (ptrace(PTRACE_TRACEME, 0, 0, 0)) {
        printf("%d: ptrace(PTRACE_TRACEME) failed\n", getpid());
        _exit(1);
    }

    if (kill(getpid(), SIGSTOP))
        printf("%d: failed to kill myself\n", getpid());

    printf("Continue after SIGSTOP.. Urr what?\n");
    _exit(1);
}

int main(int argc, char **argv)
{
    char stack[PAGE_SIZE];
    int flags = CLONE_VM | CLONE_FILES | CLONE_UNTRACED | SIGCHLD;
    int ret = -1;
    pid_t child;
    int stat;

    child = clone(bug_child, stack + ARRAY_SIZE(stack), flags, 0);
    if (child < 0) {
        printf("%s(): failed to clone()\n", __func__);
        return -1;
    }

    if (waitpid(child, &stat, WUNTRACED) != child) {
        /*
         * waitpid() may end with ECHILD if SIGCHLD == SIG_IGN,
         * and the child has stopped already.
         */
        printf("Failed to wait for %s() test\n", __func__);
        goto out_kill;
    }

    if (!WIFSTOPPED(stat)) {
        printf("Born child is unstoppable! (might be dead)\n");
        goto out_kill;
    }

    ret = 0;

out_kill:
    if (kill(child, SIGKILL))
        printf("Failed to kill my own child\n");
    if (waitpid(child, &stat, 0) < 0)
        printf("Failed wait for a dead child\n");

    return ret;
}

Try compiling this as a separate program with `gcc -o checkme checkme.c` and then run `./checkme`.

ImreSzebelledi commented 2 years ago

Today I had time to build the kernel with CONFIG_VETH=y and also managed to build CRIU 3.16.1 on it. As you predicted, all the VETH-related errors went away, but the SIGSTOP error persisted. Unfortunately, I won't have access to the hardware over the weekend to try the reproducer, but I will do so on Monday. Thank you so much for all the help!
