abrt / retrace-server

Application for remote coredump analysis
GNU General Public License v2.0

Retracing vmcores in Podman fails #423

Open mgrabovsky opened 3 years ago

mgrabovsky commented 3 years ago

Interactively retracing vmcores in Podman fails with the message:

crash: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/4.18.0-80.1.2.el8_0.x86_64/vmlinux: No such file or directory

Usage:

  crash [OPTION]... NAMELIST MEMORY-IMAGE[@ADDRESS] (dumpfile form)
  crash [OPTION]... [NAMELIST]                  (live system form)

Enter "crash -h" for details.

Reported by @DaveWysochanskiRH.

DaveWysochanskiRH commented 2 years ago

There are a number of issues. I tried this again and ran into other problems, though I no longer see the above vmlinux error.

  1. Need rootless podman. This seems to be somewhat of a mess, but I think it works with the latest upstream.

  2. Copying vmcore into container is a problem due to size of vmcores (often 100GB or more).

  3. Need both local users and LDAP users to be able to use rootless podman. There are some issues with this depending on the version, but I think this is fixed upstream (see https://bugzilla.redhat.com/show_bug.cgi?id=2092629 and related bugs such as https://bugzilla.redhat.com/show_bug.cgi?id=2063750 and https://bugzilla.redhat.com/show_bug.cgi?id=2068088).

  4. Container storage should be set up for non-NFS use (see /etc/containers/storage.conf)

    • Temporary fix: Changed the "graphroot" and "rootless_storage_path" variables to point at a local filesystem directory and manually fixed permissions; "rootless_storage_path" is defined with "$USER" as a directory path component
  5. Issues with 'AuthGroup' where tasks would fail with the following error in the retrace_log

    [2022-10-11 10:50:01] [E] Task failed: Unable to build podman container: time="2022-10-11T10:50:01-04:00" level=error msg="running `/usr/bin/newuidmap 1155992 0 174 1 1 231072 65536`: newuidmap: Target process 123456 is owned by a different user: uid:111 pw_uid:111 st_uid:111, gid:5555 pw_gid:111 st_gid:5555\n"
    • Temporary fix: Change "AuthGroup" value from an LDAP group back to local group "retrace" (uid/gid == 111/111)
  6. Could not find base container image to build the container

    • Temporary fix: This patch fixed it for me:

      @@ -922,7 +922,7 @@ class RetraceWorker:

           try:
               with (savedir / RetraceTask.CONTAINERFILE).open("w") as cntfile:
      -            cntfile.write(f"FROM {distribution}:{version}\n\n")
      +            cntfile.write(f"FROM ubi{version}/ubi\n\n")
                   cntfile.write("RUN dnf "
                                 f"--releasever={version} "
                                 "--assumeyes "
  7. Could not obtain kernel-debuginfo package

    • This patch fixed it for me:

      @@ -931,7 +931,7 @@ class RetraceWorker:
                                 "shadow-utils && dnf clean all\n")
                   cntfile.write("RUN dnf "
                                 "--assumeyes "
      -                          "--enablerepo=debuginfo "
      +                          f"--enablerepo={distribution}-{version}-for-$(uname -m)-baseos-debug-rpms "
                                 "install kernel-debuginfo\n\n")
                   cntfile.write("RUN useradd --no-create-home --no-log-init retrace\n")
                   cntfile.write("RUN mkdir --parents /var/spool/abrt/crash\n\n")
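
To make the combined effect of the two patches in items 6 and 7 easier to see, here is a minimal sketch that renders only the two Containerfile lines they touch. The function name `render_containerfile_lines` and the sample values `"rhel"` and `"8"` are assumptions for illustration and are not taken from retrace-server; the intermediate `dnf` step is left out because the hunks above do not show it in full.

```python
def render_containerfile_lines(distribution: str, version: str) -> str:
    """Return the FROM line and the kernel-debuginfo RUN line with both patches applied."""
    # Item 6: build on the UBI base image instead of "{distribution}:{version}".
    from_line = f"FROM ubi{version}/ubi\n\n"
    # Item 7: enable the versioned debug repo instead of "debuginfo".
    # "$(uname -m)" stays literal here; the shell expands it when the
    # RUN step executes during the container build.
    debuginfo_run = (
        "RUN dnf "
        "--assumeyes "
        f"--enablerepo={distribution}-{version}-for-$(uname -m)-baseos-debug-rpms "
        "install kernel-debuginfo\n\n"
    )
    return from_line + debuginfo_run


if __name__ == "__main__":
    # Sample values only; the real ones come from the task being retraced.
    print(render_containerfile_lines("rhel", "8"), end="")
```

Printing the result before building makes it easy to check that the base image and the repo ID exist for the release in question.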

After all that I still get this in the log:

[2022-10-13 04:19:04] [E] time="2022-10-13T04:19:04-04:00" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"

DaveWysochanskiRH commented 2 years ago

  2. Copying vmcore into container is a problem due to size of vmcores (often 100GB or more).

I don't think we need to copy the vmcore. Instead we can use "-v" to bind mount the vmcore and vmlinux files, plus any other needed paths, into the container.
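
The bind-mount idea could look roughly like the sketch below, which only assembles a `podman run` command line and does not execute it. The image name, the example paths, and the helper `podman_run_argv` are hypothetical. Mounting read-only at the same path inside the container is an assumption, not something retrace-server is known to do. The `use_tty` switch is one possible way to avoid the "input device is not a TTY" warning when no terminal is attached.

```python
import shlex
import sys
from pathlib import Path
from typing import List


def podman_run_argv(image: str, vmcore: Path, vmlinux: Path, use_tty: bool) -> List[str]:
    """Build a podman command that bind mounts the vmcore and vmlinux instead of copying them."""
    argv = ["podman", "run", "--rm"]
    if use_tty:
        # Only request a terminal when one is actually attached.
        argv += ["--interactive", "--tty"]
    for host_path in (vmcore, vmlinux):
        # Same path inside and outside the container, mounted read-only.
        argv += ["--volume", f"{host_path}:{host_path}:ro"]
    # crash expects NAMELIST (vmlinux) first, then MEMORY-IMAGE (vmcore).
    argv += [image, "crash", str(vmlinux), str(vmcore)]
    return argv


if __name__ == "__main__":
    argv = podman_run_argv(
        "localhost/retrace-example",      # hypothetical image name
        Path("/cores/example/vmcore"),    # hypothetical paths
        Path("/cores/example/vmlinux"),
        use_tty=sys.stdin.isatty(),
    )
    print(shlex.join(argv))
```

Since nothing is copied, the size of the vmcore no longer affects how long the container takes to start.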

DaveWysochanskiRH commented 2 years ago
  5. Issues with 'AuthGroup' where tasks would fail with the following error in the retrace_log

    [2022-10-11 10:50:01] [E] Task failed: Unable to build podman container: time="2022-10-11T10:50:01-04:00" level=error msg="running `/usr/bin/newuidmap 1155992 0 174 1 1 231072 65536`: newuidmap: Target process 123456 is owned by a different user: uid:111 pw_uid:111 st_uid:111, gid:5555 pw_gid:111 st_gid:5555\n"

I had AuthGroup set in /etc/retrace-server/retrace-server.conf, which is why I got the above error. Changing the primary group of the 'retrace' user (recorded in /etc/passwd) as follows fixed it. I wonder if that should be a standard procedure for installs when AuthGroup is used?

# usermod -g my-auth-group retrace
# systemctl restart httpd
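
The mismatch behind the newuidmap error can be read off the message itself: the three uid fields agree (111), while gid and st_gid are 5555 and pw_gid is 111. The sketch below extracts those fields and reports which kind of ID disagrees. The helper names are hypothetical. The reading that gid is the calling process's group and pw_gid the primary group from the passwd entry is an assumption about newuidmap's output, not something confirmed in this thread.

```python
import re
from typing import Dict, List

# Matches the "name:number" pairs that newuidmap prints in this error.
FIELD_RE = re.compile(r"\b(uid|pw_uid|st_uid|gid|pw_gid|st_gid):(\d+)")


def parse_owner_fields(message: str) -> Dict[str, int]:
    """Extract the uid/gid fields from a newuidmap 'owned by a different user' message."""
    return {name: int(value) for name, value in FIELD_RE.findall(message)}


def mismatched_kinds(fields: Dict[str, int]) -> List[str]:
    """Return 'uid' and/or 'gid' when the three related values are not all equal."""
    mismatches = []
    for kind in ("uid", "gid"):
        values = {fields[kind], fields[f"pw_{kind}"], fields[f"st_{kind}"]}
        if len(values) > 1:
            mismatches.append(kind)
    return mismatches


if __name__ == "__main__":
    message = (
        "newuidmap: Target process 123456 is owned by a different user: "
        "uid:111 pw_uid:111 st_uid:111, gid:5555 pw_gid:111 st_gid:5555"
    )
    print(mismatched_kinds(parse_owner_fields(message)))
```

For the message quoted above, only the gid values disagree, which fits the observation that changing the primary group of the 'retrace' user made the error go away.
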
DaveWysochanskiRH commented 2 years ago

After the above I'm getting the following error:

[2022-10-17 13:11:52] [E] Task failed: Unable to build podman container: Error: failed to mount overlay for metacopy check with "" options: permission denied