podman exec into a "-it" container: container create failed (no logs from conmon): EOF

edsantiago commented 3 years ago

Common thread seems to be:

Running: podman [options] run -dti --name test1 quay.io/libpod/fedora-minimal:latest sleep +Inf
time="2021-06-16T19:33:53-05:00" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"
99e3b419a97aa408a4d0d3072bbd00579d5edd7c97790aa06d61f233cfdc1b4c
Running: podman [options] exec -ti test1 true
Running: podman [options] exec -ti test1 true       ! sometimes it fails on the first, sometimes on the third
Error: container create failed (no logs from conmon): EOF

Podman exec [It] podman exec terminal doesn't hang

fedora-34 : int podman fedora-34 root container
- PR #10214
- 05-05 10:20
fedora-34 : int podman fedora-34 root host
- PR #10860
- 07-05 09:08
- 07-05 09:08
- 07-05 09:08
- PR #10688
- 06-16 20:43
gce_instance:fedora : int podman fedora-33 root host
- PR #9820
- 03-25 18:36
- 03-25 18:36
- 03-25 18:36
- PR #9477
- 03-02 15:17
- 03-02 15:17
- 03-02 15:17
instance:GCEInstance : int podman fedora-33 root host
- PR #9969
- 04-09 14:52
- 04-09 14:52
- 04-09 14:52
ubuntu-2104 : int podman ubuntu-2104 rootless host
- PR #10631
- 06-10 04:18

And also just now in a still-live PR (my flake-xref does not handle live PRs): int podman ubuntu-2104 root host

Note: the March and April logs above have been garbagecollected, I can't confirm that the error is the same one. I'm leaving them in the report deliberately, in case it helps to have a timestamp for the start of this flake (i.e. it might not be new in June).

Edit: this is podman, not podman-remote, so it's unlikely to be the same as #7360

edsantiago commented 3 years ago

Podman exec [It] podman exec terminal doesn't hang

fedora-34 : int podman fedora-34 root host
- PR #11101
- 08-02 18:00
ubuntu-2104 : int podman ubuntu-2104 root host
- PR #10992
- 07-20 09:29

edsantiago commented 3 years ago

Podman exec [It] podman exec terminal doesn't hang

fedora-34 : int podman fedora-34 root host
- PR #11219
- 08-12 14:49
- 08-12 14:49

edsantiago commented 3 years ago

Hmmm, I wonder if this is the same problem, in a different test? Looks suspiciously close.

  podman network connect
Running: podman [options] exec -it test ip addr show eth1
Error: container create failed (no logs from conmon): EOF

Podman network connect and disconnect [It] podman network connect

fedora-34 : int podman fedora-34 root host
- PR #11215
- 08-12 12:33

edsantiago commented 3 years ago

Another one, in yet another test. Looks like this is happening more often than I thought, because it happens in multiple tests:

Podman exec [It] podman exec --detach

fedora-34 : int podman fedora-34 root host
- PR #11215
- 08-12 12:33

github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.

edsantiago commented 3 years ago

Podman exec [It] podman exec terminal doesn't hang

fedora-34 : int podman fedora-34 root host
- PR #11556
- 09-14 12:36
- 09-14 12:36
- 09-14 12:36
fedora-34 : int remote fedora-34 root host [remote]
- PR #11606
- 09-16 15:05
ubuntu-2104 : int podman ubuntu-2104 root host
- PR #11655
- 09-20 09:14
- PR #11402
- 09-07 05:10
- 09-07 05:10
- 09-07 05:10

Podman network connect and disconnect [It] podman network connect when not running

ubuntu-2104 : int podman ubuntu-2104 root host
- PR #11655
- 09-20 09:14
- 09-20 09:14
- 09-20 09:14

edsantiago commented 2 years ago

Podman network connect and disconnect [It] podman network disconnect and run with network ID

fedora-34 : int podman fedora-34 root host
- PR #11834
- 10-01 16:37

edsantiago commented 2 years ago

Podman exec [It] podman exec terminal doesn't hang

fedora-34 : int podman fedora-34 root host
- PR #11609
- 09-24 11:48
fedora-34 : int remote fedora-34 root host [remote]
- PR #11957
- 10-17 22:43
ubuntu-2104 : int podman ubuntu-2104 root host
- PR #11655
- 09-20 09:14

edsantiago commented 2 years ago

Still seeing this. int remote fedora-35 root

vrothberg commented 2 years ago

I'll take a stab at it. Thanks for assembling the data, @edsantiago!

vrothberg commented 2 years ago

while true; do                                                                                        
        ./bin/podman run --name=test --replace -dti quay.io/libpod/fedora-minimal:latest sleep +Inf   
        ./bin/podman exec test true                                                                   
        ./bin/podman rm -f -t0 test                                                                   
done

Ran over 30 minutes but no failure. I'll have a look at the code; maybe I can come up with a theory but a reproducer would be great.

edsantiago commented 2 years ago

I can't reproduce on my laptop either, but on a 1minutetip f34 VM it fails instantly, on the very first try:

# podman run -dti --name=test quay.io/libpod/fedora-minimal:latest sleep 20;podman exec -it test true
8ed6f60c9a8e38d2081ece7a5471cc1a931f402170a9b0ff8f149bffb434994b
Error: container create failed (no logs from conmon): EOF

After that first time it still fails, but only once in 4-5 times. Note that it fails even without < /dev/null on either podman command.

podman-3.4.1-1.fc34.x86_64 conmon-2.0.30-2.fc34.x86_64

edsantiago commented 2 years ago

One more note: I think the -it is needed on exec. Without it, I can't reproduce the failure.

rhatdan commented 2 years ago

mheon PTAL

rhatdan commented 2 years ago

One would think this is a race between podman run creating the container and launching conmon, and podman exec gets to talk to conmon before it knows there is a container,causing some issues.

edsantiago commented 2 years ago

Well, except that it's not always the first exec. This log shows the first three execs working, then it fails on the fourth.

mheon commented 2 years ago

Very difficult to track this down without a repro - we need to know what's going on with Conmon such that it's blowing up (personally I think Conmon is probably either segfaulting or just printing the error to the journal and exiting without reporting the real error to Podman). Might be logs in the journal that will help us?

mheon commented 2 years ago

@rhatdan It's not actually container create that's failing, that's a bad error message. We're trying to make a Conmon for the exec session but Conmon is failing with no logs as to why.

edsantiago commented 2 years ago

@mheon see my 1minutetip f34 VM comment above. It reproduces reliably.

edsantiago commented 2 years ago

Here's one in the brand-new ubuntu-2110

edsantiago commented 2 years ago

Podman network connect and disconnect [It] podman network disconnect when not running

ubuntu-2110 : int podman ubuntu-2110 root host
- PR #12305
- 11-20 21:59
- 11-20 21:59
- 11-20 21:59
- PR #11795
- 11-18 14:46
ubuntu-2110 : int remote ubuntu-2110 root host [remote]
- PR #12281
- 11-16 09:43

Podman network connect and disconnect [It] podman network disconnect

fedora-34 : int remote fedora-34 root host [remote]
- PR #12348
- 11-18 12:25
ubuntu-2110 : int podman ubuntu-2110 root host
- PR #12305
- 11-20 21:59
ubuntu-2110 : int podman ubuntu-2110 rootless host
- PR #12256
- 11-11 14:59

edsantiago commented 2 years ago

Podman exec [It] podman exec terminal doesn't hang

ubuntu-2110 : int podman ubuntu-2110 root host
- PR #12449
- 11-30 09:31
- 11-30 09:31
- 11-30 09:31
ubuntu-2110 : int remote ubuntu-2110 root host [remote]
- PR #12301
- 11-21 20:38
- 11-21 20:38
- 11-21 20:38

Podman network connect and disconnect [It] podman network disconnect

fedora-34 : int remote fedora-34 root host [remote]
- PR #12380
- 11-23 04:26
ubuntu-2110 : int podman ubuntu-2110 root host
- PR #12305
- 11-20 21:59

edsantiago commented 2 years ago

Fresh one in ubuntu 2110 root. Curious thing: once it happens one time, it seems to happen on a bunch more tests afterward.

edsantiago commented 2 years ago

Here's one where it fails with bad exit code, but the conmon error isn't present:

# podman [options] run -dti --name test1 registry.fedoraproject.org/fedora-minimal:34 sleep +Inf
time="2021-12-08T15:27:00Z" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"
ce72bce58b4ef3d0215bc5d805594b94f8ae18e1eee558471358f6a682846df3
# podman [options] exec -ti test1 true
# podman [options] exec -ti test1 true       <--- this is the one that seems to fail
...
         ? Failure [4.220 seconds]
         Podman exec
         /var/tmp/go/src/github.com/containers/podman/test/e2e/exec_test.go:16
           podman exec terminal doesn't hang [It]
           /var/tmp/go/src/github.com/containers/podman/test/e2e/exec_test.go:334

           Expected
               <int>: 129
           to match exit code:
               <int>: 0

Podman exec [It] podman exec terminal doesn't hang

fedora-34 : int podman fedora-34 root container
- PR #12541
- 12-08 10:42

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

rhatdan commented 2 years ago

@edsantiago is this still an issue?

edsantiago commented 2 years ago

Last seen 12-21:

Podman init containers [It] podman ensure always init containers always run

fedora-35 : int podman fedora-35 root host
- PR #12662
- 12-21 03:49
- 12-21 03:49
- 12-21 02:32

Podman network connect and disconnect [It] podman network connect and run with network ID

fedora-35 : int podman fedora-35 root host
- PR #12659
- 12-20 12:05
ubuntu-2104 : int remote ubuntu-2104 root host [remote]
- PR #12602
- 12-16 08:25
- PR #12592
- 12-15 10:13

Maybe Santa's elves fixed it over break. Or maybe our CI use has been low due to so many of us on PTO. (Since you removed the stale-issue tag, I'm pretty sure your guess is the same as mine).

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

edsantiago commented 2 years ago

Podman exec [It] podman exec terminal doesn't hang

ubuntu-2110 : int podman ubuntu-2110 root host
- PR #12966
- 01-25 09:11

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

edsantiago commented 2 years ago

Still active. int remote ubuntu-2110 root on April 12. Here's the newly-improved logging, hope it helps:

         # podman-remote [options] exec -it test cat /etc/resolv.conf
         Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...
         time="2022-04-12T07:40:52Z" level=error msg="error attaching to container 4e927437ce760730eb3f42a5ae787ce6395f1bd2fe91c7f60cd1efe885f2a050 exec session 7d0bfd1f752051b6bfd9ca9e250976bf9518aec9167d829c87bfbf8b4ee944bb: container create failed (no logs from conmon): conmon bytes \"\": readObjectStart: expect { or n, but found \x00, error found in #0 byte of ...||..., bigger context ...||..."
         output: Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...

mgoltzsche commented 2 years ago

I am getting the same error using podman 4.1.1 with crun (*-minimal image) on an arm64 raspbian:

$ sudo podman run --privileged -u podman:podman mgoltzsche/podman:4.1.1-minimal podman run alpine:latest echo hello
Resolved "mgoltzsche/podman" as an alias (/var/cache/containers/short-name-aliases.conf)
Trying to pull docker.io/mgoltzsche/podman:4.1.1-minimal...
Getting image source signatures
Copying blob 37691f58d9e4 skipped: already exists  
Copying blob 1ffd8199415c skipped: already exists  
Copying blob 135bc9a38332 skipped: already exists  
Copying blob aa1665137377 skipped: already exists  
Copying blob 792d06fe73d3 skipped: already exists  
Copying blob f7cfdf956c91 skipped: already exists  
Copying blob 579d79910bba skipped: already exists  
Copying blob 122acc346952 skipped: already exists  
Copying blob e6381dee05c9 skipped: already exists  
Copying blob 276ea4b7fe87 skipped: already exists  
Copying blob a189e2ed98fc skipped: already exists  
Copying blob 9981e73032c8 skipped: already exists  
Copying config 4e5c5c0967 done  
Writing manifest to image destination
Storing signatures
WARN[0003] Failed to add conmon to cgroupfs sandbox cgroup: error creating cgroup for memory: mkdir /sys/fs/cgroup/memory: read-only file system 
time="2022-06-23T21:08:43Z" level=warning msg="\"/\" is not a shared mount, this could cause issues or missing mounts with rootless containers"
Resolving "alpine" using unqualified-search registries (/etc/containers/registries.conf)
Trying to pull docker.io/library/alpine:latest...
Getting image source signatures
Copying blob sha256:b3c136eddcbf2003d3180787cef00f39d46b9fd9e4623178282ad6a8d63ad3b0
Copying blob sha256:b3c136eddcbf2003d3180787cef00f39d46b9fd9e4623178282ad6a8d63ad3b0
Copying config sha256:6e30ab57aeeef1ebca8ac5a6ea05b5dd39d54990be94e7be18bb969a02d10a3f
Writing manifest to image destination
Storing signatures
time="2022-06-23T21:08:47Z" level=error msg="Removing container 49951053911c1938d3b5785f3b18b825a1994c8ca62950f233f44920ca554289 from runtime after creation failed"
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...

Though in my case it seems to be caused by the fact that the crun binary is broken (my bad: the arm64 build is broken apparently):

$ sudo podman run --privileged -u podman:podman mgoltzsche/podman:4.1.1-minimal podman info
...
time="2022-06-23T21:15:14Z" level=error msg="Getting info on OCI runtime crun: error getting version of OCI runtime crun: `/usr/local/bin/crun --version` failed:   (fork/exec /usr/local/bin/crun: exec format error)"
...

@edsantiago you could also inspect your podman info output since it may indicate the cause of the error (that is apparently silently skipped when running a podman container).

edsantiago commented 2 years ago

@mgoltzsche thanks for the followup. It looks like your situation is a hard failure; the bug reported here is a flake, i.e., it fails unpredictably on a system that otherwise works fine. Hope you're able to get a working crun.

edsantiago commented 2 years ago

Here's a fresh one:

# podman [options] exec -ti test1 true
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...

Podman exec [It] podman exec terminal doesn't hang

fedora-36 : int podman fedora-36 root host
- PR #14400
- 06-28 09:47

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

edsantiago commented 1 year ago

Seen just now in ubuntu 2110, on a v4.2.0-rhel backport. This is the first instance (in my logs) since June.

edsantiago commented 1 year ago

Seen this week, f36 rootless:

$ podman [options] exec -it test cat /etc/resolv.conf
Error: container create failed (no logs from conmon): 
    conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...

This is a PR into main, not some ancient side branch.

@Luap99 you instrumented the code (#13637) to emit a better error message. Can you PTAL and see if this gives you anything to go on?

Luap99 commented 1 year ago

conmon bytes ""

It gets absolutely nothing back, is it possible that conmon crashes without podman realizing it? Do we have access to the system journal?

edsantiago commented 1 year ago

Yes.

Instructions for reference:

Click the link to annotated logs
- Press Home key to go to top
- Click the Task link, this takes you to Cirrus
  - Click the red accordion, to hide the error log, it's useless here
  - Scroll down a little, to Run journal, click that. It's an accordion, will expand.

Good luck.

Luap99 commented 1 year ago

I don't see anything specific in there. The only way I see this can happen is when the syncpipe fd is closed (either conmon just closes it or it crashes)

edsantiago commented 1 year ago

Here is a complete log -- I discovered too late that the one in Cirrus is incomplete.

The error is logged at 560s, and the starting time (at the top of the error log) is 12:22:19, making 12:31:39 our upper bound (log times will be askew due to buffering). I found 12:31:39 in the log and scrolled up, and see a lot of red/orange errors/warnings, but all of them seem to be auth-related: the podman-login tests must be running at the same time. (If your log isn't colorized, you might want my greasemonkey helper). I scrolled all the way up and could not find any red/orange that seemed applicable to this error.

Searching in-page for 'conmon ' (conmon, space) I see a handful of errors: one oom_adj, one container exited, but all are well after what I think is our upper bound.

I'm out of ideas, but maybe you can download the log and apply some grep or other filters to it. HTH. And sorry for pointing to the truncated log.

Luap99 commented 1 year ago

Yeah I already got that and looked at full log. It is hard to parse because tests run in parallel and podman exec unlike podman run does not log the output in the journal + we do not have to container id for the exec session that we would find in the logs.

edsantiago commented 1 year ago

New one

fedora-36 : int podman fedora-36 rootless host
- PR #16465
- 11-09 16:41
- 11-09 16:41
- 11-09 16:41

Link to full journal.log, and this time I see a conmon crash:

Nov 09 15:38:54 cirrus-task-6141700557504512 conmon[203653]: conmon d8dee386c836bedcb099 <ndebug>: container PID: 203656
Nov 09 15:38:54 cirrus-task-6141700557504512 conmon[203674]: conmon 6752f6b1ecc96e5dca5a <ndebug>: failed to write to /proc/self/oom_score_adj: Permission denied
Nov 09 15:38:54 cirrus-task-6141700557504512 systemd[3624]: Started libpod-conmon-6752f6b1ecc96e5dca5a50731b7b1a31a69c678ba67824db7b2c3cd578ed1371.scope.
Nov 09 15:38:54 cirrus-task-6141700557504512 conmon[203675]: conmon 6752f6b1ecc96e5dca5a <ninfo>: addr{sun_family=AF_UNIX, sun_path=/proc/self/fd/12/attach}
Nov 09 15:38:54 cirrus-task-6141700557504512 conmon[203675]: conmon 6752f6b1ecc96e5dca5a <ninfo>: terminal_ctrl_fd: 12
Nov 09 15:38:54 cirrus-task-6141700557504512 conmon[203675]: conmon 6752f6b1ecc96e5dca5a <ninfo>: winsz read side: 15, winsz write side: 15
Nov 09 15:38:54 cirrus-task-6141700557504512 systemd[3624]: Started libpod-6752f6b1ecc96e5dca5a50731b7b1a31a69c678ba67824db7b2c3cd578ed1371.scope - libcrun container.
Nov 09 15:38:54 cirrus-task-6141700557504512 conmon[203675]: conmon 6752f6b1ecc96e5dca5a <ndebug>: container PID: 203678
....

Nov 09 15:38:55 cirrus-task-6141700557504512 systemd[3624]: Started podman-203718.scope.
Nov 09 15:38:55 cirrus-task-6141700557504512 conmon[203675]: conmon 6752f6b1ecc96e5dca5a <ninfo>: container 203678 exited with status 137
.....
Nov 09 15:38:55 cirrus-task-6141700557504512 conmon[203653]: conmon d8dee386c836bedcb099 <ninfo>: container 203656 exited with status 137

edsantiago commented 1 year ago

Seen in f37 root. System tests, so it should theoretically be a linear flow, but I see no useful conmon entries in the journal log.

# podman exec -it 2acb8108c6998af777a294791f0d5d6f7522b07203a74cfef7d6298d57d0fc8d cat /sys/fs/cgroup/io.max
/usr/lib/bats-core/test_functions.bash: line 259: warning: command substitution: ignored null byte in input
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...
[ rc=255 (** EXPECTED 0 **) ]

edsantiago commented 1 year ago

Seen yesterday in int f37 remote root:

*   podman make sure init container runs before pod containers
...
# podman-remote [options] exec -it 9fb507a8d4a19fce7edc66f75b52b7158a79d6607e48af07e5de6204d8707734 cat /dev/shm/vfWRJSEwjuoA
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...
time="2023-01-10T08:24:44-06:00" level=error msg="attaching to container 9fb507a8d4a19fce7edc66f75b52b7158a79d6607e48af07e5de6204d8707734 exec session ad224f3b5b29a5b03754c50aa2fd35204880e9af88fb574d7f585c26ee4ce174: container create failed (no logs from conmon): conmon bytes \"\": readObjectStart: expect { or n, but found \x00, error found in #0 byte of ...||..., bigger context ...||..."

edsantiago commented 1 year ago

f37 rootless, with sqlite

* podman make sure once container is removed
...
$ podman [options] --db-backend sqlite --storage-driver vfs exec -it 006d464fe96e67248686ecb76188815dd40e7aecd3f3d8d528d5d21ad4c2624e cat /dev/shm/neZCOyWxoITQ
Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...

edsantiago commented 1 year ago

f36 rootless, sqlite

edsantiago commented 1 year ago

And another f37 rootless sqlite. Seeing this one a lot in sqlite logs; but it could just be availability bias because I'm downloading full logs.

containers / podman