containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

pkg/machine/e2e: improve timeout handling #23162

Closed Luap99 closed 2 months ago

Luap99 commented 3 months ago

There seems to be a very bad flake where podman machine init takes too long. It doesn't seem to be deadlocked; it just has to copy the entire image, which is super slow in CI for some reason.

As such, increase the timeout by 1m to 5m total. Hopefully this is good enough for now. Second, in case we still hit a timeout, send SIGABRT, which makes Go dump the stack traces so we know whether it is doing something or hanging on a lock.
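For illustration, one way to get SIGABRT-on-timeout behavior in a Gomega-based e2e suite is a lazy failure description: a func() string passed to the assertion only runs if the assertion fails. A minimal sketch assuming gexec sessions; the helper name is illustrative, not the exact code in this PR:

```go
package e2esketch

import (
	"syscall"
	"time"

	. "github.com/onsi/gomega"
	"github.com/onsi/gomega/gexec"
)

// waitOrAbort waits up to timeout for the session's process to exit.
// If the assertion fails, the func() string description runs first and
// sends SIGABRT, making the Go runtime dump all goroutine stacks so we
// can tell whether the process is doing work or stuck on a lock.
func waitOrAbort(session *gexec.Session, timeout time.Duration) {
	Eventually(session, timeout).Should(gexec.Exit(0), func() string {
		_ = session.Command.Process.Signal(syscall.SIGABRT)
		return "timed out; sent SIGABRT to capture a stack dump"
	})
}
```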

Does this PR introduce a user-facing change?

None
openshift-ci[bot] commented 3 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/containers/podman/blob/main/OWNERS)~~ [Luap99]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
rhatdan commented 3 months ago

LGTM

Luap99 commented 3 months ago

@edsantiago @baude @ashley-cui PTAL

Let's see how this goes. I am not sure why all of a sudden the flake (https://github.com/containers/podman/issues/22551) is so much worse (maybe https://github.com/containers/podman/pull/23068?), but from my limited debugging it really looks like it is slow to copy the VM file on init, and there is no hang/lock or anything. So let's hope the extra minute is enough.

ashley-cui commented 3 months ago

LGTM. Wonder why the other PR made it worse..

Luap99 commented 3 months ago

> LGTM. Wonder why the other PR made it worse..

Yeah, I don't understand it either. Maybe they are located on different disks/volumes, which makes the copy slower, or something like that?

edsantiago commented 3 months ago

Doesn't seem to help

Luap99 commented 3 months ago

Yeah, really strange... Maybe it's not the slow IO after all. I have another idea that I can try.

Luap99 commented 3 months ago

/hold

Luap99 commented 3 months ago
    // Interrupts the bar goroutine. It's important that
    // bar.Abort(false) is called before p.Wait(), otherwise
    // can hang.

Found this comment in the progress bar handling code, so I pushed a commit to drop the progress bar on decompress. No idea if this is actually the problem, but worth a try.
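For reference, that ordering constraint comes from the mpb progress-bar library; a minimal sketch of the pattern with github.com/vbauerster/mpb/v8 (the actual decompress wiring in podman differs):

```go
package main

import (
	"io"
	"os"

	"github.com/vbauerster/mpb/v8"
)

// copyWithBar copies src to dst while rendering a progress bar.
func copyWithBar(dst io.Writer, src io.Reader, size int64) error {
	p := mpb.New()
	bar := p.AddBar(size)
	// Wrap the source so every read advances the bar.
	proxy := bar.ProxyReader(src)
	defer proxy.Close()

	_, err := io.Copy(dst, proxy)
	// Per the comment quoted above: abort the bar before p.Wait(),
	// otherwise Wait() can block forever if the bar never reaches
	// its total (e.g. short input or an error mid-copy).
	bar.Abort(false)
	p.Wait()
	return err
}

func main() {
	_ = copyWithBar(io.Discard, os.Stdin, 1<<20)
}
```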

edsantiago commented 3 months ago

Sigh. No luck.

Luap99 commented 3 months ago

> Sigh. No luck.

Looks like it worked, though? No Mac flake; it seems to have failed on Linux on start with an entirely different stack trace.

Of course I might just be lucky on this run, so I will repush again and see if it passes again.

Luap99 commented 3 months ago
<+     > # # podman events --since=the dawn of time...ish
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: exit code is 0; expected 125
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         # # [teardown]
         #

That is a new one to me, https://api.cirrus-ci.com/v1/artifact/task/6224808554266624/html/sys-podman-fedora-39-root-host-boltdb.log.html

But looks like all machine tests passed on the first try so that is good.

edsantiago commented 3 months ago

That looks ... impossible? The --filter code is deterministic, isn't it?

Luap99 commented 3 months ago

> That looks ... impossible? The --filter code is deterministic, isn't it?

Until I saw this I would have said yes; now, not so much... Anyhow, feel free to file a new flake report for that. It shouldn't be too difficult to follow the code there.

For now I'll focus on the machine madness, and this looks good. I will repush one last time to be sure, and even lower the timeout a bit.

edsantiago commented 3 months ago

I've reproduced it (the --since bug). Am gathering data, will file an issue.

baude commented 3 months ago

I'm OK merging, but I would like us to consider fixing the root issue with the progress bar so we have a consistent user experience.

/lgtm

ashley-cui commented 3 months ago

LGTM

mtrmac commented 3 months ago

(drive-by:) If it is a hang in the progress bar code, is it possible to collect a backtrace?

Luap99 commented 3 months ago

> (drive-by:) If it is a hang in the progress bar code, is it possible to collect a backtrace?

Sure, I have one; scroll below the failure: https://api.cirrus-ci.com/v1/artifact/task/5471682549252096/html/machine-mac-podman-darwin-rootless-host-sqlite.log.html#t--podman-machine-rm-Remove-machine--1

It was not obvious to me where exactly it hangs there. The main routine seems to be stuck in syscall.Write(...), so I originally assumed it hangs writing the file, but you can also see the mpb progress goroutines at the bottom, so I simply tried removing them, and that has now worked two times without the flake.

mtrmac commented 3 months ago

Thanks!

I don’t think the progress bar is (directly?) at fault here — it is “legitimately” waiting for the copy to finish. (Unless, I guess, the bar.ProxyReader were returning some wildly weird data, but that’s fairly simple code; or maybe it changes the code path / data flow in io.Copy. Anyway…)

… It seems vanishingly unlikely to me that the backtrace would hit a 1-byte write, after writing all of the image … unless the destination of the decompression were on a filesystem that didn’t actually support sparse files, causing that “1-byte” write to write many megabytes/gigabytes of data. Might that be the case? Or, wildly unlikely guess, is the destination a memory/tmpfs filesystem thrashing because it is running out of memory?

Luap99 commented 3 months ago

Well, never mind then; looks like I was just lucky two times... It now failed again the same way, so it is not related to the progress bar, I guess.

I don't know what the underlying filesystem on our macOS VMs is, but AFAIK it is some network filesystem. Maybe @cevich knows how this works?

Luap99 commented 3 months ago

> … unless the destination of the decompression were on a filesystem that didn’t actually support sparse files, causing that “1-byte” write to write many megabytes/gigabytes of data. Might that be the case?

That sounds like the most likely scenario to me. But then again, other commands in the log are "fast", i.e. there are tests that run podman machine init in under 15s. So why is the copy so slow only sometimes? And I had the timeout up to 5 mins without success.

Luap99 commented 3 months ago

/hold

There does not seem to be a point in merging this, as it does not fix the flake.

ashley-cui commented 3 months ago

FWIW I think I've hit some type of pull timeout pulling directly from a URL. Unsure if it's related, but maybe worth bringing up?

On a Mac, this hangs for me, but downloading the image manually and then pulling from the downloaded file does not:

    podman machine init --image https://fedorapeople.org/groups/podman/testing/podman-machine-daily.aarch64.applehv.raw.zst
mtrmac commented 3 months ago

> But then again, other commands in the log are "fast", i.e. there are tests that run podman machine init in under 15s. So why is the copy so slow only sometimes?

FWIW, historically we’ve had trouble on cloud systems where a VM has an allocated number of I/O operations (per second? per time period?), and things were going fast until the VM hit a limit, and suddenly things were much slower. But 15s vs. 5m does sound way too extreme.

cevich commented 3 months ago

> FWIW, historically we’ve had trouble on cloud systems where a VM has an allocated number of I/O operations (per second? per time period?), and things were going fast until the VM hit a limit, and suddenly things were much slower. But 15s vs. 5m does sound way too extreme.

The macOS systems are an exception: they're using the (really fast) local SSD instead of SAN storage. In fact, it's more likely that networking bandwidth is the limitation here, since that's restricted by policy.

Tossing this out as an idea: what if the tests always used a local image? It shouldn't be too difficult for me to craft a cron(?) job (on each Mac) that simply curls the latest image and (atomically) sticks it into a directory with a uniform name. Then if pulling ever fails, the tests can at least (maybe?) continue with the previously curled image.
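The "(atomically) sticks it into a directory" part is usually done by downloading to a temporary file on the same volume and renaming it into place, since the rename is atomic and readers never see a partial image. A hedged Go sketch of that pattern (the URL and paths are made up for illustration):

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"path/filepath"
)

func fetchAtomically(url, dest string) error {
	// Download into a temp file in the destination directory so the
	// final rename stays on one volume and is therefore atomic.
	tmp, err := os.CreateTemp(filepath.Dir(dest), ".image-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op once the rename succeeds

	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}

	if _, err := io.Copy(tmp, resp.Body); err != nil {
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), dest)
}

func main() {
	// Hypothetical paths; a real job would point at the daily image URL.
	if err := fetchAtomically("https://example.com/machine.raw.zst",
		"/private/tmp/ci/machine.raw.zst"); err != nil {
		log.Fatal(err)
	}
}
```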

ashley-cui commented 3 months ago

> Tossing this out as an idea: what if the tests always used a local image? It shouldn't be too difficult for me to craft a cron(?) job (on each Mac) that simply curls the latest image and (atomically) sticks it into a directory with a uniform name. Then if pulling ever fails, the tests can at least (maybe?) continue with the previously curled image.

I don't think it's the pull from the network that's failing, as the pull from the network is only done once, at the beginning of the suite. The issue, I think, is the pull from that cached image that doesn't finish copying.

cevich commented 3 months ago

> the pull from that cached image that doesn't finish copying.

That's curious. Any chance we're running out of space or something silly like that? Can we make the test or podman more noisy about the exact error?

Edit: Update, just looked on the Macs; it doesn't appear to be space. The smallest available device is /System/Volumes/VM with ~78 GB on every system I looked at. But this problem afflicts Windows too, no? :thinking:
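On the "more noisy" point: one cheap option would be for the test to log available space at the destination before copying. A sketch using golang.org/x/sys/unix (the path is the CI tempdir discussed below):

```go
package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// Report available space on the volume backing the CI tempdir.
	var st unix.Statfs_t
	if err := unix.Statfs("/private/tmp/ci", &st); err != nil {
		log.Fatal(err)
	}
	avail := uint64(st.Bavail) * uint64(st.Bsize)
	fmt.Printf("available: %.1f GiB\n", float64(avail)/(1<<30))
}
```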

Luap99 commented 3 months ago

It is not related to networking; as @ashley-cui said, it just extracts a local file.

The hang here in the stack trace is very consistent. It is always in sparseWriter.Close(), which writes the remaining zeros to the file, and the trace shows it being in the write syscall. Which means I find @mtrmac's theory the most likely: that the last write causes the file system to actually write all the zeros out instead of making a sparse file. That said, if the CI systems are using local SSD storage, that should be more than fast enough to write the file even in non-sparse mode within the 3/4/5 mins that were tried. And we see other tests doing the same machine init complete in less than 15s, so we can assume the IO is more than fast enough for our use cases.

If we ran out of space, I would assume write would return an error and not block forever?
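For readers unfamiliar with that code: a sparse writer of this shape buffers runs of zeros instead of writing them, seeks past them when real data arrives, and flushes the final run in Close() by seeking and writing one trailing byte, which is exactly the write the stack traces show blocking. A simplified, hypothetical sketch of the structure (not podman's actual implementation; pendingZeroes matches the debug logs below):

```go
package sparsesketch

import (
	"bytes"
	"io"
	"os"
)

// sparseWriter skips runs of zeros by seeking instead of writing them.
type sparseWriter struct {
	file          *os.File
	pendingZeroes int64 // zeros seen but not yet turned into a hole
}

func (sw *sparseWriter) Write(p []byte) (int, error) {
	if bytes.Count(p, []byte{0}) == len(p) {
		// All zeros: defer them; a later seek turns them into a hole.
		sw.pendingZeroes += int64(len(p))
		return len(p), nil
	}
	// Real data: seek past any accumulated zeros, then write.
	if sw.pendingZeroes > 0 {
		if _, err := sw.file.Seek(sw.pendingZeroes, io.SeekCurrent); err != nil {
			return 0, err
		}
		sw.pendingZeroes = 0
	}
	return sw.file.Write(p)
}

func (sw *sparseWriter) Close() error {
	if sw.pendingZeroes > 0 {
		// Seek to the last byte and write a single zero so the file
		// reaches its full size. On a filesystem without sparse-file
		// support, this one write materializes all the deferred zeros.
		if _, err := sw.file.Seek(sw.pendingZeroes-1, io.SeekCurrent); err != nil {
			return err
		}
		if _, err := sw.file.Write([]byte{0}); err != nil {
			return err
		}
	}
	return sw.file.Close()
}
```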

Luap99 commented 3 months ago
 time="2024-07-03T09:10:20Z" level=error msg="sw.pendingZeroes: 6341788068"
  time="2024-07-03T09:10:20Z" level=error msg="sw.pendingZeroes after hole: 1"

These are the sizes in Close(), and they are consistent across all commands, working and hanging (which is expected, as the decompression must be deterministic). As such I would assume it is not related to the sparseWriter code but rather something broken in the darwin kernel/fs that is blocking the write.

And 6341788068 bytes is about 6.3 GB, so even if the fs cannot do sparse files, an SSD should be able to write 6 GB in less than a minute, certainly faster than the 3, 4 or 5 min timeouts that always failed.

Luap99 commented 3 months ago
   6:                APFS Volume ci_tempdir              8.3 GB     disk3s7
   7:                APFS Volume ci_homedir              4.9 GB     disk3s8

All the images are using the tmpdir. @cevich, just to rule out the space theory, can we increase the volume sizes for the worker pool?

cevich commented 3 months ago

> All the images are using the tmpdir. @cevich, just to rule out the space theory, can we increase the volume sizes for the worker pool?

My (maybe incorrect) understanding is that all the volumes share space and should expand as needed. Perhaps this is where things are breaking? Either way, I can look into it. What's the path where the file is being written?

Edit: Ya, volumes share the space of their "container". I'll still go in and see what I can see.

Luap99 commented 3 months ago

All the files are written under the tmpdir (/private/tmp/ci/...)

cevich commented 3 months ago

For reference, this is what I see currently on one of the Macs:

MacM1-3:~ ec2-user$ sudo diskutil list
/dev/disk0 (internal, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *251.0 GB   disk0
   1:             Apple_APFS_ISC Container disk1         524.3 MB   disk0s1
   2:                 Apple_APFS Container disk3         245.1 GB   disk0s2
   3:        Apple_APFS_Recovery Container disk2         5.4 GB     disk0s3

/dev/disk3 (synthesized):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      APFS Container Scheme -                      +245.1 GB   disk3
                                 Physical Store disk0s2
   1:                APFS Volume InternalDisk            10.3 GB    disk3s1
   2:                APFS Volume Preboot                 5.4 GB     disk3s2
   3:                APFS Volume Recovery                923.1 MB   disk3s3
   4:                APFS Volume Data                    768.6 MB   disk3s5
   5:                APFS Volume VM                      20.5 KB    disk3s6
   6:                APFS Volume ci_tempdir              9.2 MB     disk3s7
   7:                APFS Volume ci_homedir              1.1 GB     disk3s8

/dev/disk4 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *107.4 GB   disk4
   1:                        EFI EFI                     209.7 MB   disk4s1
   2:                 Apple_APFS Container disk5         107.2 GB   disk4s2

/dev/disk5 (synthesized):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      APFS Container Scheme -                      +107.2 GB   disk5
                                 Physical Store disk4s2
   1:                APFS Volume Macintosh HD - Data     5.4 GB     disk5s1
   2:                APFS Volume Macintosh HD            9.9 GB     disk5s2
   3:              APFS Snapshot com.apple.os.update-... 9.9 GB     disk5s2s1
   4:                APFS Volume Preboot                 5.6 GB     disk5s3
   5:                APFS Volume Recovery                820.5 MB   disk5s4
   6:                APFS Volume VM                      1.1 GB     disk5s5

MacM1-3:~ ec2-user$ mount
/dev/disk5s2s1 on / (apfs, sealed, local, read-only, journaled)
devfs on /dev (devfs, local, nobrowse)
/dev/disk5s5 on /System/Volumes/VM (apfs, local, noexec, journaled, nobrowse)
/dev/disk5s3 on /System/Volumes/Preboot (apfs, local, journaled, nobrowse)
/dev/disk1s2 on /System/Volumes/xarts (apfs, local, noexec, journaled, nobrowse)
/dev/disk1s1 on /System/Volumes/iSCPreboot (apfs, local, journaled, nobrowse)
/dev/disk1s3 on /System/Volumes/Hardware (apfs, local, journaled, nobrowse)
/dev/disk5s1 on /System/Volumes/Data (apfs, local, journaled, nobrowse)
map auto_home on /System/Volumes/Data/home (autofs, automounted, nobrowse)
/dev/disk3s4 on /private/tmp/tmp-mount-c74ANs (apfs, local, journaled, nobrowse)
/dev/disk3s7 on /private/tmp/ci (apfs, local, journaled, protect)
/dev/disk3s8 on /Users/MacM1-3-worker (apfs, local, journaled, protect)
Luap99 commented 3 months ago

Ok, so macOS uses dynamic volume sizes?! Maybe that has a bug somehow? Can we preallocate like 20+ GB for the volume?

cevich commented 3 months ago

> Maybe that has a bug somehow? Can we preallocate like 20+ GB for the volume?

Yes, this was exactly my thinking. Or maybe it causes a weird delay/syscall behavior. Remember, it's BSD, not Linux :wink: I'll keep searching.

mtrmac commented 3 months ago

> If we ran out of space, I would assume write would return an error and not block forever?

(At the end-user level, macOS uses quite a bit of disk space (frequently gigabytes) of “purgeable” storage. I don’t quite know how that works — running low on space tends to trigger attempts to delete some of that “purgeable” storage, but IIRC it is also quite possible to hit an “out of space” error when writing too fast. I really have no idea how much of that is happening at the UX “document” abstraction vs. the low-level filesystem syscall API, i.e. whether a write(2) can block while purging is happening. But all of this should only be applicable if we are close enough to running out.)


> Ok, so macOS uses dynamic volume sizes?!

My reading of https://developer.apple.com/support/downloads/Apple-File-System-Reference.pdf is that there is a single “allocation” layer per the whole container, with each volume allocating storage for a file/directory as necessary. (But there does exist a feature to reserve space for a volume, or to impose a quota on a volume.)

Luap99 commented 3 months ago

> (But there does exist a feature to reserve space for a volume, or to impose a quota on a volume.)

Yes, I found that too, see https://github.com/containers/automation/pull/204. I guess there's only one way to find out if this helps here.

Luap99 commented 3 months ago

Also FYI, I did some disk speed tests: a 10 GB (not sparse) file was written in about 12s at the beginning of the test, and 20 GB took 25s, so I really don't think IO speed is the problem.
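For the record, this kind of measurement can be as simple as timing a non-sparse sequential write; a throwaway sketch (size and path are illustrative):

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"
)

func main() {
	const size = int64(10) << 30 // 10 GiB of real (non-sparse) data
	buf := make([]byte, 1<<20)
	for i := range buf {
		buf[i] = 0xff // non-zero so nothing can turn it into a hole
	}

	f, err := os.Create("/private/tmp/ci/speedtest.bin")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	start := time.Now()
	for written := int64(0); written < size; written += int64(len(buf)) {
		if _, err := f.Write(buf); err != nil {
			log.Fatal(err)
		}
	}
	if err := f.Sync(); err != nil { // flush so the timing is honest
		log.Fatal(err)
	}
	elapsed := time.Since(start)
	fmt.Printf("wrote %d GiB in %v (%.0f MB/s)\n",
		size>>30, elapsed, float64(size)/elapsed.Seconds()/1e6)
}
```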

Luap99 commented 2 months ago

I bumped the timeout up to 10 mins and it passed on the first try; I will rerun several times to check.

podman inspect stop
/Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/inspect_test.go:13](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/inspect_test.go#L13)
  inspect two machines
  /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/inspect_test.go:23](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/inspect_test.go#L23)
  → Enter [BeforeEach] TOP-LEVEL - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/machine_test.go:160](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/machine_test.go#L160) @ 07/05/24 11:18:54.419
  ← Exit  [BeforeEach] TOP-LEVEL - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/machine_test.go:160](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/machine_test.go#L160) @ 07/05/24 11:18:54.42 (0s)
  → Enter [It] inspect two machines - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/inspect_test.go:23](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/inspect_test.go#L23) @ 07/05/24 11:18:54.42
  /Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine init --disk-size 11 --image /private/tmp/ci/podman-machine-daily.aarch64.applehv.raw foo1
  Machine init complete
  To start your machine run:

    podman machine start foo1

  /Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine init --disk-size 11 --image /private/tmp/ci/podman-machine-daily.aarch64.applehv.raw foo2
  Machine init complete
  To start your machine run:

    podman machine start foo2

  /Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine inspect --format {{.Name}} foo1
  foo1
  ← Exit  [It] inspect two machines - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/inspect_test.go:23](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/inspect_test.go#L23) @ 07/05/24 11:26:08.767 (7m14.196s)
  > Enter [DeferCleanup (Each)] podman inspect stop - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/config_init_test.go:83](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/config_init_test.go#L83) @ 07/05/24 11:26:08.767
  /Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine rm --force foo2
  < Exit [DeferCleanup (Each)] podman inspect stop - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/config_init_test.go:83](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/config_init_test.go#L83) @ 07/05/24 11:26:08.853 (86ms)
  > Enter [DeferCleanup (Each)] podman inspect stop - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/config_init_test.go:83](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/config_init_test.go#L83) @ 07/05/24 11:26:08.853
  /Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine rm --force foo1
  < Exit [DeferCleanup (Each)] podman inspect stop - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/config_init_test.go:83](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/config_init_test.go#L83) @ 07/05/24 11:26:08.935 (82ms)
  > Enter [DeferCleanup (Each)] TOP-LEVEL - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/machine_test.go:162](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/machine_test.go#L162) @ 07/05/24 11:26:08.935
  < Exit [DeferCleanup (Each)] TOP-LEVEL - /Users/MacM1-2-worker/ci/task-6339901111664640[/pkg/machine/e2e/machine_test.go:162](https://github.com/containers/podman/blob/39cfa349cc8fddb7c655a7ca542b58945cf85cff/pkg/machine/e2e/machine_test.go#L162) @ 07/05/24 11:26:08.936 (1ms)
• [434.366 seconds]

It took over 7 mins for this one test to run; all the others are much faster.

Luap99 commented 2 months ago

I think this is good to review; I had several runs without flakes now. There are still some debug commands in here to see the file sizes, but they should not hurt us and they make finding patterns easier, so I'd like to keep them for now.

So the fix for now is to bump the timeout to 10 minutes, as somehow it magically takes around 6-7m sometimes???

So far, the pattern I can see is: when it is super slow, the file does not seem sparse, and it is always one of the early tests (often the first one). Some tests write sparse files with different sizes (I would have assumed it to be deterministic, but it looks like it is not?!). Even tests that do not write the file sparse still write fast enough; it is just sometimes that it starts to hang.

  [du -ah /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv]
0   /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv/cache
8.1G    /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv/3f6176645637-arm64.raw
8.1G    /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv
  [ls -lh /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv]
total 8.1G
-rw-r--r-- 1 MacM1-3-worker staff 11G Jul  5 14:24 3f6176645637-arm64.raw
drwxr-xr-x 2 MacM1-3-worker staff  64 Jul  5 14:17 cache
  [stat /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv/3f6176645637-arm64.raw]
  File: /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv/3f6176645637-arm64.raw
  Size: 11811160064 Blocks: 16914816   IO Block: 4096   regular file
Device: 1,29    Inode: 28584       Links: 1
Access: (0644/-rw-r--r--)  Uid: (  502/MacM1-3-worker)   Gid: (   20/   staff)
Access: 2024-07-05 14:17:08.611603530 +0000
Modify: 2024-07-05 14:24:07.695944514 +0000
Change: 2024-07-05 14:24:07.695944514 +0000
 Birth: 2024-07-05 14:17:08.611603530 +0000
...
 [413.586 seconds]
mheon commented 2 months ago

Neat trick with the func() string using SIGABRT. LGTM

ashley-cui commented 2 months ago

LGTM, nice work

cevich commented 2 months ago

/lgtm

cevich commented 2 months ago

/hold cancel