Closed. Luap99 closed this pull request 2 months ago.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: Luap99
LGTM
@edsantiago @baude @ashley-cui PTAL
Let's see how this goes. I am not sure why all of a sudden the flake (https://github.com/containers/podman/issues/22551) is so much worse (maybe https://github.com/containers/podman/pull/23068?), but from my limited debugging it really looks like it is slow to copy the VM file on init, and there is no hang/lock or anything. So let's hope the extra minute is enough.
LGTM. Wonder why the other pr made it worse..
yeah I don't understand it either, maybe they are located on different disks/volumes which makes the copy slower, or something like that?
Yeah, really strange... maybe it's not slow IO after all. I have another idea that I can try.
/hold
// Interrupts the bar goroutine. It's important that
// bar.Abort(false) is called before p.Wait(), otherwise
// can hang.
Found this comment in the progress bar handling code, I pushed a commit to drop the progress bar on decompress. No idea if this is actually the problem but worth a try.
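For context, the constraint that comment describes can be modeled with a stdlib-only sketch (the `progress` and `bar` types below are hypothetical stand-ins I made up, not the real github.com/vbauerster/mpb API): `Wait()` blocks until every bar goroutine exits, so an interrupted copy must `Abort()` its bar first, or `Wait()` hangs forever.

```go
package main

import (
	"fmt"
	"sync"
)

// bar models an mpb-style progress bar: its render goroutine only
// exits once the bar is completed or aborted.
type bar struct {
	done chan struct{}
	once sync.Once
}

// Abort stops the bar's goroutine; safe to call more than once.
func (b *bar) Abort() { b.once.Do(func() { close(b.done) }) }

// progress models the container whose Wait blocks until all bars are done.
type progress struct{ wg sync.WaitGroup }

func (p *progress) addBar() *bar {
	b := &bar{done: make(chan struct{})}
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		<-b.done // stand-in for the render loop
	}()
	return b
}

func (p *progress) Wait() { p.wg.Wait() }

func main() {
	p := &progress{}
	b := p.addBar()
	// The copy was interrupted before the bar completed: without this
	// Abort, the Wait below would block forever, which is exactly the
	// hang the quoted comment warns about.
	b.Abort()
	p.Wait()
	fmt.Println("no hang") // → no hang
}
```

This is only a model of the ordering constraint, not a claim about where podman's hang actually was.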
Sigh. No luck.
Looks like it worked though? No mac flake, seems to have failed on linux on start with an entirely different stack trace.
Of course I might just be lucky on this run so I will repush again and see if it passes again.
<+ > # # podman events --since=the dawn of time...ish
# #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
# #| FAIL: exit code is 0; expected 125
# #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# # [teardown]
#
That is a new one to me, https://api.cirrus-ci.com/v1/artifact/task/6224808554266624/html/sys-podman-fedora-39-root-host-boltdb.log.html
But looks like all machine tests passed on the first try so that is good.
That looks ... impossible? The --filter code is deterministic, isn't it?
Until I saw this I would have said yes, now not so much... Anyhow, feel free to file a new flake report for that. It shouldn't be too difficult to follow the code there.
For now I focus on the machine madness and this looks good. I will repush one last time to be sure and even lower the timeout a bit.
I've reproduced it (the --since bug). Am gathering data, will file an issue.
I'm ok merging but would like us to consider fixing the root issue with the progress bar so we have a consistent user experience.
/lgtm
LGTM
(drive-by:) If it is a hang in the progress bar code, is it possible to collect a backtrace?
Sure, I have one; scroll below the failure: https://api.cirrus-ci.com/v1/artifact/task/5471682549252096/html/machine-mac-podman-darwin-rootless-host-sqlite.log.html#t--podman-machine-rm-Remove-machine--1
It was not obvious to me where it hangs there exactly; the main routine seems to be stuck in syscall.Write(...), so I originally assumed it hangs writing the file, but you can also see the mpb progress goroutines at the bottom, so I simply tried removing them, and that has now worked twice without the flake.
Thanks!
p.Wait(). The code is finalizing the write of the machine image, which ends with an (unknown) number of zero bytes; it might have done a seek to create a final sparse hole, and now it is writing the very last bytes, at most 1 kB of zeroes; AFAICT it's exactly 1 zero byte. I don't think the progress bar is (directly?) at fault here; it is "legitimately" waiting for the copy to finish. (Unless, I guess, the bar.ProxyReader were returning some wildly weird data, but that's fairly simple code; or maybe it changes the code path / data flow in io.Copy. Anyway…)
… It seems vanishingly unlikely to me that the backtrace would hit a 1-byte write, after writing all of the image … unless the destination of the decompression were on a filesystem that didn't actually support sparse files, causing that "1-byte" write to write many megabytes/gigabytes of data. Might that be the case? Or, a wildly unlikely guess, is the destination a memory / tmpfs filesystem thrashing because it is running out of memory?
Well, never mind then; looks like I just got lucky twice... It now failed again in the same way, so I guess it is not related to the progress bar.
I don't know what the underlying filesystem on our macos VMs is but AFAIK it is some network filesystem, maybe @cevich knows how this works?
… unless the destination of the decompression were on a filesystem that didn’t actually support sparse files, causing that “1-byte” write to write many megabytes/gigabytes of data. Might that be the case?
That sounds like the most likely scenario to me. But then again, other commands in the log are "fast", i.e. there are tests that run podman machine init in under 15s. So why does it copy so slowly only sometimes? And I had the timeout up to 5 mins without success.
/hold
There does not seem to be a point in merging this as it does not fix the flake.
FWIW I think I've hit some type of pull timeout pulling directly from a URL. Unsure if it's related, but maybe worth bringing up?
On a mac, this hangs for me, but downloading the image manually then pulling from the downloaded file does not.
podman machine init --image https://fedorapeople.org/groups/podman/testing/podman-machine-daily.aarch64.applehv.raw.zst
But then again, other commands in the log are "fast", i.e. there are tests that run podman machine init in under 15s. So why does it copy so slowly only sometimes?
FWIW historically, we’ve had trouble on cloud systems, where a VM has allocated some number of I/O operations (per second? per a time?), and things were going fast until the VM hit a limit, and suddenly things were much slower. But 15s vs. 5m does sound way too extreme.
The MacOS systems are an exception. They're using the (really fast) local SSD instead of SAN storage. In fact, it's more likely the networking bandwidth that's the limitation here since that's restricted by policy.
Tossing this out as an idea: what if the tests always used a local image? It shouldn't be too difficult for me to craft a cron(?) job (on each Mac) that simply curls the latest image and (atomically) sticks it into a directory with a uniform name. Then if pulling ever fails, the tests can at least (maybe?) continue with the previously curled image.
I don't think it's the pull from the network that's failing, as the pull from the network is only done once, at the beginning of the suite. The issue, I think, is the pull from that cached image that doesn't finish copying.
pull from that cached image that doesn't finish copying.
That's curious. Any chance we're running out of space or something silly like that? Can we make the test or podman more noisy about the exact error?
Edit: Update, just looked on the Macs, doesn't appear to be. The smallest available device is /System/Volumes/VM with ~78 GB on every system I looked at. But this problem afflicts Windows too, no? :thinking:
It is not related to networking, as @ashley-cui said it just extracts a local file.
The hang here in the stack trace is very consistent. It is always in sparseWriter.Close(), which writes the remaining zeros to the file. And the trace shows it being in the write syscall. Which means I find @mtrmac's theory the most likely: that the last write causes the file system to actually write all zeros out instead of making a sparse file. That said, if the CI systems are using local SSD storage, that should be more than fast enough to write the file, even in non-sparse mode, within the 3/4/5 min timeouts that were tried. And we see other tests doing the same machine init complete in less than 15s, so we can assume the IO is more than fast enough for our use cases.
If we were running out of space, I would assume write would return an error and not block forever?
time="2024-07-03T09:10:20Z" level=error msg="sw.pendingZeroes: 6341788068"
time="2024-07-03T09:10:20Z" level=error msg="sw.pendingZeroes after hole: 1"
These are the sizes in Close(), and they are consistent for all commands, working and hanging (which is expected, as the decompression must be deterministic). As such I would assume it is not related to the sparseWriter code but rather something broken in the darwin kernel/fs that is blocking the write.
And 6341788068 bytes is about 6.3 GB, so even if the fs cannot do sparse files, an SSD should be able to write 6 GB in less than a minute, certainly faster than our 3, 4, or 5 minute timeouts that always failed.
6: APFS Volume ci_tempdir 8.3 GB disk3s7
7: APFS Volume ci_homedir 4.9 GB disk3s8
All the images are using the tmpdir. @cevich just to rule out the space theory can we increase the volume sizes for the worker pool?
My (maybe incorrect) understanding is all the volumes share space and should expand as needed. Perhaps this is where things are breaking? Either way, I can look into it. What's the path where the file is being written?
Edit: Ya, volumes share space of their "container". I'll still go in see what I can see.
All the files are written under the tmpdir (/private/tmp/ci/...)
For ref. this is what I see currently on one of the macs:
MacM1-3:~ ec2-user$ sudo diskutil list
/dev/disk0 (internal, physical):
#: TYPE NAME SIZE IDENTIFIER
0: GUID_partition_scheme *251.0 GB disk0
1: Apple_APFS_ISC Container disk1 524.3 MB disk0s1
2: Apple_APFS Container disk3 245.1 GB disk0s2
3: Apple_APFS_Recovery Container disk2 5.4 GB disk0s3
/dev/disk3 (synthesized):
#: TYPE NAME SIZE IDENTIFIER
0: APFS Container Scheme - +245.1 GB disk3
Physical Store disk0s2
1: APFS Volume InternalDisk 10.3 GB disk3s1
2: APFS Volume Preboot 5.4 GB disk3s2
3: APFS Volume Recovery 923.1 MB disk3s3
4: APFS Volume Data 768.6 MB disk3s5
5: APFS Volume VM 20.5 KB disk3s6
6: APFS Volume ci_tempdir 9.2 MB disk3s7
7: APFS Volume ci_homedir 1.1 GB disk3s8
/dev/disk4 (external, physical):
#: TYPE NAME SIZE IDENTIFIER
0: GUID_partition_scheme *107.4 GB disk4
1: EFI EFI 209.7 MB disk4s1
2: Apple_APFS Container disk5 107.2 GB disk4s2
/dev/disk5 (synthesized):
#: TYPE NAME SIZE IDENTIFIER
0: APFS Container Scheme - +107.2 GB disk5
Physical Store disk4s2
1: APFS Volume Macintosh HD - Data 5.4 GB disk5s1
2: APFS Volume Macintosh HD 9.9 GB disk5s2
3: APFS Snapshot com.apple.os.update-... 9.9 GB disk5s2s1
4: APFS Volume Preboot 5.6 GB disk5s3
5: APFS Volume Recovery 820.5 MB disk5s4
6: APFS Volume VM 1.1 GB disk5s5
MacM1-3:~ ec2-user$ mount
/dev/disk5s2s1 on / (apfs, sealed, local, read-only, journaled)
devfs on /dev (devfs, local, nobrowse)
/dev/disk5s5 on /System/Volumes/VM (apfs, local, noexec, journaled, nobrowse)
/dev/disk5s3 on /System/Volumes/Preboot (apfs, local, journaled, nobrowse)
/dev/disk1s2 on /System/Volumes/xarts (apfs, local, noexec, journaled, nobrowse)
/dev/disk1s1 on /System/Volumes/iSCPreboot (apfs, local, journaled, nobrowse)
/dev/disk1s3 on /System/Volumes/Hardware (apfs, local, journaled, nobrowse)
/dev/disk5s1 on /System/Volumes/Data (apfs, local, journaled, nobrowse)
map auto_home on /System/Volumes/Data/home (autofs, automounted, nobrowse)
/dev/disk3s4 on /private/tmp/tmp-mount-c74ANs (apfs, local, journaled, nobrowse)
/dev/disk3s7 on /private/tmp/ci (apfs, local, journaled, protect)
/dev/disk3s8 on /Users/MacM1-3-worker (apfs, local, journaled, protect)
Ok so MacOS uses dynamic volume sizes?! Maybe that has a bug somehow? Can we preallocate like 20+ GB for the volume?
Maybe that has a bug somehow? Can we preallocate like 20+ GB for the volume?
Yes, this was exactly my thinking. Or maybe it causes a weird delay/syscall behavior. Remember it's BSD not Linux :wink: I'll keep searching.
If we were running out of space, I would assume write would return an error and not block forever?
(At the end-user level, macOS uses quite a bit of disk space (frequently gigabytes) of “purgeable” storage. I don’t quite know how that works — running low on space tends to trigger attempts to delete some of that “purgeable” storage, but IIRC it is also quite possible to hit an “out of space” error when writing too fast. I really have no idea how much of that is happening at the UX “document” abstraction vs. the low-level filesystem syscall API, i.e. whether a write(2)
can block while purging is happening. But all of this should only be applicable if we are close enough to running out.)
Ok so MacOS uses dynamic volume sizes?!
My reading of https://developer.apple.com/support/downloads/Apple-File-System-Reference.pdf is that there is a single “allocation” layer per the whole container, with each volume allocating storage for a file/directory as necessary. (But there does exist a feature to reserve space for a volume, or to impose a quota on a volume.)
(But there does exist a feature to reserve space for a volume, or to impose a quota on a volume.)
yes I found that too, see https://github.com/containers/automation/pull/204 I guess only one way to find out if this helps here.
Also, FYI, I did some disk speed tests: a 10 GB (non-sparse) file was written in about 12s at the beginning of the test, and 20 GB took 25s, so I really don't think IO speed is the problem.
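That kind of check can be scripted; below is a rough sketch of a sequential-write throughput probe (my own construction, not the command actually used, with a deliberately small default size; the buffer is filled with non-zero bytes so nothing can be turned into a sparse hole).

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// writeSpeed writes total bytes of non-zero data to a temp file in
// 1 MiB chunks, fsyncs, and returns the throughput in MB/s.
func writeSpeed(total int) (float64, error) {
	const chunk = 1 << 20
	f, err := os.CreateTemp("", "iospeed-")
	if err != nil {
		return 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()
	buf := make([]byte, chunk)
	for i := range buf {
		buf[i] = 0xAA // non-zero so the filesystem cannot sparsify it
	}
	start := time.Now()
	for written := 0; written < total; written += chunk {
		if _, err := f.Write(buf); err != nil {
			return 0, err
		}
	}
	if err := f.Sync(); err != nil { // flush the page cache to disk
		return 0, err
	}
	return float64(total) / 1e6 / time.Since(start).Seconds(), nil
}

func main() {
	// 256 MiB here; bump toward 10-20 GB to match the test described above.
	mbps, err := writeSpeed(256 << 20)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.1f MB/s\n", mbps)
}
```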
I bumped the timeout up to 10 mins and it passed on first try, I will rerun several times to check
podman inspect stop
/Users/MacM1-2-worker/ci/task-6339901111664640/pkg/machine/e2e/inspect_test.go:13
inspect two machines
/Users/MacM1-2-worker/ci/task-6339901111664640/pkg/machine/e2e/inspect_test.go:23
→ Enter [BeforeEach] TOP-LEVEL - /Users/MacM1-2-worker/ci/task-6339901111664640/pkg/machine/e2e/machine_test.go:160 @ 07/05/24 11:18:54.419
← Exit [BeforeEach] TOP-LEVEL - /Users/MacM1-2-worker/ci/task-6339901111664640/pkg/machine/e2e/machine_test.go:160 @ 07/05/24 11:18:54.42 (0s)
→ Enter [It] inspect two machines - /Users/MacM1-2-worker/ci/task-6339901111664640/pkg/machine/e2e/inspect_test.go:23 @ 07/05/24 11:18:54.42
/Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine init --disk-size 11 --image /private/tmp/ci/podman-machine-daily.aarch64.applehv.raw foo1
Machine init complete
To start your machine run:
podman machine start foo1
/Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine init --disk-size 11 --image /private/tmp/ci/podman-machine-daily.aarch64.applehv.raw foo2
Machine init complete
To start your machine run:
podman machine start foo2
/Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine inspect --format {{.Name}} foo1
foo1
← Exit [It] inspect two machines - /Users/MacM1-2-worker/ci/task-6339901111664640/pkg/machine/e2e/inspect_test.go:23 @ 07/05/24 11:26:08.767 (7m14.196s)
> Enter [DeferCleanup (Each)] podman inspect stop - /Users/MacM1-2-worker/ci/task-6339901111664640/pkg/machine/e2e/config_init_test.go:83 @ 07/05/24 11:26:08.767
/Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine rm --force foo2
< Exit [DeferCleanup (Each)] podman inspect stop - /Users/MacM1-2-worker/ci/task-6339901111664640/pkg/machine/e2e/config_init_test.go:83 @ 07/05/24 11:26:08.853 (86ms)
> Enter [DeferCleanup (Each)] podman inspect stop - /Users/MacM1-2-worker/ci/task-6339901111664640/pkg/machine/e2e/config_init_test.go:83 @ 07/05/24 11:26:08.853
/Users/MacM1-2-worker/ci/task-6339901111664640/bin/darwin/podman machine rm --force foo1
< Exit [DeferCleanup (Each)] podman inspect stop - /Users/MacM1-2-worker/ci/task-6339901111664640/pkg/machine/e2e/config_init_test.go:83 @ 07/05/24 11:26:08.935 (82ms)
> Enter [DeferCleanup (Each)] TOP-LEVEL - /Users/MacM1-2-worker/ci/task-6339901111664640/pkg/machine/e2e/machine_test.go:162 @ 07/05/24 11:26:08.935
< Exit [DeferCleanup (Each)] TOP-LEVEL - /Users/MacM1-2-worker/ci/task-6339901111664640/pkg/machine/e2e/machine_test.go:162 @ 07/05/24 11:26:08.936 (1ms)
• [434.366 seconds]
It took over 7 mins for this one test to run, all others are much faster.
I think this is good to review; I had several runs without flakes now. There are still some debug commands in here to see the file sizes, but they should not hurt us and make finding patterns easier, so I'd like to keep them for now.
So the fix for now is to bump the timeout to 10 minutes, as somehow it magically takes around 6-7m sometimes???
So far the pattern I can see is: when it is super slow the file does not seem sparse, and it is always one of the early tests (often the first one). Some tests write sparse files with different sizes (I would have assumed it to be deterministic, but it looks like it is not?!). Even tests that do not write the file sparse still write fast enough; it is just sometimes that it starts to hang.
[du -ah /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv]
0 /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv/cache
8.1G /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv/3f6176645637-arm64.raw
8.1G /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv
[ls -lh /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv]
total 8.1G
-rw-r--r-- 1 MacM1-3-worker staff 11G Jul 5 14:24 3f6176645637-arm64.raw
drwxr-xr-x 2 MacM1-3-worker staff 64 Jul 5 14:17 cache
[stat /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv/3f6176645637-arm64.raw]
File: /private/tmp/ci/podman_test1707448420/.local/share/containers/podman/machine/applehv/3f6176645637-arm64.raw
Size: 11811160064 Blocks: 16914816 IO Block: 4096 regular file
Device: 1,29 Inode: 28584 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 502/MacM1-3-worker) Gid: ( 20/ staff)
Access: 2024-07-05 14:17:08.611603530 +0000
Modify: 2024-07-05 14:24:07.695944514 +0000
Change: 2024-07-05 14:24:07.695944514 +0000
Birth: 2024-07-05 14:17:08.611603530 +0000
...
[413.586 seconds]
Neat trick with the func() string using SIGABRT. LGTM
LGTM, nice work
/lgtm
/hold cancel
There seems to be a very bad flake where podman machine init takes too long. It doesn't seem to be deadlocked; it just has to copy the entire image, which is super slow in CI for some reason.
As such, increase the timeout by 1m to 5m total. Hopefully this is good enough for now. Second, in case we still hit a timeout, send SIGABRT, which makes Go dump the stack traces so we know whether it seems to be doing something or hangs on a lock.
Does this PR introduce a user-facing change?