actions / runner-images

GitHub Actions runner images

XCode simulators failing when booting or using openurl #7971

Closed fpkamp closed 5 months ago

fpkamp commented 10 months ago

Description

Hi, I have been redirected from GitHub Support to describe our use case and perhaps influence future performance of your runner for macOS. We have a use case where we need to use Xcode 14.3.1 and Xcode 15.0 on the macOS 13 runner. We create iPhone simulators, navigate to a URL, and then dispose of the simulator. All of the above works pretty well locally, but it fails when executing in GitHub Actions. The behavior differs between Xcode versions: Xcode 14 times out on booting or navigating, whereas Xcode 15 fails on other functions (perhaps an effect of changes to Xcode itself) like binding launchd_sim. An example failure message I get would be:

An error was encountered processing the command (domain=NSPOSIXErrorDomain, code=60):
Unable to boot the Simulator.
launchd failed to respond.
Underlying error (domain=com.apple.SimLaunchHostService.RequestError, code=4):
    Failed to start launchd_sim: could not bind to session, launchd_sim may have crashed or quit responding

In the ticket that originally redirected me here, Arthur says he was successful with adding a cache clean (~/Library/Caches/ and ~/Library/Developer/CoreSimulator/Caches/) and waiting 60-120 seconds. However, upon trying that I observed that it only works for a single simulator, and subsequent simulators would fail (at least using a rinse and repeat approach).
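For readers who want to try the same workaround, here is a minimal sketch of the cache clean plus wait as a reusable shell function. The function name `clean_sim_caches` and its two parameters are made up for illustration and do not come from the support ticket. Note that it uses `find` rather than `rm -r dir/*`, so it also removes hidden files, which is a small difference from the commands in the repro script.

```shell
# clean_sim_caches [HOME_DIR] [WAIT_SECONDS]
# Hypothetical helper: empties the two cache directories mentioned above,
# then waits before the next simulator command is issued.
clean_sim_caches() {
  local home_dir="${1:-$HOME}" wait_seconds="${2:-60}"
  local dir
  for dir in "$home_dir/Library/Caches" \
             "$home_dir/Library/Developer/CoreSimulator/Caches"; do
    if [ -d "$dir" ]; then
      # Delete the contents but keep the directory itself.
      find "$dir" -mindepth 1 -delete
    fi
  done
  sleep "$wait_seconds"
}
```

Calling `clean_sim_caches "$HOME" 120` before each `simctl openurl` would mirror the 120 second wait used in the repro script.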

Platforms affected

Runner images affected

Image version and build link

version: 20230611.2 workflow run: https://github.com/fingerprintjs/fingerprintjs-pro/actions/runs/5646633166/job/15295123030

Is it regression?

no

Expected behavior

Simulators work smoothly and boot / open urls without crashing.

Actual behavior

Attempts to boot simulators and navigate to a URL fail very frequently.

Repro steps

Use the following script in a macOS 13 runner workflow:

open /Applications/Xcode_14.3.1.app/Contents/Developer/Applications/Simulator.app/
phone1=$(/Applications/Xcode_14.3.1.app/Contents/Developer/usr/bin/simctl create iPhone-hzso25lt6h9 com.apple.CoreSimulator.SimDeviceType.iPhone-14 com.apple.CoreSimulator.SimRuntime.iOS-16-4)
echo "${phone1}"
boot1=$(/Applications/Xcode_14.3.1.app/Contents/Developer/usr/bin/simctl boot ${phone1})
echo "${boot1}"
phone2=$(/Applications/Xcode_14.3.1.app/Contents/Developer/usr/bin/simctl create iPhone-sios1839ti com.apple.CoreSimulator.SimDeviceType.iPhone-14 com.apple.CoreSimulator.SimRuntime.iOS-17-0)
echo "${phone2}"
boot2=$(/Applications/Xcode_14.3.1.app/Contents/Developer/usr/bin/simctl boot ${phone2})
rm -r ~/Library/Caches/*
rm -r ~/Library/Developer/CoreSimulator/Caches/*
sleep 120
nav1=$(/Applications/Xcode_14.3.1.app/Contents/Developer/usr/bin/simctl openurl booted 'https://google.com')
echo "${nav1}"
echo "${boot2}"
rm -r ~/Library/Caches/*
rm -r ~/Library/Developer/CoreSimulator/Caches/*
sleep 120
nav2=$(/Applications/Xcode_14.3.1.app/Contents/Developer/usr/bin/simctl openurl ${phone2} 'https://google.com')
echo "${nav2}"
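The script above issues each simctl command once and relies on a fixed sleep. As a possible variation (an assumption on my part, not something tried in this issue), a small retry wrapper could re-run a failing command a few times before giving up. The `retry` helper below is hypothetical, and the usage line in the comment has not been run on a hosted runner.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or ATTEMPTS is reached.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "retry: '$*' failed after $n attempts" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Possible usage on a macOS runner (not executed here):
#   retry 3 30 xcrun simctl bootstatus "${phone1}" -b
```

Whether retries actually help depends on why launchd_sim fails; if the runner is simply overloaded, repeating the boot may only add load.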

ilia-shipitsin commented 6 months ago

@mikehardy , what caught my eye is the really huge simulator log. I noticed that the tests run in debug mode. Can we try to either disable logging or reduce it to release level?

I collected logs from macos-12 and macos-13 and parsed them with Microsoft LogParser using the following query:

SELECT COUNT(*) AS Total, SUBSTR(EXTRACT_PREFIX(Field1,0,'['),24) AS Service FROM 'C:\i\simulator-log\13\simulator.log'
WHERE Field1 like '2023%'
GROUP BY Service
ORDER BY Total DESC
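For anyone without LogParser, a rough shell equivalent of the query above might look like the sketch below. It rests on two assumptions that the thread does not confirm: that each log line starts with a 24 character timestamp prefix (which is what the `SUBSTR(..., 24)` offset suggests), and that lines without a `[` can be skipped. The function name `count_by_service` is made up for illustration.

```shell
# count_by_service LOGFILE [LINE_PREFIX]
# Counts log lines per service, where the service name is the text between
# column 25 and the first "[" on lines that start with LINE_PREFIX.
count_by_service() {
  local logfile="$1" prefix="${2:-2023}"
  awk -v prefix="$prefix" '
    index($0, prefix) == 1 {
      cut = index($0, "[")
      if (cut <= 25) next            # no "[" after the assumed timestamp prefix
      counts[substr($0, 25, cut - 25)]++
    }
    END { for (s in counts) printf "%d\t%s\n", counts[s], s }
  ' "$logfile" | sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

Running `count_by_service simulator.log` prints one tab separated `count service` pair per line, highest count first.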

image

image

something interesting with SpringBoard. what's that )) ?

(well, I suspect it might be a regression on XCode itself or some side effect of running simulators on virtualized hardware, @NorseGaud do you have any idea why SpringBoard can be that noisy under anka platform maybe ?)

ilia-shipitsin commented 6 months ago

the most noisy SpringBoard activity is [com.apple.xpc:connection]

image

ilia-shipitsin commented 6 months ago

as for "xpc connection" ....

image

mikehardy commented 6 months ago

@mikehardy , what caught my eye is the really huge simulator log. I noticed that the tests run in debug mode. Can we try to either disable logging or reduce it to release level?

Well, anything is possible @ilia-shipitsin :-) - my use case for the simulator.log is so that I can troubleshoot things like this when they go wrong though - it is not always our app, sometimes it is some rogue networking thing though, and the simulator.log is how it is discovered - as you've done (awesome by the way)

I'm willing to try anything but I don't know how I'll do a more efficient job than you are already doing on the fork. From the perspective of our workflows the simulator.log is not used as any sort of end product, it is just for troubleshooting, so you could disable the simulator.log capture / zip / upload entirely and our workflow would still be doing its job (assuming there wasn't an app crash where I needed to grab the stack trace...)

I'm going to guess though, disabling the simulator log will provide a substantial speedup by removing IO, but if you assume that the job is compute-bound and the IO is streaming/buffered then my guess is the macos13/Xcode15/ios17 slowdown we are all seeing is because the Simulator's networking/springboard subsystem is doing something terrible that is also CPU heavy, so the speedup will not be nearly enough to get back to macOS12/Xcode14/ios16 levels

https://developer.apple.com/documentation/xpc/1448777-xpc_connection_cancel - XPC is some interprocess communication, not sure which processes are trying to communicate and/or why it's failing but something is definitely not happy on these new versions of the iOS simulator stack

ilia-shipitsin commented 6 months ago

@mikehardy , we did more investigation (will provide result later). also, we tried to run react-native-firebase on arm64 (also virtualized) runner

https://github.com/bbq-beets/react-native-firebase/actions/runs/7030335355/job/19129684183

as far as I understand, arm64 does not support nested virtualization, but it should only affect android simulators, not iOS.

can we add some debug to find why 09:45:59.561 detox[12510] ERROR: [APP_UNREACHABLE] Detox can't seem to connect to the test app(s)! ?

mikehardy commented 6 months ago

@ilia-shipitsin sorry for the delay!

The test app connection is the last step in a wobbly tower of things that have to go correctly. It happens only after the test infrastructure has correctly requested Simulator start, the operating system has started the simulator correctly, the simulator has booted completely and the test infrastructure can see it is up, the test infrastructure has correctly loaded the app onto the simulator and asked it to start, and finally the app itself has started completely, fetched the javascript bundle from the bundle server, loaded it and started executing

Determining why it did not start is what the simulator.log running in Debug is for :-)

I look for a few markers to see what stage things got to. One is the case-insensitive string "crashlytics" - if this never shows up, the app native code never booted and ran so as a binary search, we're looking at simulator startup failure or app load/start failure (where failure may be timeout / just took too long)

If "crashlytics" shows up then things should start happening pretty quickly and I search for the app name and/or "react" and/or "rnfb" to see if the javascript bundle loaded and app components started loading and getting chatty
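The marker search described above could be scripted roughly as follows. This is only a sketch: the function name `startup_stage` and the three stage labels are invented here, and the default pattern `react|rnfb` stands in for "the app name and/or react and/or rnfb". It greps the file twice, which is simple but slow on a multi-gigabyte log.

```shell
# startup_stage LOGFILE [APP_PATTERN]
# Prints a rough guess at how far app startup got, based on log markers.
startup_stage() {
  local logfile="$1" app_pattern="${2:-react|rnfb}"
  if ! grep -qi 'crashlytics' "$logfile"; then
    # Native code never ran: simulator startup or app load/start failed.
    echo "native-not-started"
  elif ! grep -qiE "$app_pattern" "$logfile"; then
    # Native code ran, but no sign of the javascript bundle yet.
    echo "native-started"
  else
    echo "js-loaded"
  fi
}
```

For example, `startup_stage simulator.log 'myapp|react|rnfb'` would add a hypothetical app name to the second check.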

My guess is this was just more poor performance (a 2GB+ log file! ouch) meaning things were happening slowly enough that the test infrastructure considered it a failure and timed it out

jeanregisser commented 6 months ago

Hey, I'm dealing with similar issues using macos-13-xlarge (i.e. running on M1).

I was unable to get Detox to work using multiple workers. Apparently it fails to boot the simulators, or they are really slow. I see a lot of 14:34:36.027 detox[60482] i Error: Unable to update lock within the stale threshold right after starting the detox test. See the full logs.

Then I tried using a single worker. And things got better. But it was still quite slow and it reached the 45 mins timeout I had set for that part of the workflow. See the full logs.

This was using Xcode 14.3.1 and iOS 16.4 simulators.

Is there a limitation on how many simulators can be run? Well, there's always a limit of course, but I'd expect to be able to run more than 1 simulator. Note: we also have dedicated macOS runners we maintain which are able to spin up 6 simulators without sweating. And that's still on Intel CPUs (i7/3.2GHz/6C/64G). We were hoping to replace them with the new macos-13-xlarge runners.

Anyway, thanks for all the useful info in this thread. I'm gonna try a few more things. But let me know if there's anything I can do to help. 🙏

ilia-shipitsin commented 6 months ago

crashlytics

thank you, it was helpful.

from current observation it looks like simulators are created on arm64, but due to degraded performance it looks like they are not responsive.

NorseGaud commented 6 months ago

Quick update from my testing with react-native-firebase:

If I run this manually, I can choose the GPU type in the Simulator menu and see different times with a slight performance improvement using Integrated GPU.

Integrated GPU in Simulator:
SIMCTL_CHILD_GULGeneratedClassDisposeDisabled=1 ./node_modules/.bin/nyc yarn   13.97s user 2.49s system 6% cpu 4:29.95 total

Discrete GPU in Simulator:
SIMCTL_CHILD_GULGeneratedClassDisposeDisabled=1 ./node_modules/.bin/nyc yarn   14.56s user 3.61s system 5% cpu 5:07.03 total

Not much of a difference, but it's something to note as a possibility for improvement.
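To put a number on the two `total` times above, here is a small sketch that converts the `M:SS.ss` totals to seconds and computes the relative difference. The helper names `to_seconds` and `percent_faster` are made up for illustration.

```shell
# to_seconds TOTAL
# Converts a "M:SS.ss" (or "H:MM:SS.ss") wall-clock total to seconds.
to_seconds() {
  awk -v t="$1" 'BEGIN {
    n = split(t, part, ":")
    s = 0
    for (i = 1; i <= n; i++) s = s * 60 + part[i]
    printf "%.2f\n", s
  }'
}

# percent_faster FAST_TOTAL SLOW_TOTAL
# Prints how much shorter FAST_TOTAL is, as a percentage of SLOW_TOTAL.
percent_faster() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
    'BEGIN { printf "%.1f\n", (b - a) / b * 100 }'
}

percent_faster 4:29.95 5:07.03   # Integrated vs Discrete GPU run
```

With the figures above this works out to 269.95 s versus 307.03 s, so the Integrated GPU run was about 12% shorter. These are single runs, so the gap may be within normal run-to-run variation.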

The messaging() tests time out in any kind of virtualization, but it seems like all other tests work fine. Maybe relevant for @mikehardy. I collected the Console logs from macOS while the test was seemingly hanging.

console-messaging-full.log console-messaging-errors-only.log

I don't know what firebase.messaging in the test is actually doing, but I do think there may be something from the Console logs that the developers could see that helps us pinpoint for Apple what's wrong.

mikehardy commented 6 months ago

Interesting @NorseGaud - do you have a workflow run URL you can point me to where you extracted those logs? Or could you specify the execution environment? I'm going to go on a hunch that this was an Apple Silicon machine of some sort?

That is a difference (intel silicon mac vs apple silicon mac) that changes the messaging testing, as apple silicon macs with the latest simulators can actually generate APNS tokens and receive APNS messages, and we attempt to test that if it's a recent enough simulator on apple silicon. It should work of course, but for the purposes of the testing here, we are an open source repository limited to running on the currently available intel silicon runners, so I don't think it's germane to the current focus here, unfortunately

NorseGaud commented 6 months ago

do you have a workflow run URL you can point me to where you extracted those logs

I don't :( I set up the project manually in a VM of several virtualization tools and got the same results across them all. The error is identical to the ones in the github runners.

Specs for the logs provided:

Understood about not testing on Arm, though the result is exactly the same on macOS 13-14 and Xcode 15.x, regardless of the architecture. It's just faster and easier for me to test on ARM right now :)

Unless there is something in these logs that indicates a problem with a service that VMs don't have or isn't working right, I do worry we won't be able to do much until we can describe in detail what Apple has to fix :(

NorseGaud commented 6 months ago

I am unable to reproduce boot issues anymore with iOS 17.2. The simulators load just fine (but are still a bit heavy on usage) for my test apps and no longer hang.

Regarding certain functions, I don't see a difference though.

mikehardy commented 6 months ago

Good to know @NorseGaud, but unfortunately it does not look like that's available in hosted runners yet https://github.com/actions/runner-images/blob/main/images/macos/macos-13-Readme.md#xcode

NorseGaud commented 6 months ago

Sorry @mikehardy , I was speaking about simulator booting issues. The issues with your tests persist AFAIK.