Closed fpkamp closed 9 months ago
@mikehardy, what caught my eye is the really huge simulator log. I noticed that the tests run in debug mode. Can we try to either disable logging or reduce it to release level?
I collected logs from macos-12 and macos-13 and parsed them with Microsoft LogParser using the following query:
SELECT COUNT(*) AS Total, SUBSTR(EXTRACT_PREFIX(Field1,0,'['),24) AS Service FROM 'C:\i\simulator-log\13\simulator.log'
WHERE Field1 like '2023%'
GROUP BY Service
ORDER BY Total DESC
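For anyone without LogParser at hand, roughly the same per-service count can be sketched with awk. This is only an approximation of the query above: it assumes LogParser's SUBSTR offset is zero-based (so the service name starts at character 25 of each line) and that lines without a '[' are kept whole. count_by_service is a made-up helper name, not part of any tool.

```shell
#!/usr/bin/env bash
# Count log lines per service name, most frequent first.
# Reads the files given as arguments, or stdin when none are given.
count_by_service() {
  awk '
    /^2023/ {
      # Keep only the text before the first "[" (assumed to end the service name).
      bracket = index($0, "[")
      prefix = (bracket > 0) ? substr($0, 1, bracket - 1) : $0
      # Skip the first 24 characters (assumed to be the timestamp).
      service = substr(prefix, 25)
      counts[service]++
    }
    END { for (s in counts) printf "%d\t%s\n", counts[s], s }
  ' "$@" | sort -t "$(printf '\t')" -k1,1nr
}
```

Something like count_by_service simulator.log | head should then print tab-separated counts and service names, noisiest first.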
Something interesting is going on with SpringBoard. What is that? ))
(well, I suspect it might be a regression in Xcode itself or some side effect of running simulators on virtualized hardware. @NorseGaud, do you have any idea why SpringBoard might be that noisy on the Anka platform?)
the most noisy SpringBoard activity is [com.apple.xpc:connection]
as for "xpc connection" ....
Well, anything is possible @ilia-shipitsin :-) - my use case for the simulator.log is so that I can troubleshoot things like this when they go wrong though - it is not always our app, sometimes it is some rogue networking thing though, and the simulator.log is how it is discovered - as you've done (awesome by the way)
I'm willing to try anything but I don't know how I'll do a more efficient job than you are already doing on the fork. From the perspective of our workflows the simulator.log is not used as any sort of end product, it is just for troubleshooting, so you could disable the simulator.log capture / zip / upload entirely and our workflow would still be doing its job (assuming there wasn't an app crash where I needed to grab the stack trace...)
I'm going to guess though, disabling the simulator log will provide a substantial speedup by removing IO, but if you assume that the job is compute-bound and the IO is streaming/buffered then my guess is the macos13/Xcode15/ios17 slowdown we are all seeing is because the Simulator's networking/springboard subsystem is doing something terrible that is also CPU heavy, so the speedup will not be nearly enough to get back to macOS12/Xcode14/ios16 levels
https://developer.apple.com/documentation/xpc/1448777-xpc_connection_cancel - XPC is some interprocess communication, not sure which processes are trying to communicate and/or why it's failing but something is definitely not happy on these new versions of the iOS simulator stack
@mikehardy, we did more investigation (will provide results later). Also, we tried to run react-native-firebase on an arm64 (also virtualized) runner
https://github.com/bbq-beets/react-native-firebase/actions/runs/7030335355/job/19129684183
as far as I understand, arm64 does not support nested virtualization, but it should only affect android simulators, not iOS.
Can we add some debug output to find out why this happens: 09:45:59.561 detox[12510] ERROR: [APP_UNREACHABLE] Detox can't seem to connect to the test app(s)!
@ilia-shipitsin sorry for the delay!
The test app connection is the last step in a wobbly tower of things that have to go correctly. It only happens once the test infrastructure has correctly requested Simulator start, the operating system has started the simulator correctly, the simulator has booted completely and the test infrastructure can see it is up, the test infrastructure has correctly loaded the app onto the simulator and asked it to start, and finally the app itself has started completely, fetched the javascript bundle from the bundle server, loaded it and started executing
Determining why it did not start is what the simulator.log running in Debug is for :-)
I look for a few markers to see what stage things got to. One is the case-insensitive string "crashlytics" - if this never shows up, the app native code never booted and ran so as a binary search, we're looking at simulator startup failure or app load/start failure (where failure may be timeout / just took too long)
If "crashlytics" shows up then things should start happening pretty quickly and I search for the app name and/or "react" and/or "rnfb" to see if the javascript bundle loaded and app components started loading and getting chatty
My guess is this was just more poor performance (a 2GB+ log file! ouch) meaning things were happening slowly enough that the test infrastructure considered it a failure and timed it out
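The marker check described above can be sketched as a small shell helper. This is a rough sketch, not the workflow's actual tooling: startup_stage and its stage labels are invented here, and matching "react" / "rnfb" case-insensitively is an assumption (only "crashlytics" was stated to be case-insensitive). Searching for the app name is left out since it depends on the project.

```shell
#!/usr/bin/env bash
# Guess how far app startup got from markers in a simulator.log.
# Prints one of three hypothetical stage labels.
startup_stage() {
  local log_file=$1
  if ! grep -qi 'crashlytics' "$log_file"; then
    # Native code never ran: simulator startup or app load/start failed.
    echo "native-not-started"
  elif grep -qiE 'react|rnfb' "$log_file"; then
    # Native code ran and the javascript side became chatty.
    echo "js-started"
  else
    # Native code ran, but there is no sign of the javascript bundle loading.
    echo "native-started-no-js"
  fi
}
```

Running startup_stage simulator.log gives a quick first split before digging through the log by hand. grep -q stops at the first match, which helps a little with multi-gigabyte logs.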
Hey, I'm dealing with similar issues using macos-13-xlarge (i.e. running on M1).
I was unable to get Detox to work using multiple workers.
Apparently it fails to boot the simulators, or they are really slow.
I see a lot of 14:34:36.027 detox[60482] i Error: Unable to update lock within the stale threshold right after starting the detox test.
See the full logs.
Then I tried using a single worker. And things got better. But it was still quite slow and it reached the 45 mins timeout I had set for that part of the workflow. See the full logs.
This was using Xcode 14.3.1 and iOS 16.4 simulators.
Is there a limitation on how many simulators can be run?
Well there's always a limit of course, but I'd expect to be able to run more than 1 simulator.
Note: we also have dedicated macOS runners we maintain which are able to spin up 6 simulators without sweating. And that's still on Intel CPUs (i7/3.2GHz/6C/64G). We were hoping to replace them with the new macos-13-xlarge runners.
Anyway, thanks for all the useful info in this thread. I'm gonna try a few more things. But let me know if there's anything I can do to help. 🙏
Re "crashlytics": thank you, it was helpful.
from current observation it looks like simulators are created on arm64, but due to degraded performance it looks like they are not responsive.
Quick update from my testing with react-native-firebase:
If I run this manually, I can choose the GPU type in the Simulator menu and see different times with a slight performance improvement using Integrated GPU.
Integrated GPU in Simulator:
SIMCTL_CHILD_GULGeneratedClassDisposeDisabled=1 ./node_modules/.bin/nyc yarn 13.97s user 2.49s system 6% cpu 4:29.95 total
Discrete GPU in Simulator:
SIMCTL_CHILD_GULGeneratedClassDisposeDisabled=1 ./node_modules/.bin/nyc yarn 14.56s user 3.61s system 5% cpu 5:07.03 total
Not much of a difference (about 37 seconds, roughly 12% of the wall-clock time), but it's something to note as a possibility for improvement.
The messaging() tests time out under any kind of virtualization, but it seems like all other tests work fine. Maybe relevant for @mikehardy. I collected the Console logs from macOS while the test was seemingly hanging.
console-messaging-full.log console-messaging-errors-only.log
I don't know what firebase.messaging in the test is actually doing, but I do think there may be something in the Console logs that the developers could spot that helps us pinpoint for Apple what's wrong.
Interesting @NorseGaud - do you have a workflow run URL you can point me to where you extracted those logs? Or could you specify the execution environment? I'm going on a hunch that this was an Apple Silicon machine of some sort?
That is a difference (intel silicon mac vs apple silicon mac) that changes the messaging testing, as apple silicon macs with the latest simulators can actually generate APNS tokens and receive APNS messages, and we attempt to test that if it's a recent enough simulator on apple silicon. It should work of course, but - for the purposes of the testing here, we are an open source repository limited to running on the currently available intel silicon runners, so I don't think it's germane to the current focus here unfortunately
do you have a workflow run URL you can point me to where you extracted those logs
I don't :( I set up the project manually in a VM of several virtualization tools and got the same results across them all. The error is identical to the ones in the github runners.
Specs for the logs provided:
Understood about not testing on Arm, though the result is exactly the same on macOS 13-14 and Xcode 15.x, regardless of the architecture. It's just faster and easier for me to test on ARM right now :)
Unless there is something in these logs that indicates a problem with a service that VMs don't have or isn't working right, I do worry we won't be able to do much until we can describe in detail what Apple has to fix :(
I am unable to reproduce boot issues anymore with iOS 17.2. The simulators load just fine (but are still a bit heavy on usage) for my test apps and no longer hang.
Regarding certain functions, I don't see a difference though.
Good to know @NorseGaud but unfortunately does not look like that's available in hosted runners yet https://github.com/actions/runner-images/blob/main/images/macos/macos-13-Readme.md#xcode
Sorry @mikehardy , I was speaking about simulator booting issues. The issues with your tests persist AFAIK.
Description
Hi, I have been redirected from GitHub Support to describe our use case and perhaps influence future performance of your runner for macOS. We have a use case where we need to use Xcode 14.3.1 and Xcode 15.0 on the macOS 13 runner. We create simulators with iPhones, navigate to a URL and then dispose of the simulator. All of the above works pretty well locally, but it fails when executing in GitHub Actions. The behavior differs between Xcode versions: Xcode 14 times out on booting or navigating, whereas Xcode 15 fails on other functions (perhaps an effect of changes to Xcode itself) like binding launchd_sim. An example failure message I get would be:
In the ticket that originally redirected me here, Arthur says he was successful with adding a cache clean (~/Library/Caches/ and ~/Library/Developer/CoreSimulator/Caches/) and waiting 60-120 seconds. However, upon trying that I observed that it only works for a single simulator and subsequent simulators would fail (at least using a rinse and repeat approach).
Platforms affected
Runner images affected
Image version and build link
version: 20230611.2 workflow run: https://github.com/fingerprintjs/fingerprintjs-pro/actions/runs/5646633166/job/15295123030
Is it regression?
no
Expected behavior
Simulators work smoothly and boot / open urls without crashing.
Actual behavior
Attempts to boot simulators and navigate to a URL fail very frequently.
Repro steps
Use the following script in a macOS 13 runner workflow:
open /Applications/Xcode_14.3.1.app/Contents/Developer/Applications/Simulator.app/
phone1=$(/Applications/Xcode_14.3.1.app/Contents/Developer/usr/bin/simctl create iPhone-hzso25lt6h9 com.apple.CoreSimulator.SimDeviceType.iPhone-14 com.apple.CoreSimulator.SimRuntime.iOS-16-4)
echo "${phone1}"
boot1=$(/Applications/Xcode_14.3.1.app/Contents/Developer/usr/bin/simctl boot ${phone1})
echo "${boot1}"
phone2=$(/Applications/Xcode_14.3.1.app/Contents/Developer/usr/bin/simctl create iPhone-sios1839ti com.apple.CoreSimulator.SimDeviceType.iPhone-14 com.apple.CoreSimulator.SimRuntime.iOS-17-0)
echo "${phone2}"
boot2=$(/Applications/Xcode_14.3.1.app/Contents/Developer/usr/bin/simctl boot ${phone2})
rm -r ~/Library/Caches/*
rm -r ~/Library/Developer/CoreSimulator/Caches/*
sleep 120
nav1=$(/Applications/Xcode_14.3.1.app/Contents/Developer/usr/bin/simctl openurl booted 'https://google.com')
echo "${nav1}"
echo "${boot2}"
rm -r ~/Library/Caches/*
rm -r ~/Library/Developer/CoreSimulator/Caches/*
sleep 120
nav2=$(/Applications/Xcode_14.3.1.app/Contents/Developer/usr/bin/simctl openurl ${phone2} 'https://google.com')
echo "${nav2}"