ExpoKit builds constantly failing due to "getaddrinfo ENOTFOUND d1wp6m56sqw74a.cloudfront.net"

keith-kurak commented 5 years ago

🐛 Bug Report

Environment

    System:
      OS: macOS 10.14.6
      Shell: 3.2.57 - /bin/bash
    Binaries:
      Node: 10.14.2 - ~/.nvm/versions/node/v10.14.2/bin/node
      Yarn: 1.16.0 - /usr/local/bin/yarn
      npm: 6.4.1 - ~/.nvm/versions/node/v10.14.2/bin/npm
      Watchman: 4.9.0 - /usr/local/bin/watchman
    IDEs:
      Android Studio: 3.4 AI-183.6156.11.34.5692245
      Xcode: 10.3/10G8 - /usr/bin/xcodebuild
    npmPackages:
      expo: ^34.0.1 => 34.0.2
      react: 16.8.3 => 16.8.3
      react-native: https://github.com/expo/react-native/archive/sdk-34.0.0.tar.gz => 0.59.8
      react-navigation: ^3.11.1 => 3.11.1
    npmGlobalPackages:
      expo-cli: 3.0.6

Target: iOS, Android (ExpoKit)

Steps to Reproduce

Regularly build ExpoKit iOS and Android binaries
Wait for this error to inevitably occur

Expected Behavior

All else equal, the build is successful

Actual Behavior

Maybe one every five builds, it will fail on :app:bundleReleaseExpoAssets or when running the equivalent build script step in iOS.

The error will be something like:

Error: getaddrinfo ENOTFOUND d1wp6m56sqw74a.cloudfront.net d1wp6m56sqw74a.cloudfront.net:443
    at errnoException (dns.js:55:10)
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:97:26)
Command /bin/sh failed with exit code 1

But I've also gotten

connect ETIMEDOUT

And a few times:

> Task :app:bundleReleaseExpoAssets
Opening `/dev/tty` failed (6): Device not configured
[09:26:26] ESOCKETTIMEDOUT

The last error was on an admittedly slow hotspot connection, but the rest happen regularly on solid wifi, even when I can maintain a 99.9% successful ping rate to Cloudfront.

Reproducible Demo

It's difficult to say with 100%. certainty what causes this. Certainly, the demo needs to be an ExpoKit project with bundled assets. More bundled assets may make the issue more likely (my project has around ~300, most of which are thumbnail images, but some are larger). I think the issue became more frequent when I added about 30-50 screen-sized images to the mix.

In terms of network and computer configuration, I don't think there is any specific marker of the issue being more likely. My computer has no network proxies or VPN enabled (it's a personal laptop, so no corporate network stuff at all). I've experienced the issue on AT&T Uverse and Spectrum cable internet at home, on public and home wifi at locations in Ohio, Indiana, Michigan, and Pennsylvania. I've talked to another user experiencing the issue in the UK. The issue persisted after I reformatted my computer entirely. This issue has occurred for me on Expo SDK's 29-34.

This issue is not correlated with poor network performance. I've maintained 99.5%+ ping rates between here and d1wp6m56sqw74a.cloudfront.net and Google.com while the issue has occurred. At the time the issue occurs, I've seen no other network issues with my computer or any other device on the network.

The issue may happen three times in a row and not happen for ten more builds after that. I build about 35 apps in ExpoKit (we white label), and I probably experience this issue on 5-10% of my builds.

Discussion

I'm currently working around this issue by using a custom node.js script that launches expo build via shelljs, reading sdrerr for one of the aforementioned errors, and restarting my build if one of those errors occurs. I don't doubt that there are no issues with the cloud provider (those things are pretty solid). It does, however, seem like the bundling assets step is awfully sensitive to anything less than the most ideal possible network conditions. Having some sort of retry for this process could save a lot of build time.

This is the same/related to the following issues: https://github.com/expo/expo/issues/3906 (no issue template) https://github.com/expo/expo/issues/1856 (closed by the original submitter without comment)

ide commented 5 years ago

It sounds like we may need some retry logic. The CDN is just plain CloudFront and one of the most reliable servers worldwide so I agree that’s probably not the problem.

vpontis commented 5 years ago

@ide I figured out how to fix this! (This issue has been haunting my ExpoKit project and it makes building new apps really, really difficult.)

The problem is that on this line of code you fire off a ton of requests that all hit the same URL.

At some level (router, ISP, other URL, etc) rate limiting happens so one of your requests is bound to fail.

Instead, you change the .map to a for ... loop to upload each asset in sequence.

I tested this by manually patching my expo-cli code to:

  for (const batch of batches) {
    for (const asset of batch) {
      const extensionIndex = asset.lastIndexOf('.');
      const prefixLength = 'asset_'.length;
      const hash = extensionIndex >= 0 ? asset.substring(prefixLength, extensionIndex) : asset.substring(prefixLength);
      await (0, _ExponentTools().saveUrlToPathAsync)(urlResolver(hash), _path().default.join(dest, asset));
    }
  }

This makes the build slower since it's in sequence rather than in parallel. But at least it builds!

keith-kurak commented 5 years ago

@vpontis Is the file you're patching in node_modules/@expo/xdl? Seems like expo bundle-assets in the expo-cli package is ultimately pointing there? I may try something similar myself.

To echo what you noted, the request maintainer basically said, over a long enough time making enough requests quick enough in succession, this sort of error is inevitable (https://github.com/request/request/issues/849#issuecomment-39869397). Given that, even though serializing the requests cuts down on some of the reasons it might fail, it doesn't cover all of them. But, looking at that example you posted, it seems like it might be pretty straightforward to wrap saveUrlToPathAsync() in something like node-promise-retry (https://github.com/IndigoUnited/node-promise-retry), maintaining the performance while making the error much less likely.

I wonder if OTA updates for Bare workflow would end up using completely different logic, eliminating this issue (thus, if it's on the horizon, working around this by restarting the entire build is an OK workaround at the moment IMO). This step in the process is downloading assets that are already stored in the project (because they were uploaded during the publish step). Ideally, the build could be done completely offline by reusing these same files.

vpontis commented 5 years ago

@vpontis Is the file you're patching in node_modules/@expo/xdl? Seems like expo bundle-assets in the expo-cli package is ultimately pointing there? I may try something similar myself.

Yep.

Given that, even though serializing the requests cuts down on some of the reasons it might fail, it doesn't cover all of them.

For me it goes from something that reliably fails to something that reliably works. I'll leave it to the someone with more knowledge to decide what the best fix is.

I'm just really happy that I can build apps now! And hopefully this helps other people.

ide commented 5 years ago

@fson could you look into this? Specifically, @vpontis has a very helpful diagnosis here: https://github.com/expo/expo-cli/issues/1022#issuecomment-534045935. We probably just need to add something like https://www.npmjs.com/package/p-limit.

vpontis commented 5 years ago

@fson awesome thank you! <3

expo / expo-cli