diafygi / acme-tiny

A tiny script to issue and renew TLS certs from Let's Encrypt
MIT License

Allow retry of network requests #248

Closed bwachter closed 3 years ago

bwachter commented 4 years ago

Clownflare seems to be randomly dropping connections from some networks. The more domains that need to be verified in a request, the more likely the script is to fail. While the problem sometimes happens early on, I've mostly seen it when verifying the second or third domain. Excessive re-running of the script doesn't seem to have any impact - a request for 4 domains went through after a bit over 100 tries, without sleeps in between.

As a quick workaround I've added the following in _do_request:

    except SocketError as e:
        # connection dropped: wait 5 seconds, then repeat the same request
        time.sleep(5)
        return _do_request(url, data, err_msg, depth)

With this, a request for 14 names went through after 13 retries, unevenly spaced out.

Having something similar with a configurable number of retries, and maybe a default retry count of 3 (all my requests would've gone through with that), would help when dealing with an unreliable upstream.
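
A bounded version of that workaround could look roughly like the sketch below. It is only an illustration, not acme-tiny code: `with_retries` and its parameters (`retries`, `delay`, `retry_on`, `sleep`) are hypothetical names, and it assumes the dropped connection surfaces as an `OSError` subclass, which is the case for socket errors on Python 3.

```python
import time


def with_retries(func, retries=3, delay=5.0, retry_on=(OSError,), sleep=time.sleep):
    """Call func(); on a retry_on exception, wait and try again.

    Hypothetical helper: at most `retries` extra attempts are made,
    after which the last exception is re-raised unchanged.
    """
    attempt = 0
    while True:
        try:
            return func()
        except retry_on:
            if attempt >= retries:
                raise  # retries exhausted: fail hard, as without the wrapper
            attempt += 1
            sleep(delay)  # fixed pause between attempts


if __name__ == "__main__":
    # Simulated flaky request: fails twice, then succeeds.
    state = {"calls": 0}

    def flaky_request():
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionResetError("connection dropped")
        return "ok"

    print(with_retries(flaky_request, retries=3, delay=0))
```

Unlike the recursive workaround, this gives up after a fixed number of attempts, so a persistent outage still ends in a hard failure.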

karakays commented 4 years ago

:+1: for the issue. Recently I got the following HTTP error from the server, and my certificates expired without any notice.

    signed_crt = get_crt(args.account_key, args.csr, args.acme_dir, log=LOGGER, CA=args.ca, disable_check=args.disable_check, directory_url=args.directory_url, contact=args.contact)
  File "/etc/letsencrypt/acme-tiny/acme_tiny.py", line 164, in get_crt
    certificate_pem, _, _ = _do_request(order['certificate'], err_msg="Certificate download failed")
  File "/etc/letsencrypt/acme-tiny/acme_tiny.py", line 46, in _do_request
    raise ValueError("{0}:\nUrl: {1}\nData: {2}\nResponse Code: {3}\nResponse: {4}".format(err_msg, url, data, code, resp_data))
ValueError: Certificate download failed:
Url: https://acme-v02.api.letsencrypt.org/acme/cert/xxx
Data: None
Response Code: 502
Response: {'type': 'urn:acme:error:serverInternal', 'detail': 'The service is down for maintenance or had an internal error. Check https://letsencrypt.status.io/ for more details.'}

diafygi commented 3 years ago

I think I'd rather network interruptions raise a hard fail instead of blindly retrying multiple times, since significant retries can lead to banning/rate-limiting on networks that watch for spammy behavior.

bwachter commented 3 years ago

As I wrote in the initial comment, the problem gets worse the more domains you have in a request - you can typically get 3-4 domains verified after a few tries, but if you have more, the chance of having all of them verified successfully drops with each domain you add.

If I have 5 domains and no retry capability in the script, I need to run verification against all of them until all succeed. Assuming I need 20 tries to finish that (which in my experience is on the lower end when hitting this issue), and on average it drops out at the 3rd domain, I end up with 60+ verification requests.

Now if the script has retry support, it'll just retry for the failing domain, which - outside of very rare circumstances - will usually go through within 2-3 tries. So we have 60+ calls vs. fewer than 10 calls - having retry support in the script would significantly reduce the chance of getting banned or rate limited, and is far simpler than having to script the same logic in a wrapper around the script. Also, if you're hitting this problem with more than about 5 domains, retrying inside the script is the only way to get a request through.
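
The comparison above can be put into a rough model. The sketch below is an illustration under assumptions that are not from the thread: each request fails independently with the same probability `p`, there is exactly one request per domain, and a re-run starts again from the first domain. The function names are hypothetical.

```python
def expected_requests_rerun(n, p):
    """Expected total requests when any failure restarts the whole run.

    A run of n requests succeeds with probability q**n, where q = 1 - p,
    and makes (1 - q**n) / p requests on average before it stops.
    """
    if p == 0:
        return float(n)  # nothing ever fails, so a single run suffices
    q = 1.0 - p
    return (1.0 - q ** n) / (p * q ** n)


def expected_requests_retry(n, p):
    """Expected total requests when only the failed request is retried."""
    return n / (1.0 - p)


if __name__ == "__main__":
    # Illustrative numbers only: 5 domains, half of all requests dropped.
    print(expected_requests_rerun(5, 0.5))  # 62.0
    print(expected_requests_retry(5, 0.5))  # 10.0
```

Under this model the cost of re-running grows exponentially with the number of domains, while in-script retries grow linearly, which matches the observation that larger requests only get through with retries inside the script.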