amphp / dns

Async DNS resolution for PHP based on Amp.
https://amphp.org/dns
MIT License
157 stars 32 forks source link

DNS resolution failed -- possibly cache bug #16

Closed thomaswilde closed 8 years ago

thomaswilde commented 9 years ago

We get an Amp\Dns\ResolutionException with the message

"DNS resolution failed: www.google.com"

These are the last couple of lines from our stack-trace:

#0 ...\vendor\amphp\dns\lib\Client.php(298): Amp\Dns\Client->completePendingLookup(0, NULL, 3)
#1 ...\vendor\amphp\dns\lib\Client.php(220): Amp\Dns\Client->completeRequest(Array, NULL, 3)
#2 ...\vendor\amphp\amp\lib\NativeReactor.php(184): Amp\Dns\Client->Amp\Dns\{closure}(Object(Amp\NativeReactor), 2)
#3 ...\vendor\amphp\amp\lib\NativeReactor.php(123): Amp\NativeReactor->executeAlarms()

It seems that the problem originates in lib/Client.php on line 311 where $cacheHit is true but $addr is NULL. Google is reachable by DNS from the environment in question.

DaveRandom commented 9 years ago

Which cache implementation are you using?

thomaswilde commented 9 years ago

The default Amp\Dns\Cache\MemoryCache

DaveRandom commented 9 years ago

Can you reliably reproduce this or was it a one-off?

thomaswilde commented 9 years ago

Neither; it occurs randomly over time on a live server and monitoring alerts me maybe once an hour.

DaveRandom commented 9 years ago

When this happens, does it cause the process to bail and restart, or does it seem to resolve itself and continue?

I'm wondering if there is an edge case when the TTL of the cached record expires.

thomaswilde commented 9 years ago

The exception is only caught by the topmost (catch-all) error handler that does the reporting and serves a "pretty" error page to the client. So - in short - the process is unaffected.

DaveRandom commented 9 years ago

Can you try applying this patch please?

It should either fix the problem or provide a more useful stack trace.

Danack commented 9 years ago

@thomaswilde How many requests per second are you making and how many scripts do you have running in parallel?

The error is starting from here where the DNS lookup has timed out. I don't see that it's anything to do with the cache.

If you're just using the built in memory cache, then DNS requests are not cached between different runs of a script, and so it is likely that you'll see some fail at some point.

You should have slightly better performance and reliability if you switch to using one of either apc or redis caches for DNS lookups.

thomaswilde commented 9 years ago

Thank you, I will apply the patch and report back as soon as I get an alert.

@Danack I'd say about 1000 RPS peak. Around 8 FastCGI processes. Most OSs provide a DNS cache themselves. Maybe deactivating the (code's) cache entirely would be a sensible solution for us.

bwoebi commented 9 years ago

@thomaswilde I don't think this would be a sensible alternative. amphp/dns directly talks to the DNS server. This doesn't cross the OS DNS cache and so also everything resolved here also won't be cached by the OS. So if you disable cache here you'll just end up requesting the same resource again and again from the DNS server.

rdlowrey commented 9 years ago

Here's a thought ...

DNS does use UDP which is generally referred to as an "unreliable" protocol. There will be times, especially under high load, when UDP packets get lost on the interwebs and this error arises organically through no fault of the actual source code.

It seems sensible to me that we should implement a simple exponential backoff algorithm and shorten the default timeout for DNS requests. In that way we can head off problems like this even in environments where a shared cache like APC is unavailable. A new MAX_DNS_RETRIES option (or some such thing) could be added to govern this behavior.

thomaswilde commented 9 years ago

Sorry for the delay, I've been unable to deploy the patch to a live environment today because business. Thank you for your patience.

thomaswilde commented 9 years ago

Here's the top of a stack trace produced by our production environment with the patch applied

#0 ...\vendor\amphp\dns\lib\Client.php(298): Amp\Dns\Client->completePendingLookup(0, NULL, 3)
#1 ...\vendor\amphp\dns\lib\Client.php(220): Amp\Dns\Client->completeRequest(Array, NULL, 3)
#2 ...\vendor\amphp\amp\lib\NativeReactor.php(184): Amp\Dns\Client->Amp\Dns\{closure}(Object(Amp\NativeReactor), 10)
#3 ...\vendor\amphp\amp\lib\NativeReactor.php(123): Amp\NativeReactor->executeAlarms()
#4 ...\vendor\amphp\amp\lib\Future.php(90): Amp\NativeReactor->tick()
#5 ...\GP\Services\FooBarService.php(54): Amp\Future->wait()
#6 ...\GP\Services\FooBarService.php(37): GP\Services\FooBarService->requestRatings()

Doesn't seem more informative than the first one :-/

I must mention that we can't run APC for technical reasons and we don't use Redis. The only shared cache we have available is memcached.

PS: FooBarService is anonymized, it's called something else in reality.

DaveRandom commented 9 years ago

You are getting timeouts from the DNS server - i.e. the server is not responding to your request in a timely fashion (default 2 seconds). This is probably because you are hammering the server and it's ignoring your request because you have exceeded some threshold it has internally, either that or there is a problem with the server.

If it's the former (likely) then a shared cache would help. I'd imagine a memcached-based cache should be easy to create, I'll look into it shortly.

DaveRandom commented 9 years ago

@thomaswilde can you try out https://github.com/amphp/dns/pull/19?

Danack commented 9 years ago

@thomaswilde can you elaborate on "we can't run APC for technical reasons" ?

I actually do have reservation about using it by default as I have heard rumours that it is not as rock solid as you would hope, but would like to hear more concrete issues.

thomaswilde commented 9 years ago

@DaveRandom will do. @Danack we use the (Fast-)CGI SAPI in a configuration that does not allow for APC.

thomaswilde commented 9 years ago

Okay -- problem solved -- here's the resolution:

We did already cache the response to the request that we were executing with Artax through Memcached, a cache class external to the one kindly provided by @DaveRandom so I investigated further and it turns out that the memcached server was at fault -- it's a buggy port from the original linux version and it stops answering after about 8 hours.

Positive: We didn't even notice; implies the performance of our application is good enough even without a cache. Also; none of our components depend on memcached being up -- business continued undisrupted.

Negative: We didn't even notice; new measures will be taken so that when memcached stops answering we log and restart it automatically. The only one part of our application that depends on a cache is Artax/Dns.

Therefore -- and as a constructive suggestion, not a critique -- I'd like to open an issue to make this Dns library cache independent. Please advise.

DaveRandom commented 9 years ago

@thomaswilde Unsure whether that's a good idea or not, tbh.

The caching component can be treated as standalone, it is well encapsulated, and as such the idea may have some merit. However, none of the available implementations are currently non-blocking, I'm not sure how possible it would be to make them non-blocking without writing protocol implementations for the respective back-ends, so all that really leaves us with is the interface.

php-fig already have a proposed cache interface (https://github.com/php-fig/fig-standards/blob/master/proposed/cache.md), so I'm not sure if this would add any value or not.

If we have proper non-blocking support for caching at the back end, then it may have some value. In order to do this, there would need to be a guarantee that the get() callback will be invoked asynchronously, and this is not the case with the current implementations (it would not break amp/dns if they did behave this way, though).

I'm tempted to just go ahead and separate this out into a separate component as it would take pretty much zero effort to make it work independently as-is. The lack of true non-blocking support is a bit of a sticking point for me though.

Thoughts, @Danack @rdlowrey @bwoebi?

kelunik commented 9 years ago

A non-blocking amphp/cache is now a thing. Using the php-fig interface doesn't make any sense for us, because have to use other method signatures anyway (everything returns a promise).

The only shared driver currently implemented is Redis.

kelunik commented 8 years ago

We have a TCP retry now if it fails with UDP. Closing for now.

pensiero commented 8 years ago

Same problem here, someone could explain me how can I cache the DNS resolution? I'm using Redis, and I saw that there is amphp/redis, but I don't understand how to use it with DNS.

Thanks

kelunik commented 8 years ago

DNS records are currently automatically cached in an ArrayCache according to their TTL. There's currently no way to specify another cache method.