Closed: AlexeyRaga closed this issue 7 months ago.
In fact not every single request fails. It works for a while (sometimes 10 seconds, sometimes 5 minutes) and then crashes with this error.
What I am doing is copying messages from one SQS queue to another; sometimes I manage to copy many, sometimes it fails almost immediately.
I'm having TLS errors as well:
```
TransportError (TlsExceptionHostPort (Terminated True "received fatal error: BadRecordMac" (Error_Protocol ("remote side fatal error",True,BadRecordMac))) "ec2.us-west-2.amazonaws.com" 443)
```
Same here, quite frequently with S3:
```
TransportError (TlsExceptionHostPort (Terminated True "received fatal error: BadRecordMac" (Error_Protocol ("remote side fatal error",True,BadRecordMac))) "s3-eu-west-1.amazonaws.com" 443)
```
For the record: I do not think both errors are necessarily related.
@AlexeyRaga: is https://github.com/haskell-works/sqs-resurrector the application you are running? I'll get it set up and see if I can reproduce.
@kim Both errors are coming from `hs-tls`; the first is reminiscent of a cipher error - I haven't seen the second before. Any suggestions on how to reproduce a minimal example? Are any particular types of request failing?
I am basically fetching objects from S3, massaging them and sticking them into DynamoDB. Since this is all a `conduit` pipeline, it may very well be that the time between successive requests to S3 exceeds 5 seconds -- which we have previously found to be the timeout after which S3 closes idle connections. This leads me to believe that this is just a different incarnation of a problem we have seen before: `http-client`'s `Manager` has a hardcoded idle limit of 30 seconds, so it may use a connection that is already closed by the remote side. The exception can just be caught on the application layer and the request retried, which will eject the bad connection from the pool and create a fresh one.
Admittedly rather brittle, but the best `http-client` could do is to allow users to configure the idle timeout. But that doesn't buy us anything if we use the same `Manager` for different AWS services, which likely have different timeouts.
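The catch-and-retry approach described above can be sketched with the `retry` package. This is a minimal sketch, not the thread's actual code; it assumes amazonka-1.x, where transport failures surface as the `TransportError` constructor of amazonka's `Error` type:

```haskell
import Control.Monad.Catch (Handler (..))
import Control.Retry (constantDelay, limitRetries, recovering)
import Network.AWS.Types (Error (..))

-- Retry an AWS action whenever the transport layer fails, e.g. because
-- http-client handed us a connection the remote side had already closed.
-- Retrying ejects the bad connection from the pool and opens a fresh one.
retryTransport :: IO a -> IO a
retryTransport action =
  recovering
    (constantDelay 100000 <> limitRetries 5)  -- 100 ms between attempts, 5 retries
    [\_ -> Handler $ \e -> pure (isTransport e)]
    (const action)
  where
    isTransport :: Error -> Bool
    isTransport (TransportError _) = True
    isTransport _                  = False
```

The handler predicate only retries transport-level failures; serialization and service errors are rethrown immediately, since retrying those would just repeat the same failure.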
This looks related to this http-client error: https://github.com/snoyberg/http-client/pull/226 (see also https://github.com/erikd-ambiata/test-warp-wai/issues/1); we had been seeing similar issues regularly on the corresponding versions.
@kim @markhibberd Were you able to solve or work around this issue? It has been a pain in the neck recently :(
@AlexeyRaga Our current approach is just pinning `connection == 0.2.5` and making sure to increase the default amazonka retry policy, which is a bit light - https://github.com/ambiata/mismi/blob/master/mismi-cli/main/s3.hs#L111 is an example of doing so for S3 (just swap the S3 service for SQS in your example). If you use Stackage you may be out of luck though; no idea if you will be able to work around it.
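Raising the retry policy can be sketched with amazonka's lenses; this is an assumption-laden sketch based on the mismi example linked above, using the amazonka-1.x names `configure`, `serviceRetry`, and `retryAttempts`:

```haskell
import Control.Lens (over, (.~))
import Network.AWS (Env, configure)
import Network.AWS.S3 (s3)
import Network.AWS.Types (retryAttempts, serviceRetry)

-- Bump the retry attempts for S3 from the default to 10 on a given Env.
-- Swap in the `sqs` service descriptor for SQS.
withMoreRetries :: Env -> Env
withMoreRetries = configure (over serviceRetry (retryAttempts .~ 10) s3)
```

Since `configure` is per-service, each AWS service used through the same `Env` needs its own override.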
@AlexeyRaga Same here: just retry
There appears to be a fix in the newest `tls` version which may address some of the above problems, but I suspect not all.
The `amazonka/amazonka.cabal` file in `develop` now requires `tls >= 1.3.9`, and the non-GHC8 stack configuration is in the process of being updated.
I'm actually not convinced this 'fix' is a good idea: if it doesn't correct all of the issues, it precludes the ability of a downstream user to pick `connection == 0.2.5`. Will consider; please share any thoughts.
Reverting. :disappointed: I'll leave it up to the downstream user to constrain to `tls >= 1.3.9`.
Marking as a "post 2.0" release, since we'll want to see how things behave after everything lands in `develop`. Note that `tls` is up to `1.5.x` now.
I'm going to close this off. `connection` is up to `0.3.1`; `tls` is up to `2.0.2` on Hackage and `1.8.0` on nixpkgs master. I don't think anyone will need/want/be able to pin to >=7-year-old versions of stuff with amazonka-2.x.
I wasn't able to reopen issue #269, so I am creating a new one. I am still having this issue with the newest `tls-1.3.8`: