flexera-public / right_aws

RightScale Amazon Web Services Ruby Gems
MIT License

Hung Connections leading to RequestTimeTooSkewed #24

Open loe opened 14 years ago

loe commented 14 years ago

I am trying to use right_aws + right_http_connection to keep one persistent connection per process and reduce the overhead of dealing with S3.

I've got a module in lib/onehub.rb that keeps the connection and bucket objects. https://gist.github.com/7ef90619fbd331479c6a

Then from my models I'll call something like Onehub.bucket.put to upload the file in a background task, the idea being that this should be a persistent connection since these background workers are simply uploaders.
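For reference, the module is roughly like this (simplified sketch; the bucket name and the credential lookup here are placeholders, the real version is in the gist):

```ruby
require 'right_aws'

# Simplified sketch of lib/onehub.rb: memoize one S3 connection and one
# bucket per process so the background workers reuse a single persistent
# HTTP connection instead of reconnecting for every upload.
module Onehub
  def self.s3
    @s3 ||= RightAws::S3.new(ENV['AWS_ACCESS_KEY_ID'], ENV['AWS_SECRET_ACCESS_KEY'])
  end

  def self.bucket
    @bucket ||= s3.bucket('onehub-uploads', true) # create if missing
  end
end

# In a background worker:
#   Onehub.bucket.put(key, File.read(path))
```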

What I get quite frequently is 'hung' sockets. The socket doesn't get written to for > 15 minutes, but eventually it recovers (maybe related to a thread not getting scheduled?). The problem is that the request was signed before the hang, so by the time it goes out we've left the 15-minute grace period that S3 will tolerate, and I get the exception: RightAws::AwsError: RequestTimeTooSkewed: The difference between the request time and the current time is too large.

Here is a backtrace: https://gist.github.com/26ddd66d2cc5de223c9c

Is there a better way to handle a per-process persistent connection? Is this some subtle threading issue where the thread that writes isn't being scheduled by the interpreter? I am not using this in a multi-threaded environment. Is this because S3 hangs up after 60 seconds but the library expects the connection to still be open?

We diagnosed the issue by instrumenting the PUT operations and dumping to a log file, but could never create a case that reliably reproduced it.

yodal commented 14 years ago

Have been suffering from the same issue here.

loe commented 14 years ago

The hung sockets? I keep thinking this is related to Green Threading but I have no way to reproduce it!

yodal commented 14 years ago

Nor me, yet. We are getting a 'RequestTimeTooSkewed' error around once a day. Have you found a workaround?

cdunn commented 14 years ago

I'm in the same boat. Anyone have any solutions?

basex commented 13 years ago

Has anyone found a solution to this? My server's system time is kept in sync and I still get this error many times a day.

loe commented 13 years ago

Switched to using aws-s3, no problems. There is something in the threading code that causes hangs, and when the request is retried the headers are not regenerated.

basex commented 13 years ago

I'm using a folder structure + EU buckets, and neither is well supported by aws-s3 =/

konstantin-dzreev commented 13 years ago

Hi,

Is the RequestTimeTooSkewed you are discussing an HTTP 403 RequestTimeTooSkewed error?

If yes, can you just rescue it and retry the upload attempt?

If no, can you please describe the error (HTTP error code and its message)? (s3interface.last_response.code and s3interface.last_response.body)
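Something along these lines (untested sketch; s3_interface, bucket, key and data are whatever you already use in your code):

```ruby
attempts = 0
begin
  s3_interface.put(bucket, key, data)
rescue RightAws::AwsError => e
  raise unless e.message =~ /RequestTimeTooSkewed/
  attempts += 1
  # The retried request is signed again with fresh headers, so the skew goes away.
  retry if attempts < 3
  raise
end
```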

Thanks

loe commented 13 years ago

RequestTimeTooSkewed is the result of a 403, yes, but it happens because RightAws gets into a state that prevents any data from being written to the socket.

At some point it just hangs, and when it eventually resumes (it takes like 15 minutes!) it retries with the same headers as the original request, which is outside Amazon's acceptable time window, throwing the 403. Rescue -> Retry works, but why does it hang for 15 minutes in the first place?

vivienschilis commented 13 years ago

I have the same problem. My platform uploads several GB per day, and I hit the error a dozen times per day (using the threaded option).

konstantin-dzreev commented 13 years ago

Hi,

right_aws does not support multi-threading (and we don't have that option any more). If you need multiple threads then you must have a RightAws::S3 or RightAws::S3Interface instance per thread. Once created, a RightAws::S3 instance must only be used in the thread it was created in.

Please make sure you do this and that you do not access one RightAws::S3 instance from different threads.
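For example, something like this (sketch; the bucket name and the way the work is split into batches are just placeholders):

```ruby
require 'right_aws'

# One RightAws::S3Interface per thread; never share an instance across threads.
threads = batches.map do |batch|       # batches: your own array of file lists
  Thread.new(batch) do |files|
    s3 = RightAws::S3Interface.new(ENV['AWS_ACCESS_KEY_ID'],
                                   ENV['AWS_SECRET_ACCESS_KEY'])
    files.each { |path| s3.put('my-bucket', File.basename(path), File.read(path)) }
  end
end
threads.each(&:join)
```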

vivienschilis commented 13 years ago

I actually don't use threads and keep getting this error. Those 15 minutes seem to correspond to the 900000 milliseconds returned by AWS:

<MaxAllowedSkewMilliseconds>900000</MaxAllowedSkewMilliseconds>

How can I log what is going on during those 15 minutes?

conradchu commented 13 years ago

Yup, finally got the same error today. Very annoying.

konstantin-dzreev commented 13 years ago

Hi All

As far as I can see, some Python folks run into this error with Python's boto library as well: http://www.saltycrane.com/blog/2010/04/using-python-timeout-decorator-uploading-s3

I'm not sure what we can fix there because we use Ruby's 'net/https' library. Are you 101% sure there is no time sync issue between your boxes and Amazon?

In any case, any help with debugging this is very much appreciated!
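If anyone who can reproduce it passes a verbose logger to the interface, that should at least show where it stalls. A sketch (assuming you construct the interface yourself):

```ruby
require 'logger'
require 'right_aws'

logger = Logger.new('/tmp/right_aws_debug.log')
logger.level = Logger::DEBUG

# right_aws writes its requests, retries and errors to the :logger you pass in.
s3 = RightAws::S3Interface.new(ENV['AWS_ACCESS_KEY_ID'],
                               ENV['AWS_SECRET_ACCESS_KEY'],
                               :logger => logger)
```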

Konstantin

vivienschilis commented 13 years ago

I think it's an S3 problem when you don't specify the endpoint. The problem is that RightAws waits 15 minutes before noticing the request failed (maybe due to right_http_connection retrying with the same headers, without closing and reopening the socket?).

I have switched to Fog and I don't have any problems.

conradchu commented 13 years ago

Actually, I got it working again. I realized the system time of my Xen instances was drifting from the actual time and they didn't have ntpd running.

bradly commented 13 years ago

We are getting this error in our app for some calls that are made from our Delayed Job queue.

konstantin-dzreev commented 13 years ago

Please make sure that the box that performs the requests does not have system time issues (ntpd, etc.).
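A quick way to check, independent of ntpd, is to compare your clock against the Date header S3 returns (rough sketch):

```ruby
require 'net/http'
require 'time'

# Compare the local clock with the Date header from S3. Anything approaching
# 900 seconds of skew will trigger RequestTimeTooSkewed.
response = Net::HTTP.start('s3.amazonaws.com', 80) { |http| http.head('/') }
amazon_time = Time.httpdate(response['Date'])
puts "Clock skew vs S3: #{(Time.now - amazon_time).abs.round} seconds"
```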

bradly commented 13 years ago

We do not have any issues with time or ntpd. I think it may be due to Delayed Job running as a daemon, but I'm not sure.

conradchu commented 13 years ago

@bradly, you may want to check how the time is being evaluated by delayed_job. Since delayed_job uses YAML to serialize the AR object, there is an outstanding YAML bug that we've found that might affect your time.

https://rails.lighthouseapp.com/projects/8994/tickets/340-yaml-activerecord-serialize-and-date-formats-problem

ericmason commented 11 years ago

I'm having the same issue. Has anyone found a workaround?