fog / fog-aws

Module for the 'fog' gem to support Amazon Web Services http://aws.amazon.com/
MIT License
296 stars 353 forks source link

AWS S3 file create HTTP 400 errors #693

Open wpeterson opened 9 months ago

wpeterson commented 9 months ago

We recently upgraded a Rails app from 3.20.0 to 3.21.0 and began seeing all of our writes/file creation calls to AWS S3 failing with a 400 error from S3s service calls:

Example exception raised from excon client layer:

Excon::Error::BadRequest Expected(200) <=> Actual(400 Bad Request)
excon.error.response
  :body              => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>IncompleteBody</Code><Message>You did not provide the number of bytes specified by the Content-Length HTTP header</Message><NumberBytesExpected>228</NumberBytesExpected><NumberBytesProvided>0</NumberBytesProvided><RequestId>3SG15S95XAWWA2ZA</RequestId><HostId>g0MdNHKuphyF1e5ee7zt0g/C+4RBv4eKguXKul5TPkkK3esrFfzrEgiu63uDyIgSgl9f0FhJNpo=</HostId></Error>"

I don't see any changes in fog-aws or excon that appears related, but confirmed upgrading to fog-aws 3.21.0 and excon 0.104.0 causes this issue for us. Apologies for not having a simpler test case/reproduction, but thought I would log an issue in case other folks encounter this problem.

geemus commented 9 months ago

@wpeterson Thanks for the report. Are you using fog directly or through an intermediary like carrierwave? Could you share an example of how you are calling it in the vicinity of one of the errors?

A couple of guesses:

  1. We changed it to no longer send back a couple of the AWS headers, potentially this could mess with the length of things.
  2. We changed it to no longer retry client errors, but I would mostly expect what you describe to either work or not (and that retries wouldn't make a difference).

I'm not sure on either though, and could probably benefit from a bit more detail on your usage if you can share it.

wpeterson commented 8 months ago

@geemus Thank you for looking/thinking further about this. We are using both fog-aws and carrierwave, but the usage which produces this error appears to be a direct invocation of Fog via Fog::Storage#bucket / bucket.files.create invocation:

    bucket.files.create(
      key: key,
      body: file,
      multipart_chunk_size: 5242880
    )

I've also included a redacted version of our Gemfile.lock if that's helpful (without the fog-aws upgrade):

https://gist.github.com/wpeterson/d8785654dc8c6c1a9061eb9f9690498f

geemus commented 8 months ago

Thanks for the extra details, just a little slow getting back to it as I'm on parental leave right now with a new baby so things are a little chaotic.

A couple of follow up questions:

  1. Are the failing files bigger than the specified multipart_chunk_size? (I'm pretty sure this must be the case, but seems best to be explicit and certain).
  2. Can you tell from the error messages which underlying call is failing? ie I'm guessing this is using multipart uploading, if so is it failing on initiating the upload, uploading a part, finalizing the parts, or something else altogether.

The only difference between 3.20.0 and 3.21.0 should be https://github.com/fog/fog-aws/pull/691. On the plus side it's a fairly constrained change, on the less plus side, it's not immediately apparent to me how/why this would cause the problem, but hopefully since it's narrow we'll be able to pin it down without too much trouble.

wpeterson commented 8 months ago

@geemus Thanks for looking into this and for all the work that goes into open source. This isn't urgent and you shouldn't feel the need to look at this if you've got a new baby at home.

Here's a slightly redacted stack trace showing the 400 from the AWS S3 layer. You can see a request from our controller calling a helper class, which is calling a Fog layer directly:

Excon::Error::BadRequest: Expected(200) <=> Actual(400 Bad Request)
excon.error.response
  :body              => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>IncompleteBody</Code><Message>You did not provide the number of bytes specified by the Content-Length HTTP header</Message><NumberBytesExpected>228</NumberBytesExpected><NumberBytesProvided>0</NumberBytesProvided><RequestId>3SG15S95XAWWA2ZA</RequestId><HostId>g0MdNHKuphyF1e5ee7zt0g/C+4RBv4eKguXKul5TPkkK3esrFfzrEgiu63uDyIgSgl9f0FhJNpo=</HostId></Error>"
  :cookies           => [
  ]
  :headers           => {
    "Connection"       => "close"
    "Content-Type"     => "application/xml"
    "Date"             => "Mon, 02 Oct 2023 20:42:25 GMT"
    "Server"           => "AmazonS3"
    "x-amz-id-2"       => "g0MdNHKuphyF1e5ee7zt0g/C+4RBv4eKguXKul5TPkkK3esrFfzrEgiu63uDyIgSgl9f0FhJNpo="
    "x-amz-request-id" => "3SG15S95XAWWA2ZA"
  }
  :host              => "s3.amazonaws.com"
  :local_address     => "REDACTED"
  :local_port        => 58182
  :method            => "PUT"
  :omit_default_port => false
  :path              => "/admin-uploads.bookbub.com/bulk_notable_listing_requests/739/Sourcebooks%20SL%20Nov%20add%20-%20Sheet1.csv"
  :port              => 443
  :query             => nil
  :reason_phrase     => "Bad Request"
  :remote_ip         => "REDACTED"
  :scheme            => "https"
  :status            => 400
  :status_line       => "HTTP/1.1 400 Bad Request\r\n"

response_call(/srv/app/lello/vendor/bundle/ruby/3.1.0/gems/excon-0.104.0/lib/excon/middlewares/expects.rb:13)

response_call(/srv/app/lello/vendor/bundle/ruby/3.1.0/gems/ddtrace-1.11.0/lib/datadog/tracing/contrib/excon/middleware.rb:47)

response_call(/srv/app/lello/vendor/bundle/ruby/3.1.0/gems/excon-0.104.0/lib/excon/middlewares/response_parser.rb:12)

response(/srv/app/lello/vendor/bundle/ruby/3.1.0/gems/excon-0.104.0/lib/excon/connection.rb:460)

request(/srv/app/lello/vendor/bundle/ruby/3.1.0/gems/excon-0.104.0/lib/excon/connection.rb:291)

request(/srv/app/lello/vendor/bundle/ruby/3.1.0/gems/fog-core-2.3.0/lib/fog/core/connection.rb:79)

request(/srv/app/lello/vendor/bundle/ruby/3.1.0/gems/fog-xml-0.1.4/lib/fog/xml/connection.rb:9)

_request(/srv/app/lello/vendor/bundle/ruby/3.1.0/gems/fog-aws-3.21.0/lib/fog/aws/storage.rb:686)

request(/srv/app/lello/vendor/bundle/ruby/3.1.0/gems/fog-aws-3.21.0/lib/fog/aws/storage.rb:681)

put_object(/srv/app/lello/vendor/bundle/ruby/3.1.0/gems/fog-aws-3.21.0/lib/fog/aws/requests/storage/put_object.rb:48)

save(/srv/app/lello/vendor/bundle/ruby/3.1.0/gems/fog-aws-3.21.0/lib/fog/aws/models/storage/file.rb:283)

create(/srv/app/lello/vendor/bundle/ruby/3.1.0/gems/fog-core-2.3.0/lib/fog/core/collection.rb:50)

put_file(/srv/app/lello/lib/fog_storage_handler.rb:21)

upload_csv_to_s3(/srv/app/lello/app/services/partners/bulk_uploads/notable_listing_request_creator.rb:137)

save(/srv/app/lello/app/services/partners/bulk_uploads/notable_listing_request_creator.rb:15)

create(/srv/app/lello/app/controllers/partners/radmin/bulk_notable_listing_requests_controller.rb:93)
geemus commented 7 months ago

@wpeterson Hey, finally trying to circle back to this now that I'm back to work. I strongly suspect that if you lock to that earlier fog-aws release it should avoid the issue (I wouldn't think the excon changes would be impactful for this in particular, but definitely let me know if that seems not to be the case).

geemus commented 7 months ago

@wpeterson Thanks for the stack-trace, that is helpful. For instance, we can see that it goes to put_object, which is the simpler non-multipart version (and should indicate that the file is less than the chunk size).

geemus commented 7 months ago

A couple of follow up questions to help narrow this down when you get the chance:

  1. If you leave the fog-aws version, but bump excon, what happens? (trying to further narrow down which changes might be in play).
  2. Relatedly, if it is the fog change, does adding { retry_limit: 5, retry_interval: 1 } to connection options fix it?

Sorry I don't have a ready answer, it's kind of a weird case and definitely something that we only see through a lot of layers of indirection, so just trying to narrow things down a bit more. Thanks.

github-actions[bot] commented 5 months ago

This issue has been marked inactive and will be closed if no further activity occurs.