com-lihaoyi / requests-scala

A Scala port of the popular Python Requests HTTP client: flexible, intuitive, and straightforward to use.
Other
730 stars 86 forks source link

fix: content type header not present in multipart item #154

Closed Andrapyre closed 8 months ago

Andrapyre commented 8 months ago

Background

Content-Type headers are not present in multipart item file uploads. This can be difficult to see quickly, as some servers (like httpbin.org) automatically parse files to the correct content type, so that there is no noticeable effect. Some servers, however, require the content type to be present, which can lead to unexpected and often cryptic errors.

Reproduce

  1. Create a Scala project with the "com.lihaoyi" %% "requests" % "0.8.0" dependency and with a file containing the code below:
import requests.{MultiItem, MultiPart, Session}

import java.io.File

object MultipartTrial {
  def uploadFile(): Unit = {
    val session: Session = requests.Session(
      readTimeout = 5000,
      connectTimeout = 5000,
      maxRedirects = 0,
    )

    val path = getClass.getResource("/readme.zip").getPath
    val file = new File(path)

    val fileKey = "zipped"

    session.post(
      "http://localhost:3000/file",
      data = MultiPart(
        MultiItem(
          fileKey,
          file,
          file.getName
        )
      )
    )
  }
}
  1. Compress a readme file (or any text file) to a readme.zip file and add this file to your project resources.
  2. Start a webserver (node js fastify with the fastify multipart plugin is a great choice) on your local machine, with a port open at localhost:3000. Make sure that this webserver is set up to receive multipart upload post requests at the /file path.
  3. Download Wireshark and set it up to capture traffic on the "Loopback: lo" interface.
  4. Run MultipartTrial.upload() and inspect the http packets. You should find a packet containing something like the following, without any content type header:
    ...
    0110   48 6f 73 74 3a 20 6c 6f 63 61 6c 68 6f 73 74 3a   Host: localhost:
    0120   33 30 30 30 0d 0a 41 63 63 65 70 74 3a 20 74 65   3000..Accept: te
    0130   78 74 2f 68 74 6d 6c 2c 20 69 6d 61 67 65 2f 67   xt/html, image/g
    0140   69 66 2c 20 69 6d 61 67 65 2f 6a 70 65 67 2c 20   if, image/jpeg, 
    0150   2a 3b 20 71 3d 2e 32 2c 20 2a 2f 2a 3b 20 71 3d   *; q=.2, */*; q=
    0160   2e 32 0d 0a 43 6f 6e 6e 65 63 74 69 6f 6e 3a 20   .2..Connection: 
    0170   6b 65 65 70 2d 61 6c 69 76 65 0d 0a 43 6f 6e 74   keep-alive..Cont
    0180   65 6e 74 2d 4c 65 6e 67 74 68 3a 20 31 30 32 31   ent-Length: 1021
    0190   0d 0a 0d 0a 2d 2d 65 39 63 62 35 63 34 37 2d 38   ....--e9cb5c47-8
    01a0   34 34 35 2d 34 34 63 66 2d 62 65 39 36 2d 30 37   445-44cf-be96-07
    01b0   64 65 63 39 65 64 35 61 30 30 0d 0a 43 6f 6e 74   dec9ed5a00..Cont
    01c0   65 6e 74 2d 44 69 73 70 6f 73 69 74 69 6f 6e 3a   ent-Disposition:
    01d0   20 66 6f 72 6d 2d 64 61 74 61 3b 20 6e 61 6d 65    form-data; name
    01e0   3d 22 62 75 6e 64 6c 65 22 3b 20 66 69 6c 65 6e   ="bundle"; filen
    01f0   61 6d 65 3d 22 72 65 61 64 6d 65 2e 7a 69 70 22   ame="readme.zip"
    ...

Description

This PR adds the correct Content-Type headers to all multipart item file uploads. The result is that the previous request, illustrated above, will produce something similar to the following, this time with the correct content type header:

...
0110   48 6f 73 74 3a 20 6c 6f 63 61 6c 68 6f 73 74 3a   Host: localhost:
0120   33 30 30 30 0d 0a 41 63 63 65 70 74 3a 20 74 65   3000..Accept: te
0130   78 74 2f 68 74 6d 6c 2c 20 69 6d 61 67 65 2f 67   xt/html, image/g
0140   69 66 2c 20 69 6d 61 67 65 2f 6a 70 65 67 2c 20   if, image/jpeg, 
0150   2a 3b 20 71 3d 2e 32 2c 20 2a 2f 2a 3b 20 71 3d   *; q=.2, */*; q=
0160   2e 32 0d 0a 43 6f 6e 6e 65 63 74 69 6f 6e 3a 20   .2..Connection: 
0170   6b 65 65 70 2d 61 6c 69 76 65 0d 0a 43 6f 6e 74   keep-alive..Cont
0180   65 6e 74 2d 4c 65 6e 67 74 68 3a 20 31 30 36 31   ent-Length: 1061
0190   0d 0a 0d 0a 2d 2d 31 38 39 31 33 66 34 34 2d 63   ....--18913f44-c
01a0   37 64 66 2d 34 65 33 38 2d 62 38 65 34 2d 61 65   7df-4e38-b8e4-ae
01b0   38 62 36 33 61 34 31 64 64 63 0d 0a 43 6f 6e 74   8b63a41ddc..Cont
01c0   65 6e 74 2d 54 79 70 65 3a 20 61 70 70 6c 69 63   ent-Type: applic
01d0   61 74 69 6f 6e 2f 6f 63 74 65 74 2d 73 74 72 65   ation/octet-stre
01e0   61 6d 0d 0a 43 6f 6e 74 65 6e 74 2d 44 69 73 70   am..Content-Disp
01f0   6f 73 69 74 69 6f 6e 3a 20 66 6f 72 6d 2d 64 61   osition: form-da
0200   74 61 3b 20 6e 61 6d 65 3d 22 62 75 6e 64 6c 65   ta; name="bundle
0210   22 3b 20 66 69 6c 65 6e 61 6d 65 3d 22 72 65 61   "; filename="rea
0220   64 6d 65 2e 7a 69 70 22 0d 0a 0d 0a 50 4b 03 04   dme.zip"....PK..
...

Important Note

After writing tests for this bugfix, I discovered that httpbin.org automatically adds the correct content type header to the request, if no header is present. This generates false positives for all tests that are checking for the application/octet-stream content type header on the multipart item. I have left these tests with a TODO comment for a future fix. Testing these properly will require a different testing strategy and I didn't want to introduce that with this PR.

Andrapyre commented 8 months ago

@lihaoyi , could you take a look at this?

lihaoyi commented 8 months ago

@Andrapyre could you explain why the current code does not already do this? I see MultiPartFormRequestBlob already has the following:

    override def headers = Seq(
      "Content-Type" -> s"multipart/form-data; boundary=$boundary"
    )

Is there some reason that isn't working?

Andrapyre commented 8 months ago

@lihaoyi , yes, the file upload request should look something like the following:

POST /upload HTTP/1.1
Content-Type: multipart/form-data; boundary=3b412ae8-c491-4988-8299-0aa83a06d342
Accept: */*
Authorization: auth_token
Accept-Encoding: gzip, deflate
User-Agent: requests-scala
Cache-Control: no-cache
Pragma: no-cache
Host: host
Connection: keep-alive
Content-Length: 1082

--3b412ae8-c491-4988-8299-0aa83a06d342
Content-Type: application/octet-stream
Content-Length: 865
Content-Disposition: form-data; name=“fileKey”; filename=“license.zip"

…long set of bytes…
--3b412ae8-c491-4988-8299-0aa83a06d342--

So with two different header groups and content types - one for the overall request (multipart upload) and one for the file itself (application/octet-stream). The second header is being omitted. While some servers automatically parse it anyway and retrieve the correct content type, others don't, with weird, unpredictable errors as the result.

This PR fixes that. I realized I could get a much simpler implementation of this, but didn't have time until today. I've also created a simple unit test to check for this going forward.

lihaoyi commented 8 months ago

Got it, thanks!