benoitc / gunicorn

gunicorn 'Green Unicorn' is a WSGI HTTP Server for UNIX, fast clients and sleepy applications.
http://www.gunicorn.org
Other
9.87k stars 1.75k forks source link

Why is it necessary to check if the first two bytes of data are \r\n? #3300

Open kwsy opened 1 month ago

kwsy commented 1 month ago

`def parse(self, unreader): buf = io.BytesIO() self.get_data(unreader, buf, stop=True)

  # get request line
  line, rbuf = self.read_line(unreader, buf, self.limit_request_line)

  # proxy protocol
  if self.proxy_protocol(bytes_to_str(line)):
      # get next request line
      buf = io.BytesIO()
      buf.write(rbuf)
      line, rbuf = self.read_line(unreader, buf, self.limit_request_line)

  self.parse_request_line(line)           # 解析请求行, 也就是第一行
  buf = io.BytesIO()
  buf.write(rbuf)

  # Headers   接下来解析headers
  data = buf.getvalue()
  idx = data.find(b"\r\n\r\n")

  done = data[:2] == b"\r\n"
  while True:
      idx = data.find(b"\r\n\r\n")
      done = data[:2] == b"\r\n"    # 这条语句的依据是什么?

      if idx < 0 and not done:
          self.get_data(unreader, buf)
          data = buf.getvalue()
          if len(data) > self.max_buffer_headers:
              raise LimitRequestHeaders("max buffer headers")
      else:
          break

  if done:
      self.unreader.unread(data[2:])
      return b""

  self.headers = self.parse_headers(data[:idx])

  ret = data[idx + 4:]    # body的部分
  buf = None
  return ret

`

parse is the method of Request class , It reads the message header part of the HTTP request. There is always \r\n\r\n between the message header and the message body. However, the method checks whether the first two bytes of data are \r\n. This is strange. Isn't the end marker of the message header only \r\n\r\n? Why, even if \r\n\r\n is not found, is the message header considered to have ended when data starts with \r\n? Is there any basis for doing this?

pajod commented 1 month ago

Why, even if \r\n\r\n is not found, is the message header considered to have ended?

Because on reading the request (and PROXY) line, its trailing \r\n was discarded from buf. Thus done distinguishes a) expecting two consecutive \r\n marking the last header and b) just expecting a single \r\n at the start to signal a HTTP/1.0 request with zero headers.

The code could be refactored to remove most of the complexity and be more efficient for invalid/oversize/partial input - but I do not see a logical flaw. If you do, please show the request that you consider incorrectly parsed.