aio-libs / aiohttp

Asynchronous HTTP client/server framework for asyncio and Python
https://docs.aiohttp.org

`resp.content.read(chunk_size)` returns HTTP headers instead of just body. #3329

Closed bradwood closed 5 years ago

bradwood commented 5 years ago

Long story short

I am doing a chunked download from a Range-enabled HTTP server, and the result appears to include the HTTP headers rather than just the body.

Am I using the library incorrectly, or is this a bug?

How do I get only chunks of the body, excluding the HTTP headers and the --boundary delimiter?

Thanks!

Expected behaviour

I expected that only pieces of the HTTP body would be returned when calling `resp.content.read(chunk_size)`

Actual behaviour

The chunks come down correctly, but the headers and boundary delimiters are present in the resulting file.

Steps to reproduce

Here is the code in question:

    async def fetch(self,
                    *,
                    timeout: int = 60, # sec
                    chunk_size: int = 1048576 # = 1 Mb
                    ) -> None:
        """Fetch the Listings XML file."""
        LOGGER.debug(f'Fetch({self}) call started.')
        to_ = ClientTimeout(total=timeout)
        async with ClientSession(timeout=to_) as session:
            LOGGER.debug(f'Fetch: Inside ClientSession()')
            LOGGER.debug(f'Fetch: About to fetch url={self._url}')
            async with session.get(self._url) as resp:
                LOGGER.debug(f'Fetch: Inside session.get(url={self._url})')
                with open(self._full_path, 'wb') as file_desc:
                    while True:
                        LOGGER.debug(f'Fetch: Inside file writing loop. filename={self._full_path}')
                        chunk = await resp.content.read(chunk_size)
                        if not chunk:
                            break
                        LOGGER.debug('Fetch: Got a chunk')
                        file_desc.write(chunk)
                        LOGGER.debug('Fetch: Wrote the chunk')

        LOGGER.debug(f'Fetch() call finished on {self}')

and here is the head of the resulting file:

(pyskyq-4vSEKDfZ) ✔ [brad@bradmac:~/Code/pyskyq] [31-epg-enh|✚ 2] $ head .epg_data/ea51e77b9fdede19528d599f50182d37edcdbc082b06358146041fe446f6a855.xml
--boundary
Content-Type: application/xml
Content-Disposition: attachment; filename="6729.xml"; filename*=utf-8''6729.xml
Content-Length: 10781354

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tv SYSTEM "xmltv.dtd">
<tv generator-info-name="xmltv.co.uk" source-info-name="xmltv.co.uk">
  <channel id="003b31fb0fd63bd8fd171c7d7a1d0249">
    <display-name>GEO News</display-name>
(pyskyq-4vSEKDfZ) ✔ [brad@bradmac:~/Code/pyskyq] [31-epg-enh|✚ 2] $

Your environment

(pyskyq-4vSEKDfZ) ✔ [brad@bradmac:~/Code/pyskyq] [31-epg-enh|✚ 2] $ python -V
Python 3.7.0
(pyskyq-4vSEKDfZ) ✔ [brad@bradmac:~/Code/pyskyq] [31-epg-enh|✚ 2] $ pip freeze | grep aio
aiohttp==3.4.4
(pyskyq-4vSEKDfZ) ✔ [brad@bradmac:~/Code/pyskyq] [31-epg-enh|✚ 2] $
aio-libs-bot commented 5 years ago

GitMate.io thinks the contributor most likely able to help you is @asvetlov.

Possibly related issues are https://github.com/aio-libs/aiohttp/issues/2711 (No content), https://github.com/aio-libs/aiohttp/issues/2062 (Content-Length header), https://github.com/aio-libs/aiohttp/issues/2183 ('None' in HTTP headers), https://github.com/aio-libs/aiohttp/issues/813 (Why uppercase HTTP headers?), and https://github.com/aio-libs/aiohttp/issues/14 (HttpResponse doesn't parse response body without Content-Length header and Connection: close).

asvetlov commented 5 years ago

It's not a chunked encoded body but multipart/form-data encoded form. Please use MultipartReader(resp.headers, resp.content) to extract form data.
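If the response really is multipart-encoded, unwrapping it might look roughly like this (a sketch; `url` and `dest` are placeholder names, and `MultipartReader.from_response()` is a convenience equivalent to `MultipartReader(resp.headers, resp.content)`):

```python
import aiohttp

async def fetch_multipart(url: str, dest: str, chunk_size: int = 2 ** 20) -> None:
    """Download a multipart response, writing only the part bodies to `dest`."""
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            # Build a MultipartReader from resp.headers and resp.content.
            reader = aiohttp.MultipartReader.from_response(resp)
            with open(dest, 'wb') as fd:
                while True:
                    part = await reader.next()  # BodyPartReader, or None when done
                    if part is None:
                        break
                    while True:
                        chunk = await part.read_chunk(chunk_size)
                        if not chunk:
                            break
                        fd.write(chunk)
```

This iterates parts and writes only their decoded bodies, so the `--boundary` lines and per-part headers never reach the output file.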

bradwood commented 5 years ago

It's not form data. It's a large XML payload.

asvetlov commented 5 years ago

Check resp.headers. Your log looks like a multipart message with a large XML payload inside.

bradwood commented 5 years ago

Sorry, I'm confused.

I want the body, not the headers. Essentially, I want to be able to loop over body chunks to write out the data file, without headers.

bradwood commented 5 years ago

Here is the code (a test using aresponses) that mocks the server, in case that helps:

    @pytest.mark.asyncio
    async def test_listing_fetch(aresponses):

        # custom handler to respond with chunks
        async def my_handler(request):
            LOGGER.debug('in handler')
            my_boundary = 'boundary'
            xmlfile_path = Path(__file__).resolve().parent.joinpath('6729.xml')
            LOGGER.debug(f'xml file path = {xmlfile_path}')
            resp = aresponses.Response(status=200,
                                       reason='OK',
                                       )
            resp.enable_chunked_encoding()
            await resp.prepare(request)

            xmlfile = open(xmlfile_path, 'rb')

            LOGGER.debug('opened xml file for serving')
            with MultipartWriter('application/xml', boundary=my_boundary) as mpwriter:
                mpwriter.append(xmlfile)
                LOGGER.debug('appended chunk')
                await mpwriter.write(resp, close_boundary=False)
                LOGGER.debug('wrote chunk')

            xmlfile.close()
            return resp

        aresponses.add('foo.com', '/feed/6715', 'get', response=my_handler)

        with isolated_filesystem():
            l = Listing('http://foo.com/feed/6715')
            await l.fetch()
            assert l._path.joinpath(l._filename).is_file()
asvetlov commented 5 years ago

Please read about multipart encoding first: https://en.wikipedia.org/wiki/MIME#Multipart_messages

Your mocked server is invalid: application/xml is for the entire XML content, not for a multipart container.

P.S. The thing you call a chunk is actually a multipart part. The word chunk is used for another concept, at least in the HTTP protocol.
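If the intent is simply to serve the XML as one plain `application/xml` body with chunked transfer encoding (no multipart framing), a handler along these lines might work. This is a sketch using plain `aiohttp.web`; the file path mirrors the test above and may need adapting for `aresponses`:

```python
from pathlib import Path

from aiohttp import web

async def xml_handler(request: web.Request) -> web.StreamResponse:
    """Stream an XML file as a single application/xml body (no multipart)."""
    resp = web.StreamResponse(status=200, reason='OK')
    resp.content_type = 'application/xml'
    resp.enable_chunked_encoding()
    await resp.prepare(request)
    xml_path = Path(__file__).resolve().parent / '6729.xml'
    with open(xml_path, 'rb') as f:
        while True:
            chunk = f.read(64 * 1024)
            if not chunk:
                break
            await resp.write(chunk)  # each write becomes an HTTP chunk
    await resp.write_eof()
    return resp
```

With this, the client-side `resp.content.read(chunk_size)` loop from the issue report would see only body bytes.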

bradwood commented 5 years ago

So how do I make a server that emulates support for Range headers, then?

Here is the response to a HEAD request on the server I'm trying to emulate:

HTTP/1.1 200 OK
Accept-Ranges: bytes
Connection: keep-alive
Content-Encoding: gzip
Content-Type: application/xml
Date: Sun, 07 Oct 2018 23:11:56 GMT
ETag: "f8889f-577999e0b6f7d-gzip"
Last-Modified: Sun, 07 Oct 2018 01:42:28 GMT
Server: nginx/1.11.10
Vary: Accept-Encoding

How can I make aiohttp behave like that? If it's in the docs, then maybe I missed it, or got confused between multipart and "streaming".

Thanks for your help.

bradwood commented 5 years ago

It should respond like this when a Range header is given:

(pyskyq-4vSEKDfZ) ✘-INT [brad@bradmac:~/Code/pyskyq/tests] [31-epg-enh|✚ 2] $ curl http://www.xmltv.co.uk/feed/6715 -i -H "Range: bytes=0-1023"
HTTP/1.1 206 Partial Content
Server: nginx/1.11.10
Date: Mon, 08 Oct 2018 07:08:53 GMT
Content-Type: application/xml
Content-Length: 1024
Connection: keep-alive
Last-Modified: Mon, 08 Oct 2018 01:42:20 GMT
ETag: "f9f199-577adbb5d510e"
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Range: bytes 0-1023/16380313

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tv SYSTEM "xmltv.dtd">
<tv generator-info-name="xmltv.co.uk" source-info-name="xmltv.co.uk">
  <channel id="003b31fb0fd63bd8fd171c7d7a1d0249">
    <display-name>GEO News</display-name>
  </channel>
  <channel id="0092ad6b181b813d9e2ceed1cfbf5bf1">
    <display-name>Notts TV</display-name>
  </channel>
  <channel id="00da025711e82cf319cb488d5988c099">
    <display-name>Dunya News</display-name>
  </channel>

Is this type of server supported in aiohttp, using Multipart* objects, or Stream*? I have been digging through the docs for this but it's not clear.

asvetlov commented 5 years ago

The latest response is neither streaming nor multipart.

It is just a regular response with a truncated body: web.Response(status=206, headers={<fill them in yourself>}, body=xml_bytes[:1000]).

I'm closing the issue because it is not about aiohttp bugs/improvements but about teaching @bradwood the HTTP protocol.

Please use another site for it. Maybe StackOverflow fits better.
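For reference, a minimal sketch of such a Range-aware handler, using `request.http_range` (aiohttp parses the `Range` header into a Python slice). The `payload` app key and the handler name are illustrative, not part of any aiohttp API:

```python
from aiohttp import web

def slice_body(data: bytes, rng: slice):
    """Resolve a Range slice (as given by request.http_range) against the payload."""
    start, stop, _ = rng.indices(len(data))
    return start, stop, data[start:stop]

async def ranged_xml_handler(request: web.Request) -> web.Response:
    payload = request.app['payload']  # the full XML document as bytes (illustrative)
    rng = request.http_range  # slice(None, None, 1) when no Range header was sent
    if rng.start is None and rng.stop is None:
        return web.Response(body=payload, content_type='application/xml',
                            headers={'Accept-Ranges': 'bytes'})
    start, stop, chunk = slice_body(payload, rng)
    return web.Response(
        status=206,
        body=chunk,
        content_type='application/xml',
        headers={
            'Accept-Ranges': 'bytes',
            # Content-Range uses an inclusive end index
            'Content-Range': f'bytes {start}-{stop - 1}/{len(payload)}',
        },
    )
```

For `Range: bytes=0-1023`, `request.http_range` is `slice(0, 1024, 1)`, so slicing the payload with it reproduces the `206 Partial Content` / `Content-Range: bytes 0-1023/...` behaviour shown in the curl transcript above. A production handler would also validate the range and answer 416 when it is unsatisfiable.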

bradwood commented 5 years ago

I don't need to be taught the HTTP protocol on this forum, @asvetlov. I am perfectly capable of reading wikipedia and RFCs just like you.

I am asking about aiohttp support for this. Does it support it, or not? Please refer me to the documentation, if so, or tell me that it doesn't.

Did you not read this?

Is this type of server supported in aiohttp? using Multipart objects? or Stream? I have been digging through the docs for this but it's not clear.

FWIW, while I may have made a mistake in interpretation earlier, I don't appreciate your comment about teaching me HTTP. There is no need for rudeness.

I've been reading your responses to many people on this forum - you are extremely rude to many of them. You like to tell them to read wikipedia instead of actually being helpful. It's condescending and unhelpful. In many cases, these questions are the result of poorly documented examples of how aiohttp implements, or doesn't implement, a particular feature, not the protocol itself.

Look, don't get me wrong, I appreciate your contribution to the community, but it would be much better if (a) the docs were improved so that answers could be found without raising tickets and (b) you were less dismissive and insulting toward people who have legitimate questions about the codebase, not the protocol.

asvetlov commented 5 years ago
  1. aiohttp supports the request.range property to help with parsing the Range HTTP header. It supports ranged requests in static file serving. The library doesn't provide a magic helper for returning a ranged response for arbitrary data -- the user should construct this response manually.

  2. The main github tracker's mission is the development of aiohttp, not aiohttp usage. For example, CPython itself forbids questions about Python usage in its bug tracker and python-dev mailing list. Should we enable the same policy for aiohttp? I don't know, but this tracker is not a place for general questions. It is not a forum or a questions-and-answers resource.

  3. We have a different understanding of rudeness. Pointing to a helpful resource for further reading is a good response in my view. RTFM and so forth. If it is not enough for you -- that's fine. Please use another site like stackoverflow.com for asking usage questions.

  4. The documentation is never perfect. It can always be improved. Please make Pull Request(s) for documentation improvements. I would very much appreciate it.

bradwood commented 5 years ago
  1. aiohttp supports the request.range property to help with parsing the Range HTTP header. It supports ranged requests in static file serving. The library doesn't provide a magic helper for returning a ranged response for arbitrary data -- the user should construct this response manually.

Ok great -- this is helpful - I will do that. I thought there might be a higher-level API that did this, as is the case for Streams and Multipart -- so not an unreasonable question IMHO.

  2. The main github tracker's mission is the development of aiohttp, not aiohttp usage. For example, CPython itself forbids questions about Python usage in its bug tracker and python-dev mailing list. Should we enable the same policy for aiohttp? I don't know, but this tracker is not a place for general questions. It is not a forum or a questions-and-answers resource.

Ok, well, initially I thought it was a bug rather than a usage query, and I'd assert that the way in which something can be used, or not used, is part of its development agenda. If you make something that is difficult to use or understand, then surely it's a (usability) bug?

  3. We have a different understanding of rudeness. Pointing to a helpful resource for further reading is a good response in my view. RTFM and so forth. If it is not enough for you -- that's fine. Please use another site like stackoverflow.com for asking usage questions.

Ok -- fair enough... I think the Robustness Principle should apply here too... I honestly thought this was a bug/weakness in the API, which was a legitimate query. While I don't know every HTTP RFC by heart, I do think I know enough about it to ask relevant questions about aiohttp's implementation of bits of it. So being told that you are not going to "teach someone HTTP" is a pretty blunt response to a legitimate query.

  4. The documentation is never perfect. It can always be improved. Please make Pull Request(s) for documentation improvements. I would very much appreciate it.

When time permits, and I've got a working example for this topic, I'll try to do exactly that.

asvetlov commented 5 years ago

Sorry for my attitude and thanks for understanding.

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. If you feel there are important points made in this discussion, please include those excerpts in the new issue.