Cadair / parfive

An asyncio based parallel file downloader for Python 3.8+
https://parfive.readthedocs.io/
MIT License

Download fails with 400, message='Can not decode content-encoding: gzip' #121

Open wtbarnes opened 1 year ago

wtbarnes commented 1 year ago

When trying to download a (seemingly) plain text file, the download fails with the following error:

[<parfive.results.Error object at 0x1078a66d0>
https://sohoftp.nascom.nasa.gov/solarsoft/sdo/aia/response/aia_V3_error_table.txt
 400, message='Can not decode content-encoding: gzip']

The code to reproduce this is:

import parfive
dl = parfive.Downloader()
dl.enqueue_file('https://sohoftp.nascom.nasa.gov/solarsoft/sdo/aia/response/aia_V3_error_table.txt', path='.')
foo = dl.download()

parfive version: 2.0.2
Python version: 3.9
OS: macOS 12.6.2


Solution detailed here: https://github.com/Cadair/parfive/issues/121#issuecomment-1379797688

Cadair commented 1 year ago

Can you give me a full print out of the http response headers? (Either with httpie/curl or debug logging?)

alasdairwilson commented 1 year ago

If it helps, using aiohttp directly works:

import asyncio

import aiohttp


async def main():
    async with aiohttp.ClientSession() as session:
        # Single GET for the whole file, with no Range header
        async with session.get('https://sohoftp.nascom.nasa.gov/solarsoft/sdo/aia/response/aia_V9_20200706_215452_response_table.txt') as response:
            print("Status:", response.status)
            print([f"{key}: {response.headers[key]}" for key in response.headers])

            # aiohttp decompresses the gzip-encoded body automatically here
            html = await response.text()
            print("Body:", html, "...")


asyncio.run(main())
'Date: Tue, 10 Jan 2023 20:19:41 GMT'
'Server: Apache'
'Strict-Transport-Security: max-age=31536000; includeSubdomains;'
'Last-Modified: Tue, 06 Jul 2021 21:57:00 GMT'
'Etag: "4d58-5c67b8056c300-gzip"'
'Accept-Ranges: bytes'
'Vary: Accept-Encoding'
'Content-Encoding: gzip'
'Content-Length: 2156'
'Content-Type: text/plain'
Body:                       DATE                   T_START                    T_STOP    VER_NUM   WAVE_STR   WAVELNTH     EPERDN   DNPERPHT   EFF_AREA   EFF_WVLN    EFFA_P1    EFFA_P2    EFFA_P3       RMSE

wtbarnes commented 1 year ago

$ curl --head https://sohoftp.nascom.nasa.gov/solarsoft/sdo/aia/response/aia_V3_error_table.txt
HTTP/1.1 200 OK
Date: Tue, 10 Jan 2023 21:43:43 GMT
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubdomains;
Last-Modified: Thu, 27 Sep 2012 13:22:00 GMT
ETag: "72d-4caaed300ae00"
Accept-Ranges: bytes
Content-Length: 1837
Vary: Accept-Encoding
Content-Type: text/plain

wtbarnes commented 1 year ago

The filename in my original post was actually not the right one (fixed now), but I believe both show the same issue.

ayshih commented 1 year ago

Here's a curl request that asks for a gzip-ed response, piped to gunzip to decompress it:

$ curl -v -H 'Accept-encoding: gzip' https://sohoftp.nascom.nasa.gov/solarsoft/sdo/aia/response/aia_V3_error_table.txt | gunzip -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 198.118.248.247:443...
* Connected to sohoftp.nascom.nasa.gov (198.118.248.247) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.2 (IN), TLS header, Finished (20):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [25 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [2682 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [264 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.2 (OUT), TLS header, Finished (20):
} [5 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=sohoftp.nascom.nasa.gov
*  start date: Dec  6 22:29:40 2022 GMT
*  expire date: Mar  6 22:29:39 2023 GMT
*  subjectAltName: host "sohoftp.nascom.nasa.gov" matched cert's "sohoftp.nascom.nasa.gov"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
> GET /solarsoft/sdo/aia/response/aia_V3_error_table.txt HTTP/1.1
> Host: sohoftp.nascom.nasa.gov
> User-Agent: curl/7.81.0
> Accept: */*
> Accept-encoding: gzip
>
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [57 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [57 bytes data]
* old SSL session ID is stale, removing
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Date: Wed, 11 Jan 2023 14:42:18 GMT
< Server: Apache
< Strict-Transport-Security: max-age=31536000; includeSubdomains;
< Last-Modified: Thu, 27 Sep 2012 13:22:00 GMT
< ETag: "72d-4caaed300ae00-gzip"
< Accept-Ranges: bytes
< Vary: Accept-Encoding
< Content-Encoding: gzip
< Content-Length: 356
< Content-Type: text/plain
<
{ [356 bytes data]
100   356  100   356    0     0   1096      0 --:--:-- --:--:-- --:--:--  1098
* Connection #0 to host sohoftp.nascom.nasa.gov left intact
                      DATE                   T_START                    T_STOP    VER_NUM   WAVE_STR   WAVELNTH   DNPERPHT   COMPRESS     CALERR    CHIANTI     EVEERR
   2012-02-10T01:12:01.000   2010-05-01T00:00:00.000   2050-05-01T00:00:00.000          3    94_THIN         94      1.975     26.196      0.250      0.500      0.087
   2012-02-10T01:12:01.000   2010-05-01T00:00:00.000   2050-05-01T00:00:00.000          3   131_THIN        131      1.473     18.797      0.250      0.500      0.051
   2012-02-10T01:12:01.000   2010-05-01T00:00:00.000   2050-05-01T00:00:00.000          3   171_THIN        171      1.122     14.400      0.250      0.250      0.019
   2012-02-10T01:12:01.000   2010-05-01T00:00:00.000   2050-05-01T00:00:00.000          3   193_THIN        193      0.962     12.759      0.250      0.250      0.014
   2012-02-10T01:12:01.000   2010-05-01T00:00:00.000   2050-05-01T00:00:00.000          3   211_THIN        211      0.880     11.670      0.250      0.250      0.019
   2012-02-10T01:12:01.000   2010-05-01T00:00:00.000   2050-05-01T00:00:00.000          3   304_THIN        304      0.611      8.100      0.250      0.500      0.023
   2012-02-10T01:12:01.000   2010-05-01T00:00:00.000   2050-05-01T00:00:00.000          3   335_THIN        335      0.576      7.350      0.250      0.250      0.097
   2012-02-10T01:12:01.000   2010-05-01T00:00:00.000   2050-05-01T00:00:00.000          3       1600       1600      0.120      1.539      0.500      1.000      0.012
   2012-02-10T01:12:01.000   2010-05-01T00:00:00.000   2050-05-01T00:00:00.000          3       1700       1700      0.113      0.362      0.500      1.000      0.035
   2012-02-10T01:12:01.000   2010-05-01T00:00:00.000   2050-05-01T00:00:00.000          3       4500       4500      0.056      0.068      0.500      1.000      0.030

That is, the server appears to be returning a valid gzip-ed response, which is consistent with the fact that aiohttp by itself appears to have no problem.

ayshih commented 1 year ago

Now that I've added a stream handler to the parfive logger so that I can actually see the parfive debug logging, it's clearer what is going on. The parfive error is related to the fact that parfive is splitting this already tiny file into super-tiny 72-byte requests. There is no error if you explicitly specify max_splits=1 when instantiating the downloader, which is equivalent to the lack of splitting when using curl or the simple aiohttp example above.

Here's example parfive debug output:

GET request made to https://sohoftp.nascom.nasa.gov/solarsoft/sdo/aia/response/aia_V3_error_table.txt with headers=<CIMultiDictProxy('Host': 'sohoftp.nascom.nasa.gov', 'User-Agent': 'parfive/2.0.2 aiohttp/3.8.3 python/3.10.6', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate')>
200 Response received from https://sohoftp.nascom.nasa.gov/solarsoft/sdo/aia/response/aia_V3_error_table.txt with headers=<CIMultiDictProxy('Date': 'Wed, 11 Jan 2023 15:30:19 GMT', 'Server': 'Apache', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains;', 'Last-Modified': 'Thu, 27 Sep 2012 13:22:00 GMT', 'Etag': '"72d-4caaed300ae00-gzip"', 'Accept-Ranges': 'bytes', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Content-Length': '356', 'Content-Type': 'text/plain')>
GET request made for download to https://sohoftp.nascom.nasa.gov/solarsoft/sdo/aia/response/aia_V3_error_table.txt with headers=<CIMultiDictProxy('Host': 'sohoftp.nascom.nasa.gov', 'User-Agent': 'parfive/2.0.2 aiohttp/3.8.3 python/3.10.6', 'Range': 'bytes=0-71', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate')>
206 Response received from https://sohoftp.nascom.nasa.gov/solarsoft/sdo/aia/response/aia_V3_error_table.txt with headers=<CIMultiDictProxy('Date': 'Wed, 11 Jan 2023 15:30:19 GMT', 'Server': 'Apache', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains;', 'Last-Modified': 'Thu, 27 Sep 2012 13:22:00 GMT', 'Etag': '"72d-4caaed300ae00-gzip"', 'Accept-Ranges': 'bytes', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Content-Range': 'bytes 0-71/356', 'Content-Length': '72', 'Content-Type': 'text/plain')>
GET request made for download to https://sohoftp.nascom.nasa.gov/solarsoft/sdo/aia/response/aia_V3_error_table.txt with headers=<CIMultiDictProxy('Host': 'sohoftp.nascom.nasa.gov', 'User-Agent': 'parfive/2.0.2 aiohttp/3.8.3 python/3.10.6', 'Range': 'bytes=213-284', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate')>
206 Response received from https://sohoftp.nascom.nasa.gov/solarsoft/sdo/aia/response/aia_V3_error_table.txt with headers=<CIMultiDictProxy('Date': 'Wed, 11 Jan 2023 15:30:19 GMT', 'Server': 'Apache', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains;', 'Last-Modified': 'Thu, 27 Sep 2012 13:22:00 GMT', 'Etag': '"72d-4caaed300ae00-gzip"', 'Accept-Ranges': 'bytes', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Content-Range': 'bytes 213-284/356', 'Content-Length': '72', 'Content-Type': 'text/plain')>
1/0 files failed to download. Please check `.errors` for details
https://sohoftp.nascom.nasa.gov/solarsoft/sdo/aia/response/aia_V3_error_table.txt failed to download with exception
400, message='Can not decode content-encoding: gzip'

Note that it appeared to error on just the second 72-byte chunk here. Seemingly sometimes that chunk succeeds, and a different chunk errors. I'm not sure why some chunks fail sometimes.
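
The 72-byte ranges in the log above are consistent with dividing the 356-byte Content-Length into five parts. The sketch below is one splitting scheme that happens to reproduce those Range headers; the function name `split_ranges` and the figure of five splits are assumptions for illustration, not a confirmed description of how parfive computes its ranges.

```python
def split_ranges(content_length, splits):
    """Return inclusive (start, end) byte ranges covering content_length bytes."""
    # Hypothetical scheme: fixed step of content_length // splits bytes
    step = max(1, content_length // splits)
    return [
        (start, min(start + step, content_length - 1))
        for start in range(0, content_length, step)
    ]


# 356 is the Content-Length of the gzip-compressed response in the log
ranges = split_ranges(356, 5)
headers = [f"bytes={start}-{end}" for start, end in ranges]
for header in headers:
    print(header)
```

Because HTTP byte ranges are inclusive at both ends, a step of 71 gives 72-byte responses, and neighbouring ranges share one byte.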

ayshih commented 1 year ago

Ah, it's actually failing on any chunk other than the first one. Here's what I believe is happening, and it traces back to aiohttp.

Since Content-Type is text/plain, and Accept-Encoding includes gzip, the web server will actually gzip-compress the text file before complying with any request for specific bytes. parfive determines the byte chunks based on Content-Length, which is the length of the compressed data, so when parfive sends the request for bytes 0–71 of the file, it actually gets bytes 0–71 of the compressed file, rather than a compressed version of bytes 0–71 of the original file.

Setting aside the fact that the bytes aren't the same, the problem is that each individual chunk is not a complete gzip-compressed payload, but rather just a partial segment of one. However, aiohttp sees Content-Encoding is gzip for each 72-byte chunk and will try to uncompress each chunk separately. For the first chunk, it will somewhat succeed (but gunzip sees the file terminate abruptly). For every other chunk, since they lack the magic bytes at the start, gunzip outright fails.
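
The behaviour described above can be reproduced offline with only the standard library. This is a minimal sketch, assuming made-up table data in place of the real file and a hypothetical helper `try_decompress` that stands in for a decoder handling each response in isolation; it is not aiohttp's actual code path.

```python
import gzip
import zlib

# Stand-in for the server's text file (made-up data, not the real table).
text = "\n".join(f"{i:>10d}{i * 7919 % 1000:>10d}" for i in range(200)).encode()
compressed = gzip.compress(text)

# Slice the compressed payload the way byte-range requests would.
CHUNK = 72
chunks = [compressed[i:i + CHUNK] for i in range(0, len(compressed), CHUNK)]


def try_decompress(chunk):
    """Decode one chunk in isolation, as a per-response gzip decoder would."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    try:
        out = decoder.decompress(chunk)
    except zlib.error as exc:
        return f"error: {exc}"
    state = "complete" if decoder.eof else "truncated"
    return f"{state}: {len(out)} bytes recovered"


results = [try_decompress(chunk) for chunk in chunks]
for index, result in enumerate(results):
    print(index, result)

# Joining the raw chunks first restores a valid gzip stream.
assert gzip.decompress(b"".join(chunks)) == text
```

Only the first chunk carries the gzip header, so it decodes partially and stops short; a chunk taken from the middle of the stream is rejected because it does not begin with the gzip magic bytes.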

ayshih commented 1 year ago

Using curl to request bytes 0–71, then piped to gunzip:

$ curl -v -H 'Accept-encoding: gzip' -H 'Range: bytes=0-71' https://sohoftp.nascom.nasa.gov/solarsoft/sdo/aia/response/aia_V3_error_table.txt | gunzip -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 198.118.248.247:443...
* Connected to sohoftp.nascom.nasa.gov (198.118.248.247) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.2 (IN), TLS header, Finished (20):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [25 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [2682 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [264 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.2 (OUT), TLS header, Finished (20):
} [5 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=sohoftp.nascom.nasa.gov
*  start date: Dec  6 22:29:40 2022 GMT
*  expire date: Mar  6 22:29:39 2023 GMT
*  subjectAltName: host "sohoftp.nascom.nasa.gov" matched cert's "sohoftp.nascom.nasa.gov"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
> GET /solarsoft/sdo/aia/response/aia_V3_error_table.txt HTTP/1.1
> Host: sohoftp.nascom.nasa.gov
> User-Agent: curl/7.81.0
> Accept: */*
> Accept-encoding: gzip
> Range: bytes=0-71
>
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [57 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [57 bytes data]
* old SSL session ID is stale, removing
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* Mark bundle as not supporting multiuse
< HTTP/1.1 206 Partial Content
< Date: Wed, 11 Jan 2023 15:38:57 GMT
< Server: Apache
< Strict-Transport-Security: max-age=31536000; includeSubdomains;
< Last-Modified: Thu, 27 Sep 2012 13:22:00 GMT
< ETag: "72d-4caaed300ae00-gzip"
< Accept-Ranges: bytes
< Vary: Accept-Encoding
< Content-Encoding: gzip
< Content-Range: bytes 0-71/356
< Content-Length: 72
< Content-Type: text/plain
<
{ [72 bytes data]
100    72  100    72    0     0   1184      0 --:--:-- --:--:-- --:--:--  1200
* Connection #0 to host sohoftp.nascom.nasa.gov left intact
                      DATE                   T_START
gzip: stdin: unexpected end of file

Using curl to request a later chunk, then piped to gunzip, to mimic parfive and aiohttp:

$ curl -v -H 'Accept-encoding: gzip' -H 'Range: bytes=142-213' https://sohoftp.nascom.nasa.gov/solarsoft/sdo/aia/response/aia_V3_error_table.txt | gunzip -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 198.118.248.247:443...
* Connected to sohoftp.nascom.nasa.gov (198.118.248.247) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.2 (IN), TLS header, Certificate Status (22):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.2 (IN), TLS header, Finished (20):
{ [5 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [25 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [2682 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [264 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.2 (OUT), TLS header, Finished (20):
} [5 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=sohoftp.nascom.nasa.gov
*  start date: Dec  6 22:29:40 2022 GMT
*  expire date: Mar  6 22:29:39 2023 GMT
*  subjectAltName: host "sohoftp.nascom.nasa.gov" matched cert's "sohoftp.nascom.nasa.gov"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
> GET /solarsoft/sdo/aia/response/aia_V3_error_table.txt HTTP/1.1
> Host: sohoftp.nascom.nasa.gov
> User-Agent: curl/7.81.0
> Accept: */*
> Accept-encoding: gzip
> Range: bytes=142-213
>
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [57 bytes data]
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [57 bytes data]
* old SSL session ID is stale, removing
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* Mark bundle as not supporting multiuse
< HTTP/1.1 206 Partial Content
< Date: Wed, 11 Jan 2023 15:39:10 GMT
< Server: Apache
< Strict-Transport-Security: max-age=31536000; includeSubdomains;
< Last-Modified: Thu, 27 Sep 2012 13:22:00 GMT
< ETag: "72d-4caaed300ae00-gzip"
< Accept-Ranges: bytes
< Vary: Accept-Encoding
< Content-Encoding: gzip
< Content-Range: bytes 142-213/356
< Content-Length: 72
< Content-Type: text/plain
<
{ [72 bytes data]
100    72  100    72    0     0    947      0 --:--:-- --:--:-- --:--:--   960
* Connection #0 to host sohoftp.nascom.nasa.gov left intact

gzip: stdin: not in gzip format

ayshih commented 1 year ago

Here's the modified version of @alasdairwilson's example to show that the problem is with aiohttp, not with parfive:

>>> import asyncio
>>> import aiohttp
>>>
>>> async def main():
...     async with aiohttp.ClientSession(headers={'Range': 'bytes=142-213'}) as session:
...         async with session.get('https://sohoftp.nascom.nasa.gov/solarsoft/sdo/aia/response/aia_V3_error_table.txt') as response:
...             print("\n".join([f"{key}: {response.headers[key]}" for key in response.headers]))
...             html = await response.text()
...             print(f"Body:\n{html}")
...
>>> asyncio.run(main())
Date: Wed, 11 Jan 2023 21:13:24 GMT
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubdomains;
Last-Modified: Thu, 27 Sep 2012 13:22:00 GMT
Etag: "72d-4caaed300ae00-gzip"
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Range: bytes 142-213/356
Content-Length: 72
Content-Type: text/plain
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\ayshih\AppData\Local\mambaforge\envs\test4\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Users\ayshih\AppData\Local\mambaforge\envs\test4\lib\asyncio\base_events.py", line 646, in run_until_complete
    return future.result()
  File "<stdin>", line 5, in main
  File "C:\Users\ayshih\AppData\Local\mambaforge\envs\test4\lib\site-packages\aiohttp\client_reqrep.py", line 1081, in text
    await self.read()
  File "C:\Users\ayshih\AppData\Local\mambaforge\envs\test4\lib\site-packages\aiohttp\client_reqrep.py", line 1037, in read
    self._body = await self.content.read()
  File "C:\Users\ayshih\AppData\Local\mambaforge\envs\test4\lib\site-packages\aiohttp\streams.py", line 349, in read
    raise self._exception
aiohttp.client_exceptions.ClientPayloadError: 400, message='Can not decode content-encoding: gzip'

ayshih commented 1 year ago

Okay, okay, okay. It is possible to turn off aiohttp's automatic decompression of gzip-compressed responses by specifying (surprise) auto_decompress=False. Thus, since parfive is the one that is insisting on partial downloads, it should be the one to fix this issue. parfive needs to set auto_decompress=False, stitch together the partial responses from aiohttp into a single gzip-compressed payload, and then decompress that single payload itself.
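
The stitching step might look roughly like the sketch below. It assumes the raw, still-compressed body of each range response has already been fetched with auto_decompress=False and stored by its starting byte offset; the fetching itself is not shown. The names `stitch_and_decompress` and `parts` are hypothetical, and this is not parfive's actual fix.

```python
import gzip
from typing import Mapping


def stitch_and_decompress(parts: Mapping[int, bytes]) -> bytes:
    """Join raw range bodies by start offset, then gunzip the result once."""
    compressed = bytearray()
    for start in sorted(parts):
        if start > len(compressed):
            raise ValueError(f"missing bytes before offset {start}")
        # Ranges may overlap, so keep only the bytes not yet assembled
        compressed += parts[start][len(compressed) - start:]
    return gzip.decompress(bytes(compressed))


# Simulate a server that compresses first and serves byte ranges second
# (made-up data, sliced locally instead of fetched over HTTP).
original = "\n".join(f"{i:>10d}{i * 7919 % 1000:>10d}" for i in range(200)).encode()
payload = gzip.compress(original)
STEP = 71
parts = {
    start: payload[start:start + STEP + 1]  # inclusive range start..start+71
    for start in range(0, len(payload), STEP)
}

restored = stitch_and_decompress(parts)
print(len(payload), "compressed bytes ->", len(restored), "bytes restored")
```

Keying the parts by offset means the chunks can arrive in any order, and the one-byte overlap between neighbouring ranges seen in the debug log is dropped during assembly.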

dreamflasher commented 1 year ago

Does this also happen when you set max_splits=1?

Just tried it out and that works (does not throw an exception).

ayshih commented 1 year ago

The download succeeds with max_splits=1. See my earlier comment.