Gerapy / GerapyPlaywright

Downloader Middleware to support Playwright in Scrapy & Gerapy
Fix for gzip.BadGzipFile: Not a gzipped file (b'<!'), and the bug behind it #14

Open legend-zl opened 2 years ago

legend-zl commented 2 years ago

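If a crawled site returns a response whose headers include content-encoding: "gzip", the error above is raised, even though the author tries to strip this header in the following snippet from downloadermiddlewares.py: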

        # Necessary to bypass the compression middleware
        # This only removes content-encoding from the local `headers`; it is still
        # present in response.headers, so the call below should pass headers=headers.
        headers = response.headers
        headers.pop('content-encoding', None)
        headers.pop('Content-Encoding', None)

        response = HtmlResponse(
            page.url,
            status=response.status,
            headers=response.headers,    # fix: change this to headers=headers
            body=content,
            encoding='utf-8',
            request=request
        )

Unfortunately, the pop calls do not actually remove the header. The error only goes away once headers=response.headers is changed to headers=headers.
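The behaviour described above can be reproduced with the standard library alone. The sketch below is illustrative only: `FakeBrowserResponse` and `strip_content_encoding` are hypothetical names, and the class assumes that each read of `headers` on the browser response builds a new dict, which would explain why popping from the local variable has no effect on `response.headers`. It also shows why a leftover gzip header matters: the body is already plain HTML, so trying to gunzip it fails with the error in the title.

```python
import gzip


class FakeBrowserResponse:
    """Hypothetical stand-in for the browser response object (not the real API)."""

    def __init__(self, raw_headers):
        self._raw_headers = dict(raw_headers)

    @property
    def headers(self):
        # Assumption: every access returns a fresh dict, so mutating the
        # returned value never changes what the next access returns.
        return dict(self._raw_headers)


def strip_content_encoding(headers):
    """Return a copy of headers without Content-Encoding, ignoring case."""
    return {k: v for k, v in headers.items() if k.lower() != 'content-encoding'}


response = FakeBrowserResponse({
    'content-type': 'text/html',
    'content-encoding': 'gzip',
})

headers = response.headers
headers.pop('content-encoding', None)

# The local copy is clean, but a fresh read still carries the header.
still_present = 'content-encoding' in response.headers

# Passing a cleaned copy (the headers=headers fix) avoids the stale header.
cleaned = strip_content_encoding(response.headers)

# The rendered page body is already-decoded HTML, so gunzipping it fails.
body = b'<!DOCTYPE html><html></html>'
try:
    gzip.decompress(body)
    error_message = None
except gzip.BadGzipFile as exc:
    error_message = str(exc)
```

If the assumption holds, this is why only the cleaned `headers` object, and not `response.headers`, is safe to pass to `HtmlResponse`.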

legend-zl commented 2 years ago

The comments in the code above were added by me.

tangyuanba commented 2 years ago

Thanks for the solution. I found that doing the removal after constructing the HtmlResponse also returns a correct response:

        response = HtmlResponse(
            page.url,
            status=response.status,
            headers=response.headers,
            body=content,
            encoding='utf-8',
            request=request
        )

        headers.pop('content-encoding', None)
        headers.pop('Content-Encoding', None)

yswtrue commented 2 years ago

Removing the scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware middleware also works for me.
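One way to remove that middleware is through the project settings, since Scrapy disables a built-in downloader middleware when its value in `DOWNLOADER_MIDDLEWARES` is set to `None`. This is a minimal sketch of a settings.py fragment, assuming the entry is merged into whatever middleware dict the project already defines:

```python
# settings.py (sketch): merge this entry into the project's existing dict.
DOWNLOADER_MIDDLEWARES = {
    # Setting a built-in middleware to None disables it, so nothing will try
    # to gunzip a body that has already been decoded.
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': None,
}
```

Note that this turns off compression handling for every request in the project, not only the ones rendered through Playwright, so fixing the headers in the middleware is the narrower change.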