Boris-code / feapder

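🚀🚀🚀 feapder is an easy-to-use, powerful Python crawler framework. It ships with four spider types (AirSpider, Spider, TaskSpider, and BatchSpider) to cover different scenarios, and supports resumable crawling, monitoring and alerting, browser rendering, and large-scale data deduplication. The companion crawler management system feaplat provides convenient deployment and scheduling.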
http://feapder.com

Web page is returned with the wrong encoding #264

Open ruikai0103 opened 2 weeks ago

ruikai0103 commented 2 weeks ago

Notice

Upgrade feapder and make sure you are on the latest version. If the bug still exists, describe the problem in detail.

pip install --upgrade feapder

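Problem

When I request https://www.bookschina.com/8342179.htm with feapder, the string content of the returned page is garbled, whereas the same request made with requests returns normal data.

I also passed auto_request=False on the request and called requests manually in the callback. The data returned by requests is normal, but after converting it with response = feapder.Response(response), the strings become garbled. I have already tried setting response.code = "utf-8" and "gb2312"; neither works.

Screenshot: image

Code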
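Editor's sketch: the issue does not establish the root cause, but garbling of this kind is typical when bytes in a GBK-family encoding are decoded with a different codec. The snippet below is a minimal, standard-library-only illustration of that assumption. The sample string is hypothetical and is not taken from the page in question, and nothing here reflects how feapder decodes responses internally.

```python
# Hypothetical sample text; NOT the actual content of the bookschina page.
sample_title = "中国图书网"

# Assume the server sends the page in a GBK-family encoding.
raw = sample_title.encode("gbk")

# Decoding those bytes as UTF-8 cannot succeed cleanly; with errors="replace"
# the invalid sequences turn into U+FFFD replacement characters.
wrong = raw.decode("utf-8", errors="replace")

# gb18030 is a superset of gbk/gb2312, so it decodes the same bytes correctly.
right = raw.decode("gb18030")

print(repr(wrong))
print(right)
```

If the garbled title in the report contains replacement characters like these, a codec mismatch is a reasonable first thing to check.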


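import feapder
import requests

# Note: sui_dao_proxies() used below is the poster's own proxy helper; its definition is not shown.

headers = {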
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
    "Pragma": "no-cache",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "sec-ch-ua": "\"Chromium\";v=\"124\", \"Google Chrome\";v=\"124\", \"Not-A.Brand\";v=\"99\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"Windows\""
}

class AirSpiderDemo(feapder.AirSpider):
    def start_requests(self):
        url = "https://www.bookschina.com/8342179.htm"
        yield feapder.Request(url, method="GET", auto_request=False)

    def download_midware(self, request):
        request.headers = headers
        request.proxies = sui_dao_proxies()
        request.cookies = {
            # "BookUser": "1%7c2e9892dc-7c4f-47d1-95f0-2ebd328c90bf%7c1%7c0%7c638620898507693730%7c20180722%7c337457b7db499919",
            # "UserSign": "069f073dff21b10b",
            # "ASP.NET_SessionId": "rrwxo4jepzlcbw5yy0h2jw4y",
            # "UserUnionId": "de943031-e334-4f0b-8d5c-907cfd37b467",
            # "booklisthistory": "8342179,7733959,8300491,7438304,9103303,6909214,7156650,6900090,8898194,8989529"
        }
        return request

    def parse(self, request, response):

        response = requests.get(request.url, proxies=sui_dao_proxies())
        print(response.text)
        response = feapder.Response(response)
        # response.encoding_errors = 'replace'
        title = response.xpath("//h1/text()").extract_first()
        # print(response.text)
        print(response)
        print(title)

if __name__ == "__main__":
    AirSpiderDemo(thread_count=1).start()

ruikai0103 commented 2 weeks ago

Temporary workaround

response1 = requests.get(request.url, proxies=sui_dao_proxies())
response = feapder.Response(response1)
response.text = response1.text

When the text comes back garbled, manually overwrite the text of the feapder Response with the text from requests.
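Editor's sketch: a variant of the same workaround that decodes the raw bytes explicitly instead of relying on whichever encoding was guessed. `decode_html` is a hypothetical helper, not part of feapder or requests. It assumes the page declares its charset in a `<meta>` tag, and it maps gb2312/gbk to the gb18030 superset, which is a design choice of this sketch.

```python
import re

# Matches both <meta charset="..."> and <meta ... content="text/html; charset=...">.
_META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([\w-]+)", re.IGNORECASE)


def decode_html(raw: bytes, default: str = "utf-8") -> str:
    """Hypothetical helper: decode HTML bytes using the charset declared in <meta>."""
    match = _META_CHARSET.search(raw[:4096])  # the declaration normally sits near the top
    charset = match.group(1).decode("ascii").lower() if match else default
    # Assumption of this sketch: treat gb2312/gbk as gb18030, their superset,
    # so characters outside plain GB2312 still decode.
    if charset in ("gb2312", "gbk"):
        charset = "gb18030"
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: fall back without raising.
        return raw.decode(default, errors="replace")


# Hypothetical sample page, encoded the way a GBK-family site might send it.
html = '<html><head><meta charset="gb2312"><title>图书</title></head></html>'.encode("gbk")
text = decode_html(html)
print(text)
```

In the workaround above, this would presumably be used as response.text = decode_html(response1.content), since the workaround already shows response.text being assigned; that usage is untested against the site in the report.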